SIMD parallel MCMC sampling with applications for big-data Bayesian analytics
The computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing application to big-data analytics, necessitate both the identification of opportunities to parallelize techniques such as Markov Chain Monte Carlo (MCMC) sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs so as to push performance toward the compute-bound and/or memory-bound hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations, such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete-valued nodes, concurrent sampling of conditionally independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities for a decent speedup, while a series of high-level source-code and runtime modifications provide a further performance boost by reducing parallelization overhead and increasing data locality on Non-Uniform Memory Access (NUMA) architectures. For big-data Bayesian GLM graphs, the end result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and that comes within striking distance of the memory-bandwidth-induced hardware limit. Using multi-threading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives, which are often less cache-friendly, a likely scenario in modern predictive analytics workflows such as Hierarchical Bayesian GLMs, variable selection, and ensemble regression and classification. The proposed optimization strategies improve the scaling of performance with the number of cores and the width of vector units (and are applicable to many-core SIMD processors such as Intel Xeon Phi and Graphics Processing Units), resulting in cost-effectiveness, energy efficiency (‘green computing’), and higher speed on multi-core x86 processors.
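The first SIMD opportunity named in the abstract, concurrently accumulating child-node (observation) contributions to the conditional posterior in an exchangeable Bayesian GLM, can be illustrated with a minimal OpenMP sketch. This is not the authors' code; the function name `loglike_logistic` and the arguments `n`, `p`, `X`, `y`, `beta` are illustrative assumptions for a logistic-regression log-likelihood, and the paper's further NUMA-locality and overhead-reduction refinements are not reflected here.

```c
/* Hedged sketch: log-likelihood part of the conditional posterior for
 * Bayesian logistic regression. Each observation's contribution is
 * independent (exchangeable), so the outer loop is a parallel reduction
 * across threads, and the inner dot product maps onto SIMD vector units. */
#include <math.h>
#include <stddef.h>
#include <omp.h>

double loglike_logistic(int n, int p,
                        const double *X,    /* n x p design matrix, row-major */
                        const int *y,       /* 0/1 responses */
                        const double *beta) /* coefficient vector, length p */
{
    double ll = 0.0;
    /* Observation i contributes y_i * x_i'beta - log(1 + exp(x_i'beta)). */
    #pragma omp parallel for reduction(+:ll) schedule(static)
    for (int i = 0; i < n; i++) {
        double xb = 0.0;
        #pragma omp simd reduction(+:xb)   /* vectorize the dot product */
        for (int j = 0; j < p; j++)
            xb += X[(size_t)i * p + j] * beta[j];
        /* numerically stable log(1 + exp(xb)) */
        double log1pexp = (xb > 0.0) ? xb + log1p(exp(-xb)) : log1p(exp(xb));
        ll += y[i] * xb - log1pexp;
    }
    return ll;
}
```

Compiled with, for example, `gcc -O3 -fopenmp -march=native`, the outer loop distributes observations across cores while the inner loop uses the vector units, which is the combination of multi-threading and vectorization the abstract refers to.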
Year of publication: 2015
Authors: Mahani, Alireza S.; Sharabiani, Mansour T.A.
Published in: Computational Statistics & Data Analysis. - Elsevier, ISSN 0167-9473. - Vol. 88.2015, C, p. 75-99
Publisher: Elsevier
Subject: GPU | Hierarchical Bayesian | Intel Xeon Phi | Logistic regression | OpenMP | Vectorization
Similar items by subject
- Muresano, Ronal, (2016)
- Accelerating Training Process in Logistic Regression Model using OpenCL Framework / Zahera, Hamada M., (2017)
- A fast estimation procedure for discrete choice random coefficients demand model / Kim, Dong-hyuk, (2017)
- More ...