A parallel random forest classifier for R
ABSTRACT: The statistical language R is favoured by many biostatisticians for processing microarray data. In recent years, the quantity of data that experiments can produce has risen significantly, making previously fast analyses time consuming, or even infeasible with the existing software infrastructure. High Performance Computing (HPC) systems offer a solution to these problems, but at the expense of increased complexity for the end user. The Simple Parallel R Interface (SPRINT) is an R library that aims to reduce the complexity of using HPC systems by providing biostatisticians with drop-in parallelised replacements for existing R functions. In this paper we describe the implementation of a parallel version of the Random Forest classifier in the SPRINT library.
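The key idea behind SPRINT's drop-in replacements is that the user keeps the familiar serial call shape while the library distributes the work behind the scenes. The following Python sketch illustrates that interface idea only; `parallel_map` is a hypothetical stand-in, not SPRINT's actual R API (which uses MPI processes on an HPC cluster), and a thread pool stands in for real distribution to keep the example self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, workers=4):
    """Drop-in style replacement: same call shape as a serial map,
    but the work is fanned out across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

# A caller changes nothing about how results are consumed:
# results = parallel_map(analyse_gene, expression_profiles)
```

The point is that the serial and parallel versions are call-compatible, so existing analysis scripts need no restructuring.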
- Available from: Terence Matthew Sloan
ABSTRACT: R is a free statistical programming language commonly used for the analysis of high-throughput microarray and other data. It is currently unable to easily utilise multiprocessor architectures without substantial changes to existing R scripts. Further, working with large volumes of data often leads to slow processing and even memory allocation faults. A recent survey highlighted clustering algorithms as both computation- and data-intensive bottlenecks in post-genomic data analyses. These algorithms aim to sort numeric vectors (such as gene expression profiles) into groups by minimising vector distances within groups and maximising them between groups. This paper describes the optimisation and parallelisation of a popular clustering algorithm, partitioning around medoids (PAM), for the Simple Parallel R INTerface (SPRINT). SPRINT allows R users to exploit high performance computing systems without expert knowledge of such systems. This paper reports on a serial optimisation of the original code and a subsequent parallel implementation. The parallel implementation enables the processing of data sets that exceed the available physical memory and can yield, depending on the data set, a more than 100-fold increase in performance.
2011 International Conference on High Performance Computing and Simulation (HPCS); 08/2011
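The PAM algorithm described above alternates between assigning each vector to its nearest medoid and re-selecting, within each cluster, the member that minimises the total distance to the others. The Python sketch below is a minimal illustration of that serial algorithm, not the optimised or parallelised SPRINT implementation:

```python
import numpy as np

def pam(points, k, max_iter=100, seed=0):
    """Minimal PAM (k-medoids) sketch: assign points to their nearest
    medoid, then re-choose each cluster's medoid as the member with
    the smallest total distance to the rest of the cluster."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Precompute the pairwise Euclidean distance matrix.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: index of the nearest medoid for each point.
        labels = np.argmin(dist[:, medoids], axis=1)
        # Update step: best medoid within each cluster.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

Because the distance matrix grows quadratically with the number of vectors, it is exactly the structure that exceeds physical memory on large data sets, which is what motivates the paper's out-of-core parallel implementation.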
ABSTRACT: The aim of this work is to propose modifications of the Random Forests algorithm that improve its prediction performance. The suggested modifications intend to increase the strength and decrease the correlation of the individual trees of the forest, and to improve the function that determines how the outputs of the base classifiers are combined. This is achieved by modifying the node splitting and the voting procedure. Different approaches concerning the number of predictors and the evaluation measure that determines the impurity of a node are examined. Regarding the voting procedure, modifications based on feature selection, clustering, nearest neighbours and optimisation techniques are proposed. The novelty of the current work is that it proposes modifications not only to the construction or the voting mechanisms individually but also, for the first time, examines the overall improvement of the Random Forests algorithm (a combination of construction and voting). We evaluate the proposed modifications using 24 datasets. The evaluation demonstrates that the proposed modifications have a positive effect on the performance of the Random Forests algorithm and provide comparable, and in most cases better, results than the existing approaches.
Data & Knowledge Engineering 01/2013; 87:41–65.
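The voting-procedure modifications described above replace the forest's plain majority vote with a more informed combination rule. The Python sketch below contrasts the standard equal-weight majority vote with one hypothetical weighted variant (weights might come, for example, from each tree's out-of-bag accuracy); it is an illustration of the general idea, not the paper's specific methods:

```python
from collections import Counter

def majority_vote(predictions):
    """Standard Random Forests combination: every tree casts one
    equal vote and the most common class label wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted variant: each tree's vote counts in proportion to a
    per-tree weight, so accurate trees can outvote a larger number
    of weaker ones."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```

With equal weights the two rules agree; the weighted rule only changes the outcome when the per-tree weights are informative, which is why such schemes are typically paired with a tree-quality estimate.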
Conference Paper: Learning Random Forests on the GPU
ABSTRACT: Random Forests are a popular and powerful machine learning technique, with several fast multi-core CPU implementations. Since many other machine learning methods have seen impressive speedups from GPU implementations, applying GPU acceleration to random forests seems like a natural fit. Previous attempts to use GPUs have relied on coarse-grained task parallelism and have yielded inconclusive or unsatisfying results. We introduce CudaTree, a GPU Random Forest implementation which adaptively switches between data and task parallelism. We show that, for larger datasets, this algorithm is faster than highly tuned multi-core CPU implementations.
Big Learning 2013: Advances in Algorithms and Data Management, Lake Tahoe; 12/2013