Article

MLlib: Machine Learning in Apache Spark


Abstract

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper, we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
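The DataFrame-based Pipeline API summarized above can be illustrated with a minimal PySpark sketch; the toy schema, column names, and parameter values below are illustrative assumptions rather than anything prescribed by the paper.

```python
# Minimal sketch of an end-to-end MLlib pipeline (DataFrame-based API).
# The toy data, column names, and parameters are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline-sketch").getOrCreate()

# Toy training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.2, 0.5), (1.0, 3.4, 2.1), (0.0, 0.8, 0.3), (1.0, 2.9, 1.8)],
    ["label", "f1", "f2"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    LogisticRegression(maxIter=20, regParam=0.01),
])

model = pipeline.fit(train)          # estimators are fit, transformers applied in order
model.transform(train).select("label", "prediction").show()
```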


... These tools are part of the Hadoop ecosystem [18]. MLlib, a Spark project, encompasses common learning algorithms and statistical utilities for Big Data and machine learning [23]. The ability of Spark to perform in-memory computation and accelerate iterative processes has made algorithms developed for this platform prevalent in the industry [23], [24]. ...
Article
Full-text available
Uplift modeling is a widely recognized predictive approach used to identify individuals who are more likely to respond positively to an intervention or treatment, such as a marketing campaign. However, this approach can be negatively affected by the class-imbalance problem, which occurs when the distribution of target classes is highly skewed. For instance, in a class-imbalanced uplift modeling task, only a small fraction typically responds to a marketing campaign that leads to a purchase. In this paper, we propose a novel resampling scheme that addresses the class-imbalance issue by combining intelligent oversampling and propensity score matching (PSM). By leveraging intelligent oversampling in observational studies, we alleviate the class-imbalance problem and mitigate the negative effects of PSM in terms of information loss. We introduce two efficient resampling schemes that intelligently combine these approaches. To ensure scalability and effectiveness, we adopt a distributed framework based on MapReduce and utilize a hybrid spill trees algorithm for efficient nearest neighbor search. Our experimental results demonstrate the advantages of the proposed method, achieving statistically superior predictive performance compared to other resampling approaches while maintaining efficiency in terms of overall running times.
... The selected dataset in LIBSVM [7] format is read from disk storage twice, following a schema similar to Scikit-Learn [42] and Apache Spark MLlib [37]. ...
... To the best of our knowledge, the best-known open-source ready-to-use solutions for distributed training of logistic regression are Apache Spark [37] and Ray [38]. We report the time required for FedNL, Ray, and Apache Spark to achieve a gradient-norm tolerance of ‖∇f‖ ≈ 10⁻⁹ by configuring the final tolerance of the solvers. ...
... we have chosen to compete against solvers accessible via CVXPY [11] and industrial solutions such as Apache Spark [37] and Ray/Scikit-Learn [38]. Although these frameworks provide effective and robust industrial solutions, they do not fully adhere to FL principles, particularly regarding partial client participation or communication compression. ...
Preprint
Full-text available
Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner, eliminating the need for sharing their local data. The recent work (arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. However, the reference FedNL prototype exhibits three serious practical drawbacks: (i) It requires 4.8 hours to launch a single experiment on a server-grade workstation; (ii) The prototype only simulates the multi-node setting; (iii) Prototype integration into resource-constrained applications is challenging. To bridge the gap between theory and practice, we present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings. Our work resolves the aforementioned issues and reduces the wall-clock time by a factor of 1000. With this, FedNL outperforms alternatives for training logistic regression in a single-node setting (CVXPY, arXiv:1603.00943) and in a multi-node setting (Apache Spark, arXiv:1505.06807; Ray/Scikit-Learn, arXiv:1712.05889). Finally, we propose two practice-oriented compressors for FedNL, adaptive TopLEK and cache-aware RandSeqK, which fulfill the theory of FedNL.
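For context, here is a minimal sketch of how distributed logistic regression might be configured in Spark MLlib with a tight convergence tolerance, in the spirit of the solver comparison above; the data path and parameter values are placeholder assumptions.

```python
# Sketch: logistic regression in Spark MLlib with a tight convergence
# tolerance, roughly mirroring the solver comparison described above.
# The LIBSVM file path below is a placeholder assumption.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("logreg-tolerance-sketch").getOrCreate()

# MLlib reads LIBSVM-formatted data directly into (label, features) rows.
data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

lr = LogisticRegression(
    maxIter=1000,   # allow many iterations so the tolerance is the binding criterion
    tol=1e-9,       # convergence tolerance of the iterative solver
    regParam=0.0,   # no regularisation, as in a plain logistic regression fit
)
model = lr.fit(data)
print(model.summary.totalIterations, model.summary.objectiveHistory[-1])
```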
... Such assumptions, however, have never been tested in a distributed environment. Since the development of the Apache Spark framework and its machine learning library MLlib [12], many researchers have developed distributed sentiment analysis frameworks. For instance, in their study, Nodarakis et al. [15] presented a novel method for sentiment learning in the Spark framework that utilizes hashtags and emoticons as sentiment labels and employs a classification procedure in a parallel and distributed manner, which is proven to be efficient, robust, and scalable through extensive experimental evaluation. ...
... Apache Spark's MLlib (Machine Learning library) [12] is a powerful tool for building scalable machine learning applications in a distributed manner. MLlib provides a rich set of algorithms and utilities for tasks such as classification, regression, and clustering. ...
Article
Sentiment analysis on big data presents unique challenges due to the volume of unstructured data. Traditional single-node systems struggle with this scale, necessitating the use of distributed computing systems like Apache Spark. This study investigates the role of large-scale data preprocessing and feature extraction in sentiment analysis tasks. We conducted a comprehensive set of experiments using four preprocessing techniques and two word vectorization methods to evaluate their impact on the performance of Multi-Layer Perceptrons (MLPs) in Apache Spark. Our results indicate that the choice of preprocessing and feature extraction methods significantly influences model performance. Furthermore, our MLP architecture demonstrated both computational scalability and high accuracy performance in Apache Spark. These findings highlight the importance of large-scale data preprocessing and feature extraction in sentiment analysis on big data, and the effectiveness of using MLPs in Apache Spark for these tasks.
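A hedged sketch of such a Spark MLlib sentiment pipeline (tokenization, stop-word removal, TF-IDF features, and a multilayer perceptron) is shown below; the column names, feature dimensionality, and layer sizes are illustrative assumptions, not the exact setup used in the study.

```python
# Sketch of a sentiment pipeline in Spark MLlib: tokenisation, stop-word
# removal, TF-IDF features, and a multilayer perceptron classifier.
# Toy data, vocabulary size, and layer sizes are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = SparkSession.builder.appName("sentiment-mlp-sketch").getOrCreate()

train_df = spark.createDataFrame(
    [("great product, loved it", 1.0), ("terrible, would not buy again", 0.0)],
    ["text", "label"],
)

num_features = 1024  # hashed TF dimensionality (assumption)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="clean_tokens"),
    HashingTF(inputCol="clean_tokens", outputCol="tf", numFeatures=num_features),
    IDF(inputCol="tf", outputCol="features"),
    # Input layer must match the feature dimensionality; output layer the
    # number of classes (binary sentiment here).
    MultilayerPerceptronClassifier(layers=[num_features, 64, 2], maxIter=100),
])

model = pipeline.fit(train_df)
model.transform(train_df).select("text", "prediction").show(truncate=False)
```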
... It serves as a template that captures the common structure shared among various instances of stack traces. To identify ML-related SO questions, we consider questions related to the use of seven popular Python ML libraries, including TensorFlow (Abadi et al. 2016), Keras (Chollet et al. 2015), Scikit-learn (Pedregosa et al. 2011), PyTorch (Paszke et al. 2019), NLTK (Loper and Bird 2002), HuggingFace (Wolf et al. 2019), and Spark ML (Meng et al. 2016). We focus on the Python language as it is the dominant programming language for ML application development (Atwi et al. 2021;. ...
... It is a data science platform that provides tools for users to build, train, and deploy open-source ML models. -Spark ML (Meng et al. 2016): provides a set of APIs for users to create and tune practical ML pipelines. Occasionally, "Spark ML" refers to the MLlib DataFrame-based API. ...
Article
Full-text available
Machine learning (ML), including deep learning, has recently gained tremendous popularity in a wide range of applications. However, like traditional software, ML applications are not immune to the bugs that result from programming errors. Explicit programming errors usually manifest through error messages and stack traces. These stack traces describe the chain of function calls that lead to an anomalous situation, or exception. Indeed, these exceptions may cross the entire software stack (including applications and libraries). Thus, studying the ML-related patterns in stack traces can help practitioners and researchers understand the causes of exceptions in ML applications and the challenges faced by ML developers. To that end, we mine Stack Overflow (SO) and study 18,538 ML-related stack traces related to seven popular Python ML libraries. First, we observe that ML questions that contain stack traces are less likely to get accepted answers than questions that don’t, even though they gain more attention (i.e., more views and comments). Second, we observe that recurrent patterns exist in ML stack traces, even across different ML libraries, with a small portion of patterns covering many stack traces. Third, we derive five high-level categories and 26 low-level types from the stack trace patterns: most patterns are related to model training, basic Python syntax, parallelization, subprocess invocation, and external module execution. Furthermore, the patterns related to external dependencies (e.g., file operations) or manipulations of artifacts (e.g., model conversion) are among the least likely to get accepted answers on SO. Our findings provide insights for researchers, ML library developers, and technical forum moderators to better support ML developers in writing error-free ML code. For example, future research can leverage the common patterns of stack traces to help ML developers locate solutions to problems similar to theirs or to identify experts who have experience solving similar patterns of problems. Researchers and ML library developers could prioritize efforts to help ML developers identify misuses of ML APIs, mismatches in data formats, and potential data/resource contentions so that ML developers can better avoid/fix model-related exception patterns, data-related exception patterns, and multi-process-related exception patterns, respectively.
... α1, α2, and α3 are the curve-fitting parameters of the utility function to real-world experiments, i.e., the ground truth. Big data platforms, e.g., Apache Mahout [29] and MLlib [30], can be used for running the real-world experiments at scale. In particular, a set of B real-world experiments {(r^(i), τ^(i))}, i = 1, …, B, is executed at varying privacy levels r^(i), resulting in the real-world service quality τ^(i), where r^(i+1) > r^(i) ≥ 0. α1, α2, and α3 are obtained by minimizing the residuals of a nonlinear least-squares fit as follows: ...
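A small sketch of this curve-fitting step is shown below; since the snippet does not give the utility function's closed form, the exponential form, the sample measurements, and the parameter names are assumptions made purely for illustration.

```python
# Sketch of the curve-fitting step described above: parameters a1, a2, a3 of
# a utility function are fit to B measured (privacy level r, service quality
# tau) pairs by nonlinear least squares. The exponential functional form and
# the sample data below are assumptions for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def utility(r, a1, a2, a3):
    # Assumed form: service quality decays with the privacy level r.
    return a1 * np.exp(-a2 * r) + a3

# Hypothetical experiment results (r_i, tau_i), i = 1..B, with r increasing.
r = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
tau = np.array([0.95, 0.80, 0.62, 0.55, 0.48, 0.45])

(a1, a2, a3), _ = curve_fit(utility, r, tau, p0=[1.0, 1.0, 0.0])
print(f"fitted parameters: a1={a1:.3f}, a2={a2:.3f}, a3={a3:.3f}")
```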
Preprint
With the emerging sensing technologies such as mobile crowdsensing and the Internet of Things (IoT), people-centric data can be efficiently collected and used for analytics and optimization purposes. This data is typically required to develop and render people-centric services. In this paper, we address the privacy implications, optimal pricing, and bundling of people-centric services. We first define the inverse correlation between the service quality and privacy level from a data analytics perspective. We then present the profit maximization models of selling standalone, complementary, and substitute services. Specifically, the closed-form solutions of the optimal privacy level and subscription fee are derived to maximize the gross profit of service providers. For interrelated people-centric services, we show that cooperation by service bundling of complementary services is profitable compared to separate sales but detrimental for substitutes. We also show that the market value of a service bundle is correlated with the degree of contingency between the interrelated services. Finally, we incorporate profit sharing models from game theory for dividing the bundling profit among the cooperative service providers.
... The whole pipeline is built on top of Apache Spark and the MLlib [22] machine learning library and is publicly available on GitHub. ...
Preprint
We present a new distributed fuzzy partitioning method to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems. The proposed algorithm builds a fixed number of fuzzy sets for all variables and adjusts their shape and position to the real distribution of training data. A two-step process is applied: 1) transformation of the original distribution into a standard uniform distribution by means of the probability integral transform. Since the original distribution is generally unknown, the cumulative distribution function is approximated by computing the q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy partition in the transformed attribute space using a fixed number of equally distributed triangular membership functions. Despite the aforementioned transformation, the definition of every fuzzy set in the original space can be recovered by applying the inverse cumulative distribution function (also known as quantile function). The experimental results reveal that the proposed methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT) induction algorithm to maintain classification accuracy with up to 6 million fewer leaves.
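A single-attribute sketch of this two-step idea (probability integral transform via q-quantiles, then a Ruspini strong partition of equally spaced triangles) is given below; the data distribution, quantile count, and number of fuzzy sets are illustrative assumptions, and the distributed implementation is not shown.

```python
# Sketch of the two-step fuzzy partitioning idea on a single attribute:
# (1) approximate the CDF with q-quantiles and map values to [0, 1] via the
# probability integral transform; (2) lay out equally spaced triangular
# membership functions (a Ruspini strong partition) in the transformed space.
# Data, quantile count, and number of fuzzy sets are illustrative assumptions.
import numpy as np

def empirical_cdf(x, quantile_grid, quantile_values):
    # Piecewise-linear approximation of the CDF from precomputed q-quantiles.
    return np.interp(x, quantile_values, quantile_grid)

def triangular_partition(u, n_sets):
    # Ruspini strong partition on [0, 1]: triangles with peaks at equally
    # spaced centres; memberships sum to 1 for every u in [0, 1].
    centres = np.linspace(0.0, 1.0, n_sets)
    width = centres[1] - centres[0]
    return np.clip(1.0 - np.abs(u[:, None] - centres[None, :]) / width, 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.lognormal(size=10_000)            # skewed attribute (assumption)

q = 100                                   # number of quantiles (assumption)
grid = np.linspace(0.0, 1.0, q + 1)
quantiles = np.quantile(x, grid)

u = empirical_cdf(x, grid, quantiles)     # step 1: PIT to approximately uniform [0, 1]
mu = triangular_partition(u, n_sets=5)    # step 2: strong fuzzy partition
assert np.allclose(mu.sum(axis=1), 1.0)   # strong partition property holds
```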
... The same argument was proven in [24], where the standard versions of both algorithms were compared on a large list of small synthetic datasets. FS schemes are evaluated using two classification algorithms belonging to the MLlib library [14]: Support Vector Machines (SVM) and Decision Trees (DT). SVMs in Spark internally optimize the hinge loss using the Orthant-Wise Limited-memory Quasi-Newton optimizer, whereas DTs perform recursive binary partitioning, optimizing an information gain measure (Gini impurity or InfoGain). ...
Preprint
With the advent of the Big Data era, data reduction methods are in high demand given their ability to simplify huge data and ease complex learning processes. Concretely, algorithms that are able to filter relevant dimensions from a set of millions are of huge importance. Although effective, these techniques suffer from the "scalability" curse as well. In this work, we propose a distributed feature weighting algorithm, which is able to rank millions of features in parallel using large samples. This method, inspired by the well-known RELIEF algorithm, introduces a novel redundancy elimination measure that provides schemes similar to those based on entropy at a much lower cost. It also allows smooth scale-up when more instances are demanded in feature estimations. Empirical tests of our method show its estimation ability on many huge datasets (both in number of features and instances), as well as its reduced runtime cost (especially at the redundancy detection step).
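For orientation, the classic single-node RELIEF idea that the proposal builds on can be sketched as follows; the dataset, sampling budget, and distance choice are assumptions, and the paper's distributed execution and redundancy-elimination measure are not reproduced here.

```python
# Single-node sketch of the classic RELIEF idea: a feature's weight grows
# when it separates an instance from its nearest miss (different class) and
# shrinks when it differs from its nearest hit (same class). The sampling
# budget and L1 distance are illustrative assumptions; the distributed and
# redundancy-aware extensions from the paper are not shown.
import numpy as np

def relief_weights(X, y, n_samples=200, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    # Scale features to [0, 1] so their difference contributions are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span

    w = np.zeros(d)
    m = min(n_samples, n)
    for i in rng.choice(n, size=m, replace=False):
        dists = np.abs(Xs - Xs[i]).sum(axis=1)     # L1 distances to all instances
        dists[i] = np.inf                          # exclude the instance itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dists, np.inf))
        miss = np.argmin(np.where(diff, dists, np.inf))
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / m
```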
... In this specific work, we focus on using Spark MLlib, a scalable ML library that encompasses various ML algorithms. Spark MLlib's capabilities enable efficient implementation and execution of ML tasks on large datasets, highlighting its significance within the Spark ecosystem for advanced data analysis and predictive modeling [23], [24]. ...
Article
Full-text available
Internet of things (IoT) systems have experienced significant growth in data traffic, resulting in security and real-time processing issues. Intrusion detection systems (IDS) are currently an indispensable tool for self-protection against various attacks. However, IoT systems face serious challenges due to the functional diversity of attacks, which limits machine learning (ML) detection methods based on the static models generated by the linear discriminant analysis (LDA) algorithm. The process entails adjusting the model parameters in real time as new data arrives. This paper proposes a new IDS method based on the LDA algorithm with an incremental model. The model framework is trained and tested on the IoT intrusion dataset (UNSW-NB15) using the streaming linear discriminant analysis (SLDA) ML algorithm. Our approach increases model accuracy after each training pass, resulting in continuous model improvement. The comparison reveals that our dynamic model becomes more accurate after each batch and can detect new types of attacks.
... AutoML automates the process of trying and tuning various machine learning regression models to optimize prediction performance. In this instance, the algorithm explored eXtreme Randomized Trees (XRT), a variant of random forests with increased randomization [45]; Distributed Random Forest (DRF), an implementation of the random forest algorithm for big data [46]; Generalized Linear Models (GLM), which extend linear regression to handle non-normal distributions [47]; Gradient Boosting Machines (GBM), an ensemble technique that builds trees sequentially to correct previous errors [48]; Deep Learning (fully connected neural networks), which use multiple layers of interconnected nodes to learn complex patterns [49]; and a stacked ensemble combining these models, which leverages the strengths of multiple base models [50]. The sole inputs provided to AutoML were the dataset, cross-validation parameters, stopping metric, and stopping time. ...
... Developing and implementing scalable versions of popular ML algorithms that can efficiently process massive datasets in distributed cloud environments is an active area of research. This includes techniques such as distributed stochastic gradient descent, parallel decision tree learning, and scalable clustering algorithms [14]. ...
Article
Full-text available
The Internet of Things (IoT) has revolutionized data collection across various domains, generating massive amounts of heterogeneous data at unprecedented rates. This surge in data volume and velocity presents both opportunities and challenges for data analytics. Cloud computing environments offer a promising solution for processing and analyzing IoT data due to their scalability and resource elasticity. This paper presents a comprehensive review and analysis of scalable machine learning models designed for IoT data analytics in cloud environments. We explore the synergies between IoT, cloud computing, and machine learning, discussing the challenges of processing IoT data at scale and the advantages of cloud-based solutions. The paper examines various machine learning algorithms and architectures optimized for cloud deployment, including distributed learning frameworks, federated learning, and edge-cloud collaborative models. We also present case studies demonstrating the application of these models in real-world IoT scenarios, such as smart cities, industrial IoT, and healthcare. Our findings highlight the importance of scalable machine learning models in extracting valuable insights from IoT data and the role of cloud environments in enabling efficient, large-scale data analytics.
... Each node builds Frequent Pattern Trees (FP-Trees) to compress the transactional database and reduce the generation of candidate itemsets. The FP-Growth algorithm in Spark is extended through the RDD distributed computing model for scaling, which improves the ability and efficiency to process large datasets via parallel computing [26][27][28]. Researchers and developers within and outside the Spark community have also explored memory-optimized data structures, such as the Trie, for frequent itemset mining. ...
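A minimal usage sketch of the FP-Growth implementation in Spark MLlib's DataFrame-based API is shown below; the transactions and thresholds are toy assumptions.

```python
# Sketch of frequent-itemset mining with the FP-Growth implementation in
# Spark MLlib's DataFrame-based API; transactions and thresholds are toy
# assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-sketch").getOrCreate()

transactions = spark.createDataFrame(
    [(0, ["bread", "milk"]),
     (1, ["bread", "butter", "milk"]),
     (2, ["butter", "jam"]),
     (3, ["bread", "milk", "jam"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(transactions)           # builds FP-trees over the partitioned data

model.freqItemsets.show()              # frequent itemsets and their counts
model.associationRules.show()          # rules derived from the frequent itemsets
```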
Article
Full-text available
In this paper, we propose a new method that combines the parallelism of the Spark-based platform with fast frequent mining, called STB_Apriori. Previous research has shown that traditional frequent itemset mining algorithms have high overhead when faced with large datasets and high-dimensional data computation, and generate a large number of candidate itemsets; at the same time, when faced with diverse user requirements, they often generate very sparse and diverse data. In order to solve the problem of fast mining of massive data, our idea draws on the capability of Spark distributed computing and common optimisation ideas in Apriori mining: we use the efficient BitSet operator to achieve transaction compression, bit storage, and data manipulation via Boolean matrices, while parallelising the processing and optimising the algorithmic logic to achieve fast frequent mining. In experiments on real-world datasets, our model consistently outperforms five widely used methods by a significant margin on very large data and maintains its excellence in the remaining cases, proving its effectiveness on real-world tasks, while further analysis shows that increasing the number of distributed nodes also incrementally and continuously improves performance.
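The bitset idea can be sketched on a single node as follows: each item is mapped to a bit vector over transaction ids, so candidate support becomes a bitwise AND plus a popcount. This is only an illustration of the general technique, not the STB_Apriori implementation; the toy transactions are assumptions.

```python
# Sketch of bitset-based support counting: each item maps to a bit vector
# over transactions (a Python int used as a bitset), so the support of a
# candidate itemset is the popcount of the AND of its items' vectors.
# Toy transactions only; this is not the paper's distributed implementation.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

# Vertical representation: item -> bitset of transaction ids containing it.
bitsets = {}
for tid, items in enumerate(transactions):
    for item in items:
        bitsets[item] = bitsets.get(item, 0) | (1 << tid)

def support(itemset):
    acc = ~0  # start with all bits set, intersect the items' transaction sets
    for item in itemset:
        acc &= bitsets.get(item, 0)
    return bin(acc & ((1 << len(transactions)) - 1)).count("1")

print(support({"bread", "milk"}))   # -> 3
print(support({"bread", "jam"}))    # -> 1
```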
... This flexibility results in a proposition that is likely to meet the needs required for data processing for various applications within industries. Looking at the strengths of each tool, organizations can design their pipeline appropriately and get the most out of big data investments [9][10][11][12][13][14][15]. ...
Article
With the rapid increase of data in today's organizations, there is a need for sustainable and effective ETL solutions. The current paper covers a detailed performance evaluation of Hadoop-based tools, such as MapReduce, Oozie, and Spark applications, on large-volume ETL operations.
... The integration stage produces a single tabular output which must be encoded to matrix data to serve as training and test data for the subsequent model training (3). A common abstraction for conducting such feature encoding steps is the so-called estimator/transformer pipeline, which has been popularised by scikit-learn [7] and has also been adopted by dataflow systems for ML like Spark's MLlib [8] or Google's Tensorflow Extended (TFX) platform [9]. The output of the feature encoding stage is training data in matrix form, based on which the typical ML model training and evaluation process can begin. ...
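Below is a sketch of such an estimator/transformer feature-encoding pipeline in Spark MLlib's DataFrame-based API; the column names and toy rows are illustrative assumptions.

```python
# Sketch of an estimator/transformer feature-encoding pipeline: a categorical
# column is indexed and one-hot encoded, then assembled with a numeric column
# into the "features" vector consumed by model training. Column names and
# rows are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.appName("feature-encoding-sketch").getOrCreate()

raw = spark.createDataFrame(
    [("US", 34.0, 1.0), ("DE", 41.0, 0.0), ("US", 29.0, 0.0), ("FR", 52.0, 1.0)],
    ["country", "age", "label"],
)

encode = Pipeline(stages=[
    StringIndexer(inputCol="country", outputCol="country_idx"),        # estimator
    OneHotEncoder(inputCols=["country_idx"], outputCols=["country_oh"]),
    VectorAssembler(inputCols=["country_oh", "age"], outputCol="features"),
])

encoded = encode.fit(raw).transform(raw)   # fit learns the category dictionary
encoded.select("features", "label").show(truncate=False)
```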
Article
Full-text available
Software systems that learn from data via machine learning (ML) are being deployed in increasing numbers in real world application scenarios. These ML applications contain complex data preparation pipelines, which take several raw inputs, integrate, filter and encode them to produce the input data for model training. This is in stark contrast to academic studies and benchmarks, which typically work with static, already prepared datasets. It is a difficult and tedious task to ensure at development time that the data preparation pipelines for such ML applications adhere to sound experimentation practices and compliance requirements. Identifying potential correctness issues currently requires a high degree of discipline, knowledge, and time from data scientists, and they often only implement one-off solutions, based on specialised frameworks that are incompatible with the rest of the data science ecosystem. We discuss how to model data preparation pipelines as dataflow computations from relational inputs to matrix outputs, and propose techniques that use record-level provenance to automatically screen these pipelines for many common correctness issues (e.g., data leakage between train and test data). We design a prototypical system to screen such data preparation pipelines and furthermore enable the automatic computation of important metadata such as group fairness metrics. We discuss how to extract the semantics and the data provenance of common artifacts in supervised learning tasks and evaluate our system on several example pipelines with real-world data.
... MLlib is designed to simplify the ML pipeline in big data and its main functionalities include classification, regression, clustering, collaborative filtering, optimization, and dimensionality reduction (Meng et al., 2016). Spark Streaming allows the use of Spark's API to quickly process data, which can come from different data sources such as HDFS, Flume or Kafka in streaming environments by using mini-batches (García-Gil et al., 2017). ...
Thesis
Natural disasters result in devastating losses in human life, environmental assets, and personal, regional, and national economies. The availability of different big data such as satellite images, Global Positioning System (GPS) traces, mobile Call Detail Records (CDR), social media posts, etc., in conjunction with advances in data analytic techniques (e.g., data mining and big data processing, machine learning and deep learning) can facilitate the extraction of geospatial information that is critical for rapid and effective disaster response. However, disaster response system development usually requires the integration of data from different sources (streaming data sources and data sources at rest) with different characteristics and types, which consequently have different processing needs. Deciding which processing framework to use for specific big data to perform a given task is usually a challenge for researchers from the disaster management field. While many things can be accomplished with population and movement data, for disaster management a key, and arguably the most important, task is to analyze the displacement of the population during and after a disaster. Therefore, in this Licentiate, the knowledge and framework resulting from a literature review were used to select tools and processing strategies to perform population displacement analysis after a disaster. This is a use case of the framework as well as an illustration of the value and challenges (e.g., gaps in data due to power outages) of using CDR data analysis to support disaster management. Using CDR data, the displaced population was inferred by analyzing the variation of home cell-tower for each anonymized mobile phone subscriber before and after a disaster. The effectiveness of the proposed method is evaluated using remote sensing-based building damage assessment data and the Displacement Tracking Matrix (DTM) from individuals’ survey responses at shelters after a severe cyclone in Beira city, central Mozambique, in March 2019. The results show an encouraging correlation coefficient (over 70%) between the number of arrivals in each neighborhood estimated using CDR data and from the DTM. In addition to this, CDR-based analysis derives the spatial distribution of displaced populations with high coverage of people, i.e., including not only people in shelters but everyone who used a mobile phone before and after a disaster. Moreover, results suggest that if CDR data are available after a disaster, population displacement can be estimated and this information can be used for response activities, for example to contribute to reducing waterborne diseases (e.g., diarrheal disease) and diseases associated with crowding (e.g., acute respiratory infections) in shelters and host communities.
... Federated Learning (FL) [1] is a distributed machine learning approach designed to enable multiple parties to jointly train a machine learning model without sharing their data. Unlike traditional distributed learning [2], federated learning shares model parameters instead of data, thus protecting the data privacy of participants. The system primarily consists of two parts: clients, which train local models, and an aggregation server, which aggregates these local models and broadcasts the global model. ...
Preprint
Full-text available
Federated learning is a distributed machine learning method that enables multiple participants to jointly train a machine learning model while preserving data privacy. However, its distributed nature makes federated learning vulnerable to Byzantine attacks, leading to degraded model performance or failure to converge. Existing model poisoning attacks primarily target all model parameter dimensions, which limits attackers in evading server defense methods and reduces the effectiveness of the attack. To address this, we propose a new fragment model poisoning attack method—FMPA. This method focuses on specific dimensions of model parameters, achieving a more concentrated attack to evade defense methods while significantly degrading model performance. Experimental results show that FMPA can effectively impair model performance even in the face of five different Byzantine robustness defense methods.
... These machine learning techniques, when combined with PySpark's streaming capabilities and Hive's data storage, enable the creation of sophisticated, real-time predictive analytics systems [6]. ...
Article
Full-text available
This article investigates the integration of PySpark with Hive data warehouses to enable high-performance real-time analytics. We explore the synergies between PySpark's distributed computing capabilities and Hive's data storage infrastructure, focusing on performance optimization techniques for large-scale data processing.
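A minimal sketch of the PySpark-on-Hive pattern discussed in the article is shown below; the database, table, and column names are placeholder assumptions.

```python
# Sketch of the PySpark-on-Hive pattern: a SparkSession with Hive support
# queries a warehouse table, aggregates it, and writes the result back.
# Database, table, and column names are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("pyspark-hive-sketch")
         .enableHiveSupport()        # use the Hive metastore and warehouse
         .getOrCreate())

events = spark.sql("SELECT user_id, amount, event_time FROM analytics.events")

daily = (events
         .groupBy(F.to_date("event_time").alias("day"))
         .agg(F.sum("amount").alias("total_amount"),
              F.countDistinct("user_id").alias("active_users")))

daily.write.mode("overwrite").saveAsTable("analytics.daily_summary")
```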
... So, IoVs have high requirements for the security of intrusion detection. Given the large traffic volume and multi-dimensional complexity of IoVs, and combining the advantages of deep neural network detection with the speed and effectiveness of distributed parallel computing, the authors of [52] proposed to apply the combined algorithm to the Spark framework for intrusion detection. The CNN-LSTM algorithm model was proposed by combining CNN and LSTM to analyze the NSL-KDD and UNSW-NB15 datasets and minimize security attacks on connected vehicles. ...
Article
Full-text available
Recently, the internet of vehicles (IoVs), mobile edge computing (MEC), and deep learning have attracted many research attentions in the applications of autonomous driving. MEC can help to reduce the network load and transmission delay by offloading the computing tasks to the powerful edge servers while deep learning can effectively improve the accuracy of obstacle detection to enhance the stability and safety of automatic driving. In this article, we first present a comprehensive overview of distributed deep learning based on edge computing over IoVs. Then, the related key techniques, including the distributed characteristics of IoVs, mobile edge collaborative computing architecture for high quality of service (QoS) requirements in terms of vehicle transmission delay and energy consumption, distributed deep learning, and its applications for vehicular networking, are discussed. Finally, the article identifies several important open challenges, opportunities, and potential research directions to provide a reference for readers in this field.
... PyTorch also supports dynamic computation graphs, which can be adjusted in real-time, providing additional flexibility and efficiency. Spark MLlib [64] is a scalable machine learning library built on Apache Spark, offering a unified API for tasks like classification, regression, and clustering. It uses DataFrames for flexible data handling and supports pipelines for streamlined model building. ...
Preprint
The integration of big data into nephrology research has opened new avenues for analyzing and understanding complex biological datasets, driving advancements in personalized management of cardiovascular and kidney diseases. This paper explores the multifaceted challenges and opportunities presented by big data in nephrology, emphasizing the importance of data standardization, sophisticated storage solutions, and advanced analytical methods. We discuss the role of data science workflows, including data collection, preprocessing, integration, and analysis, in facilitating comprehensive insights into disease mechanisms and patient outcomes. Furthermore, we highlight the potential of predictive and prescriptive analytics, as well as the application of large language models (LLMs), in improving clinical decision-making and enhancing the accuracy of disease predictions. The use of high-performance computing (HPC) is also examined, showcasing its critical role in processing large-scale datasets and accelerating machine learning algorithms. Through this exploration, we aim to provide a comprehensive overview of the current state and future directions of big data analytics in nephrology, with a focus on enhancing patient care and advancing medical research.
... On the other hand, large-scale computing naturally makes ML algorithm specification more challenging, particularly in terms of scalable and effective execution [6]. The most common tools for large-scale ML tasks nowadays are large-scale ML libraries such as MLlib (aka SparkML) [7], Mahout [8], and MADlib [9,10]. These libraries offer algorithms with predefined distributed runtime schedules and frequently expose the underlying physical data representation. ...
... Spark has the characteristics of memory computing, distributed computing, integration, etc. It can rely on memory to process data in parallel on the cluster, which can significantly improve the computing speed [17]. Spark also provides a wealth of high-level libraries (such as SparkSQL, GraphX), which simplify data preprocessing, feature extraction and other processes. ...
Article
Full-text available
As data sets and data streams continue to expand, traditional machine learning is becoming less effective in predicting fake news. This paper is a review of deep learning in fake news detection and prevention. The author takes models based on convolutional neural networks as examples to illustrate the principle and application of deep learning in fake news detection, including OPCNN-FAKE, the Dual-channel Convolutional Neural Network with Attention-pooling (DC-CNN) model, which is completely based on the Convolutional Neural Network (CNN), and the Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model, which combines a convolutional neural network with a long short-term memory model. These models have obvious advantages in accuracy over traditional machine learning models. The paper then points out the problems of deep learning in the field of fake news identification: it does not scale well and has slow training speed. The author proposes possible solutions, such as widely using transfer learning and training models on distributed computing platforms such as Spark. The hope is that this review can help research on fake news prediction using deep learning.
... The concatenated output, comprising message-aggregated (AX) matrices and parameters, now generates the embedding matrix Z ∈ R^{|V_I|×D}, processed by the final weight matrix W_z ∈ R^{D(K+1)×D} with nonlinearities σ and ξ. Notably, the elements X_I, A_I^(1) X_I, …, A_I^(K) X_I can be efficiently precomputed and obtained via techniques like Apache Spark [37] or sparse matrix multiplication [42] before entering training, as they do not depend on learnable model parameters and remain static during training. This approach significantly enhances our algorithm's efficiency, which will be discussed later in Sec. 4.3. ...
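The precomputation trick mentioned in the snippet can be sketched with ordinary sparse matrix products; the graph size, feature dimensionality, propagation depth, and (un-normalized) adjacency below are illustrative assumptions, not the paper's setup.

```python
# Sketch of the precomputation trick: the message-aggregated matrices A^(k) X
# depend only on the static graph and features, so they can be computed once
# with sparse matrix products before training starts. Sizes, density, and the
# lack of adjacency normalisation are illustrative assumptions.
import numpy as np
import scipy.sparse as sp

n_cells, n_feats, K = 5_000, 64, 2          # graph size and propagation depth

A = sp.random(n_cells, n_cells, density=1e-3, format="csr")   # sparse adjacency
A = A + A.T                                                    # symmetrise
X = np.random.randn(n_cells, n_feats)

# Precompute [X, A X, A^2 X, ...]; each step reuses the previous product.
propagated = [X]
for _ in range(K):
    propagated.append(A @ propagated[-1])

# The concatenated block matrix is what the model consumes during training.
Z_input = np.hstack(propagated)             # shape: (n_cells, n_feats * (K + 1))
```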
Preprint
Recent advancements in graph-based approaches for multiplexed immunofluorescence (mIF) images have significantly propelled the field forward, offering deeper insights into patient-level phenotyping. However, current graph-based methodologies encounter two primary challenges: (1) Cellular Heterogeneity, where existing approaches fail to adequately address the inductive biases inherent in graphs, particularly the homophily characteristic observed in cellular connectivity; and (2) Scalability, where handling cellular graphs from high-dimensional images faces difficulties in managing a high number of cells. To overcome these limitations, we introduce Mew, a novel framework designed to efficiently process mIF images through the lens of multiplex networks. Mew innovatively constructs a multiplex network comprising two distinct layers: a Voronoi network for geometric information and a Cell-type network for capturing cell-wise homogeneity. This framework equips a scalable and efficient Graph Neural Network (GNN), capable of processing the entire graph during training. Furthermore, Mew integrates an interpretable attention module that autonomously identifies relevant layers for image classification. Extensive experiments on a real-world patient dataset from various institutions highlight Mew's remarkable efficacy and efficiency, marking a significant advancement in mIF image analysis. The source code of Mew can be found at https://github.com/UNITES-Lab/Mew.
... A few machine learning systems share Scikit-learn's constraints, such as MLlib [2]. Though not a "data science language," MLlib is an enterprise-level environment for creating machine learning models for applications, in contrast to Scikit-learn and comparable to ML.NET. ...
Article
Full-text available
Emerging as game-changing technologies, artificial intelligence (AI) and machine learning (ML) have the power to upend a number of industries and inspire creativity. This study offers an extensive analysis of how AI and ML affect innovation. It looks at how various facets of innovation, such as business models, process optimisation, and product development, have been impacted by these technologies. The framework known as ML.NET was created by Microsoft ten years ago with the goal of easing the integration of machine learning models into sizable software programmes. This article presents the framework. We outline its design as well as the application requirements that influenced it. With regard to ML.NET, we specifically describe DataView, the central data abstraction that enables it to reliably and quickly capture whole prediction pipelines throughout the training and inference lifespan. After comparing ML.NET's performance against that of more recent arrivals, we conclude the research with a surprisingly positive analysis and a discussion of key lessons learned.
... Spark supports a wide range of machine learning algorithms through its MLlib library, making it well-suited for predictive modeling tasks. These capabilities enable healthcare organizations to rapidly iterate on complex predictive models, uncovering insights from massive datasets in realtime and facilitating timely, data-driven decisions in patient care [8]. ...
Article
Full-text available
The integration of predictive analytics into personalized medicine has become a promising approach for improving patient outcomes and treatment efficacy. This paper provides a review of the field, examining the tools, methodologies, and challenges associated with this advanced statistical methodology. Predictive analytics leverages machine learning algorithms to analyze vast datasets, including Electronic Health Records (EHRs), genomic data, medical imaging, and real-time data from wearable devices. The review explores key tools such as the Hadoop Distributed File System (HDFS), Apache Spark, and Apache Hive, which facilitate scalable storage, efficient data processing, and comprehensive data analysis. Key challenges identified include managing the immense volume of healthcare data, ensuring data quality and integration, and addressing privacy and security concerns. The paper also highlights the difficulties in achieving real-time data processing and integrating predictive insights into clinical practice. Effective data governance and ethical considerations are critical to maintaining trust and transparency. The strategic use of big data tools, combined with investment in skill development and interdisciplinary collaboration, is essential for harnessing the full potential of predictive analytics in personalized medicine. By overcoming these challenges, healthcare providers can enhance patient care, optimize resource management, and drive medical discoveries, ultimately revolutionizing healthcare delivery on a global scale.
Article
Full-text available
The exponential growth of diverse digital data continues to present significant challenges in efficient storage and meaningful analysis. Apache Spark, with its in-memory cluster computing capabilities, has evolved into a cornerstone solution for effective big data analytics. This study evaluates the analytical performance of Spark's machine learning library (MLlib) using classification algorithms on a real-world banking dataset, while also exploring recent advancements in big data processing and machine learning. Three models (Logistic Regression, Decision Tree, and Random Forest) were trained on the dataset to predict loan approval outcomes, showcasing MLlib's scalability and processing speed. The study demonstrates MLlib's efficiency in parallelizing computation and model training across distributed datasets, making it well-suited for large-scale data processing. Recent developments, including improved integration with deep learning frameworks, enhanced AutoML capabilities, and advancements in real-time processing, are examined. Performance benchmarks are updated to reflect the latest versions of Spark and MLlib, providing current insights into their capabilities. The study's findings align with industry trends, indicating the increasing adoption of Apache Spark and MLlib by enterprises aiming to harness the full potential of big data, particularly in the banking and fintech sectors. By exploring these recent developments and their implications, this research underscores the ongoing significance of Apache Spark MLlib in real-world applications, especially in domains requiring accurate predictive analytics like banking.
Article
Full-text available
In the contemporary landscape of big data, efficiently processing and analyzing vast volumes of information is crucial for organizations seeking actionable insights. Apache Spark has emerged as a leading distributed computing framework that addresses these challenges with its in-memory processing capabilities and scalability. This article explores the implementation of Spark DataFrames as a pivotal tool for advanced data analysis. We delve into how DataFrames provide a higher-level abstraction over traditional RDDs (Resilient Distributed Datasets), enabling more intuitive and efficient data manipulation through a schema-based approach. By integrating SQL-like operations and supporting a wide range of data sources, Spark DataFrames simplify complex analytical tasks. The discussion includes methodologies for setting up the Spark environment, loading diverse datasets into DataFrames, and performing exploratory data analysis and transformations. Advanced techniques such as user-defined functions (UDFs), machine learning integration with MLlib, and real-time analytics using Structured Streaming are examined. Performance optimization strategies, including caching, broadcast variables, and utilizing efficient file formats like Parquet, are highlighted to demonstrate how to enhance processing speed and resource utilization. Through a practical case study, we illustrate the application of these concepts in a real-world scenario, showcasing the effectiveness of Spark DataFrames in handling large-scale data analytics. This comprehensive exploration underscores the significance of adopting Spark DataFrames for organizations aiming to leverage big data effectively, ultimately facilitating faster, more insightful decision-making processes. Keywords: Apache Spark, Spark DataFrames, Big Data Analytics, In-Memory Computation, Advanced Data Analysis.
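A brief sketch combining several of the techniques listed above (Parquet input, caching, a UDF, and aggregation) follows; the storage paths and column names are placeholder assumptions.

```python
# Sketch of common DataFrame techniques: reading Parquet, caching a hot
# DataFrame, applying a user-defined function, and aggregating. File paths
# and column names are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")   # columnar, schema-aware input
orders.cache()                                       # reuse across several queries

# A simple UDF; built-in functions are preferred when available, since UDFs
# bypass Catalyst optimisations.
bucketize = F.udf(lambda amount: "large" if amount > 100 else "small", StringType())

summary = (orders
           .withColumn("size", bucketize(F.col("amount")))
           .groupBy("size")
           .agg(F.count("*").alias("n"), F.avg("amount").alias("avg_amount")))

summary.write.mode("overwrite").parquet("s3://bucket/order_summary/")
```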
Article
Full-text available
Data pipelining is a basic component in managing and processing data at scale, especially in large organizations. Optimal utilization of the pipeline must encompass all aspects that ensure scalability, cost effectiveness, and reliability. Against this background, this research paper focuses on the strategies and best practices for improving pipeline efficiency through design principles, optimization techniques, management of resources, automation, and security. We base our work on recent works and industrial and academic frameworks to examine the impact of emerging technologies and suggest how pipeline performance may be measured and benchmarked with respect to operational improvements and data-driven decision-making.
Article
Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage. Meanwhile, tech companies have built extensive cloud-based services to accelerate training DLRM models at scale. In this paper, we conduct a deep investigation of the DLRM training platforms at AntGroup and reveal two critical challenges: low resource utilization due to suboptimal configurations by users and the tendency to encounter abnormalities due to an unstable cloud environment. To overcome them, we introduce DLRover, an elastic training framework for DLRMs designed to increase resource utilization and handle the instability of a cloud environment. DLRover develops a resource-performance model by considering the unique characteristics of DLRMs and a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs for higher resource utilization. Further, DLRover develops multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs. Our extensive evaluation shows that DLRover reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. DLRover is open-sourced and has been adopted by 10+ companies.
Article
Full-text available
Post-Acute Sequelae of SARS-CoV-2 infection (PASC), also known as Long-COVID, encompasses a variety of complex and varied outcomes following COVID-19 infection that are still poorly understood. We clustered over 600 million condition diagnoses from 14 million patients available through the National COVID Cohort Collaborative (N3C), generating hundreds of highly detailed clinical phenotypes. Assessing patient clinical trajectories using these clusters allowed us to identify individual conditions and phenotypes strongly increased after acute infection. We found many conditions increased in COVID-19 patients compared to controls, and using a novel method to associate patients with clusters over time, we additionally found phenotypes specific to patient sex, age, wave of infection, and PASC diagnosis status. While many of these results reflect known PASC symptoms, the resolution provided by this unprecedented data scale suggests avenues for improved diagnostics and mechanistic understanding of this multifaceted disease.
Preprint
Full-text available
In the modern world, the development of Artificial Intelligence (AI) has contributed to improvements in various areas, including automation, computer vision, fraud detection, and more. AI can be leveraged to enhance the efficiency of Autonomous Smart Traffic Management (ASTM) systems and reduce traffic congestion rates. This paper presents an Autonomous Smart Traffic Management (STM) system that uses AI to improve traffic flow rates. The system employs the YOLO V5 Convolutional Neural Network to detect vehicles in traffic management images. Additionally, it predicts the number of vehicles for the next 12 hours using a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM). The Smart Traffic Management Cycle Length Analysis manages the traffic cycle length based on these vehicle predictions, aided by AI. From the results of the RNN-LSTM model for predicting vehicle numbers over the next 12 hours, we observe that the model predicts traffic with a Mean Squared Error (MSE) of 4.521 vehicles and a Root Mean Squared Error (RMSE) of 2.232 vehicles. After simulating the STM system in the CARLA simulation environment, we found that the Traffic Management Congestion Flow Rate with ASTM (21 vehicles per minute) is 50% higher than the rate without STM (around 15 vehicles per minute). Additionally, the Traffic Management Vehicle Pass Delay with STM (5 seconds per vehicle) is 70% lower than without STM (around 12 seconds per vehicle). These results demonstrate that the STM system using AI can increase traffic flow by 50% and reduce vehicle pass delays by 70%.
Article
Full-text available
Self-driving cars, or autonomous vehicles (AVs), represent a transformative technology with the potential to revolutionize transportation. This review delves into the critical role of 3D object detection in enhancing the safety and efficiency of AVs, emphasizing its significance within the broader context of autonomous driving systems. We provide a comprehensive analysis of methodologies, including deep learning architectures such as Convolutional Neural Networks (CNNs) and recurrent neural networks (RNNs), evaluating their strengths and limitations in the context of 3D object detection. The evolution of benchmark datasets, including KITTI, Waymo, and NuScenes, is discussed, highlighting their importance in advancing detection algorithms and facilitating comparative analyses across various approaches. Key performance evaluation metrics, including Average Precision (AP) and Intersection over Union (IoU), are emphasized as essential tools for assessing detection accuracy. Furthermore, we investigate the integration of computer vision and deep learning techniques in object recognition, showcasing their impact on improving the perceptual capabilities of AVs. The paper also addresses significant challenges in 3D object detection, such as occlusion, scale variation, and the need for real-time processing, while proposing future research directions to overcome these obstacles. This comprehensive survey aims to provide valuable insights for researchers and practitioners, guiding the development of robust 3D object detection systems that are crucial for the safe deployment of autonomous driving technologies.
Article
In the post-Moore era, the main performance gains of black-box optimizers increasingly depend on parallelism, especially for large-scale optimization (LSO). Here we propose to parallelize the well-established covariance matrix adaptation evolution strategy (CMA-ES) and in particular one of its latest LSO variants, called limited-memory CMA-ES (LM-CMA). To achieve efficiency while approximating its powerful invariance property, we present a multilevel learning-based meta-framework for distributed LM-CMA. Owing to its hierarchically organized structure, Meta-ES is well-suited to implement our distributed meta-framework, wherein the outer-ES controls strategy parameters while all parallel inner-ESs run the serial LM-CMA with different settings. For the distribution mean update of the outer-ES, both the elitist and multi-recombination strategies are used in parallel to avoid stagnation and regression, respectively. To exploit spatiotemporal information, the global step-size adaptation combines Meta-ES with parallel cumulative step-size adaptation. After each isolation time, our meta-framework employs both the structure and parameter learning strategy to combine aligned evolution paths for CMA reconstruction. Experiments on a set of large-scale benchmarking functions with memory-intensive evaluations, arguably reflecting many data-driven optimization problems, validate the benefits (e.g., effectiveness w.r.t. solution quality and adaptability w.r.t. second-order learning) and costs of our meta-framework.
Article
Full-text available
Artificial Intelligence (AI) is revolutionizing the healthcare industry by enabling advanced diagnostics, personalized treatment, and efficient operational workflows. The integration of AI in healthcare promises to enhance patient outcomes, streamline clinical processes, and reduce costs. However, the successful implementation of AI in healthcare presents significant data engineering challenges. This paper explores the critical data engineering issues in AI for healthcare, including data heterogeneity, data privacy and security, data quality, and data integration. Additionally, it addresses the complexities of handling large-scale datasets, the need for real-time data processing, and the importance of interoperability between different healthcare systems. Addressing these challenges is essential to harness the full potential of AI in healthcare, ensuring accurate, reliable, and ethical AI-driven solutions. This comprehensive exploration provides insights into the current state of AI in healthcare, highlights key obstacles, and proposes strategies to overcome these barriers, paving the way for a future where AI can be seamlessly integrated into healthcare practices.
Article
Full-text available
In the era of big data, efficient data processing is crucial for timely insights and decision-making. Traditional data pipelines face challenges such as latency, scalability, and fault tolerance. This paper explores the application of machine learning (ML) techniques to optimize data pipeline efficiency. We propose a framework that integrates ML models for predictive resource allocation, anomaly detection, and dynamic scaling within data pipelines. Our experiments demonstrate significant improvements in processing speed, resource utilization, and reliability. Key Words: Data Engineering, Data Pipelines, Machine Learning, Predictive Resource Allocation, Anomaly Detection, Dynamic Scaling
Article
Full-text available
The rapid development of telecommunications services is increasingly attracting millions of users due to the convenience of interaction, promotion and communication. The abundance of daily transaction information has led to the creation of large data sources that are collected over time. This data source is a valuable resource for analyzing and understanding user habits and needs, devising a strategy to maintain and attract potential customers. Therefore, it is necessary to have a suitable system capable of collecting, storing and analyzing large datasets with efficient performance. In this article, we introduce Florus, a big data framework based on Lakehouse architecture, which can tackle these challenges. By applying this framework, we are able to propose an approach to analyzing customer behaviors in the telecommunication industry with a large dataset. Our work focuses on specific analysis of a huge volume of data presented in tables of different schemas, reflecting the business operation over time. Clustering based on the Bisecting K-Means algorithm will support the exploration of customer segments varying in density and complexity, and then characterize them into homogeneous groups to gain a better understanding of the market demand. Furthermore, the enterprise can forecast the revenue income at different levels, which can be applied to every customer. The work was tested with the Gradient Boosted Tree at the end of a data enriching and transformation pipeline. Overall, this work highlights the potential of Florus in supporting customer analysis experiments. Implementing the framework would significantly enhance our ability to conduct comprehensive analyses across the entire customer lifecycle.
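For illustration, the two Spark MLlib building blocks mentioned above, Bisecting K-Means for segmentation and a gradient-boosted tree regressor for revenue forecasting, can be sketched as follows; the toy feature vectors and parameters are assumptions, not the Florus configuration.

```python
# Sketch of the two MLlib components referenced above: Bisecting K-Means for
# customer segmentation and a Gradient-Boosted Tree regressor for revenue
# forecasting. Toy feature vectors and parameters are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.regression import GBTRegressor

spark = SparkSession.builder.appName("segmentation-forecast-sketch").getOrCreate()

# Customer feature vectors (e.g., spend and call counts), assumed prepared upstream.
customers = spark.createDataFrame(
    [(Vectors.dense(120.0, 3.0),), (Vectors.dense(20.0, 1.0),),
     (Vectors.dense(300.0, 9.0),), (Vectors.dense(15.0, 0.0),)],
    ["features"],
)
segmenter = BisectingKMeans(k=2, featuresCol="features", predictionCol="segment")
segments = segmenter.fit(customers).transform(customers)
segments.show()

# Revenue forecasting with a gradient-boosted tree regressor.
revenue = spark.createDataFrame(
    [(Vectors.dense(120.0, 3.0), 45.0), (Vectors.dense(20.0, 1.0), 8.0),
     (Vectors.dense(300.0, 9.0), 110.0), (Vectors.dense(15.0, 0.0), 5.0)],
    ["features", "revenue"],
)
forecaster = GBTRegressor(featuresCol="features", labelCol="revenue", maxIter=20)
revenue_model = forecaster.fit(revenue)
```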
Article
Full-text available
This paper presents a real-time intrusion detection system (IDS) aimed at detecting the Internet of Things (IoT) attacks using multiclass classification models within the PySpark architecture. The research objective is to enhance detection accuracy while reducing the prediction time. Various machine learning algorithms are employed using the OneVsRest (OVR) technique. The proposed method utilizes the IoT-23 dataset, which consists of network traffic from smart home IoT devices, for model development. Data preprocessing techniques, such as data cleaning, transformation, scaling, and the synthetic minority oversampling technique (SMOTE), are applied to prepare the dataset. Additionally, feature selection methods are employed to identify the most relevant features for classification. The performance of the classifiers is evaluated using metrics such as accuracy, precision, recall, and F1 score. The results indicate that among the evaluated algorithms, extreme gradient boosting achieves a high accuracy of 98.89%, while random forest demonstrates the most efficient training and prediction times, with a prediction time of only 0.0311 s. The proposed method demonstrates high accuracy in real-time intrusion detection of IoT attacks, outperforming existing approaches.
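A compact sketch of the OneVsRest setup in Spark MLlib described above follows; the toy rows stand in for preprocessed flows (cleaning, scaling, and SMOTE are assumed to have been applied upstream), and the base classifier and parameters are illustrative assumptions.

```python
# Sketch of OneVsRest multiclass classification in Spark MLlib: a binary base
# classifier is wrapped so one model is trained per class. Toy data stands in
# for preprocessed IoT-23 flows; parameters are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression, OneVsRest

spark = SparkSession.builder.appName("ovr-ids-sketch").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.1, 1.0)), (1.0, Vectors.dense(2.0, 0.2)),
     (2.0, Vectors.dense(0.9, 3.1)), (1.0, Vectors.dense(2.2, 0.1)),
     (0.0, Vectors.dense(0.0, 1.2)), (2.0, Vectors.dense(1.1, 2.9))],
    ["label", "features"],
)

base = LogisticRegression(maxIter=50)    # binary base classifier
ovr = OneVsRest(classifier=base)         # one model per class
model = ovr.fit(train)
model.transform(train).select("label", "prediction").show()
```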
Article
Full-text available
Machine learning (ML) and statistical techniques are key to transforming big data into actionable knowledge. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. In this work, we present our vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers. MLbase provides (1) a simple declarative way to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a new run-time optimized for the data-access patterns of these high-level operators.
Conference Paper
Full-text available
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
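The pattern that makes this abstraction attractive for machine learning is caching a dataset once and reusing it across iterations. The sketch below shows that pattern with a toy logistic-regression gradient loop in PySpark; the file layout, feature dimension, step size, and iteration count are assumptions.

    # Minimal PySpark sketch of iterating over a cached RDD.
    # The CSV layout (label, then 10 features), step size 0.1, and 20
    # iterations are illustrative assumptions.
    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-logreg").getOrCreate()
    sc = spark.sparkContext

    def parse(line):
        parts = [float(x) for x in line.split(",")]
        return np.array(parts[1:]), parts[0]   # (features, label in {0, 1})

    points = sc.textFile("points.csv").map(parse).cache()  # kept in memory
    w = np.zeros(10)

    for _ in range(20):  # each pass reuses the cached partitions
        grad = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p[0].dot(w))) - p[1]) * p[0]
        ).reduce(lambda a, b: a + b)
        w -= 0.1 * grad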
Article
Full-text available
MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
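For comparison with the distributed APIs discussed elsewhere on this page, a minimal example of the fit/predict interface described above might look like the following; the choice of dataset and estimator is arbitrary.

    # Minimal scikit-learn sketch of the consistent fit/predict API,
    # using a bundled toy dataset; the estimator choice is arbitrary.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))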
Conference Paper
In order to recommend products to users we must ultimately predict how a user will respond to a new product. To do so we must uncover the implicit tastes of each user as well as the properties of each product. For example, in order to predict whether a user will enjoy Harry Potter, it helps to identify that the book is about wizards, as well as the user's level of interest in wizardry. User feedback is required to discover these latent product and user dimensions. Such feedback often comes in the form of a numeric rating accompanied by review text. However, traditional methods often discard review text, which makes user and product latent dimensions difficult to interpret, since they ignore the very text that justifies a user's rating. In this paper, we aim to combine latent rating dimensions (such as those of latent-factor recommender systems) with latent review topics (such as those learned by topic models like LDA). Our approach has several advantages. Firstly, we obtain highly interpretable textual labels for latent rating dimensions, which helps us to "justify" ratings with text. Secondly, our approach more accurately predicts product ratings by harnessing the information present in review text; this is especially true for new products and users, who may have too few ratings to model their latent factors, yet may still provide substantial information from the text of even a single review. Thirdly, our discovered topics can be used to facilitate other tasks such as automated genre discovery, and to identify useful and representative reviews.
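The paper's joint model of ratings and review topics is not reproduced here; as a hedged illustration of the latent-factor half only, the sketch below fits collaborative-filtering factors with ALS in Spark MLlib, with assumed column names and hyperparameters.

    # Sketch of the latent-factor side only (not the paper's joint model):
    # matrix factorization with ALS. Columns and hyperparameters are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("latent-factors").getOrCreate()
    ratings = spark.read.parquet("ratings.parquet")  # userId, itemId, rating

    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1)
    model = als.fit(ratings)
    user_factors = model.userFactors   # learned latent user dimensions
    item_factors = model.itemFactors   # learned latent item dimensions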
Conference Paper
Many "big data" applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
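The canonical example of this model is word count. The sketch below expresses it with Spark's RDD operations rather than Google's MapReduce implementation; the input and output paths are placeholders.

    # Word count in the map/reduce style, written against Spark RDDs
    # (not Google's MapReduce). Input and output paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext

    counts = (sc.textFile("docs/*.txt")
                .flatMap(lambda line: line.split())   # map phase: emit words
                .map(lambda word: (word, 1))          # key each word with a count of 1
                .reduceByKey(lambda a, b: a + b))     # reduce phase: sum per key
    counts.saveAsTextFile("wordcounts")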
Article
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.
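PLANET itself runs on Google's internal infrastructure; as a point of comparison only, the sketch below shows distributed tree-ensemble training with Spark MLlib, with assumed column names, input path, and hyperparameters.

    # Sketch of distributed tree-ensemble training in Spark MLlib (shown for
    # comparison; this is not PLANET). Columns and hyperparameters are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("tree-ensemble").getOrCreate()
    df = spark.read.parquet("ad_clicks.parquet")  # hypothetical training data

    assembler = VectorAssembler(inputCols=["impressions", "position", "bid"],
                                outputCol="features")
    train = assembler.transform(df)

    rf = RandomForestClassifier(labelCol="clicked", featuresCol="features",
                                numTrees=50, maxDepth=8)
    model = rf.fit(train)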
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
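A minimal sketch of fitting such a topic model at scale with Spark MLlib's LDA follows; the text column name, input path, vocabulary size, and number of topics are assumptions.

    # Minimal PySpark sketch of fitting an LDA topic model with MLlib.
    # The "text" column, input path, vocabulary size, and k=20 are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, CountVectorizer
    from pyspark.ml.clustering import LDA

    spark = SparkSession.builder.appName("lda-topics").getOrCreate()
    docs = spark.read.parquet("documents.parquet")   # expects a "text" column

    tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
    cv_model = CountVectorizer(inputCol="words", outputCol="features",
                               vocabSize=10000).fit(tokens)
    counts = cv_model.transform(tokens)

    lda = LDA(k=20, maxIter=50)
    model = lda.fit(counts)
    topics = model.describeTopics(maxTermsPerTopic=10)  # top words per topic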
Spark SQL: Relational data processing in Spark
  • Michael Armbrust
  • Reynold Xin
  • Cheng Lian
  • Yin Huai
  • Davies Liu
  • Joseph Bradley
  • Xiangrui Meng
  • Tomer Kaftan
  • Michael Franklin
  • Ali Ghodsi
  • Matei Zaharia
Michael Armbrust, Reynold Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali Ghodsi, and Matei Zaharia. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015.
Topic modeling with LDA: MLlib meets GraphX
  • Joseph Bradley
Joseph Bradley. Topic modeling with LDA: MLlib meets GraphX. https://databricks.com/?p=3135, 2015.
API design for machine learning software: experiences from the scikit-learn project
  • Lars Buitinck
Lars Buitinck et al. API design for machine learning software: experiences from the scikit-learn project. arXiv:1309.0238, 2013.
Introducing streaming k-means in Spark 1.2
  • Jeremy Freeman
Jeremy Freeman. Introducing streaming k-means in Spark 1.2. https://databricks.com/?p=2382, 2015.
GraphX: Graph processing in a distributed dataflow framework
  • Joseph E Gonzalez
  • Reynold S Xin
  • Ankur Dave
  • Daniel Crankshaw
  • Michael J Franklin
  • Ion Stoica
Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, 2014.
ML pipelines: A new high-level API for MLlib
  • Xiangrui Meng
  • Joseph Bradley
  • Evan Sparks
  • Shivaram Venkataraman
Xiangrui Meng, Joseph Bradley, Evan Sparks, and Shivaram Venkataraman. ML pipelines: A new high-level API for MLlib. https://databricks.com/?p=2473, 2015.
TuPAQ: An efficient planner for large-scale predictive analytic queries
  • Evan R. Sparks
  • Ameet Talwalkar
  • Michael J. Franklin
  • Michael I. Jordan
  • Tim Kraska
Evan R. Sparks, Ameet Talwalkar, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. TuPAQ: An efficient planner for large-scale predictive analytic queries. arXiv:1502.00068, 2015.
BerkeleyX CS190-1x: Scalable machine learning
  • Ameet Talwalkar
Ameet Talwalkar. BerkeleyX CS190-1x: Scalable machine learning. https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x, 2015.
Spark 1.1: MLlib performance improvements
  • Burak Yavuz
  • Xiangrui Meng
Burak Yavuz and Xiangrui Meng. Spark 1.1: MLlib performance improvements. https://databricks.com/?p=1393, 2014.