Conference Paper

Comparing Apache Spark and Apache Flink on clustering crystallographic data

Abstract

Data clustering is an important task in data analytics. In recent years, many open-source projects, such as Apache Spark and Apache Flink, have been developed to perform such tasks on big unstructured data. In this work, we cluster crystallographic data with both tools and compare their performance to find out which of them is better suited to data clustering in crystallography. The clustering was carried out with Spark's MLlib and Flink's FlinkML using the K-means clustering algorithm. The results show the difference in performance and resource usage between Apache Spark and Apache Flink.


Iman Imani1, Ali Khaleghi, Hadi Haedar1, Kamran Mahmoudi1, Javad Rahighi2, Tahereh Sadat Parvini2,
Mohsen Akbari2, Morteza Jafarzadeh Khatibani2, Fatemeh Ahmad Mehrabi2, Ali Khalilzadeh2, Pedram
Navidpour2, Samira Mohammadi2
1Department of Computer Engineering, Faculty of Engineering, Imam Khomeini International University, Qazvin, Iran
2Iranian Light Source Facility, Institute for Research in Fundamental Sciences, Tehran
Keywords: Crystallography, Clustering, Big Data, Apache Flink, Apache Spark
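
The abstract names K-means as the clustering algorithm run through Spark's MLlib and Flink's FlinkML. As a rough illustration of what one K-means run does, the following is a minimal single-machine sketch of the standard Lloyd iteration in plain Python. It is not the MLlib or FlinkML implementation, the function names `assign` and `kmeans` are hypothetical, and the sample points are invented for the example rather than taken from the crystallographic datasets.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iteration: assign points, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct input points as a start
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the old centroid if no point was assigned to it.
                updated.append(centroids[i])
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical toy data: two well-separated groups of 2-D points.
sample_points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
centroids, labels = kmeans(sample_points, k=2)
print(centroids, labels)
```

Distributed libraries parallelize the assignment and averaging steps across partitions of the data, so runtime and memory behaviour there depend on the framework rather than on this loop.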








1


2



1


۳

۴
2

۳
۴




۵


۳

۶

۷










۶
1



۱

[Figure: panels labeled Andalusite, Karrooite, Magnesite and MgSi Perov; six dataset labels (DS)]











1.۹2.۴.۳
2



2
CPU:           AMD Ryzen 1600x
Cores:         12
Memory:        8 GB
OS:            Ubuntu 19.04
Java version:  openjdk version "1.8.0_222"

1

۶۳۳۹22
۳۰۴
۳

Dataset   Time (ms)   Time (ms)
DS1       200         1408
DS2       200         1418
DS3       300         1438
DS4       100         1410
DS5       100         1405
DS6       200         1386
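
The per-dataset timings above can be summarized with a few lines of arithmetic. The sketch below is only an illustration: the surviving text does not say which column belongs to Spark and which to Flink, so the columns are given the hypothetical names `col_a` and `col_b`, in the order they appear.

```python
from statistics import mean

# Timings in milliseconds, copied from the listing above.
# The framework behind each column is not recoverable from the source,
# so the tuple positions are simply "first column" and "second column".
timings_ms = {
    "DS1": (200, 1408),
    "DS2": (200, 1418),
    "DS3": (300, 1438),
    "DS4": (100, 1410),
    "DS5": (100, 1405),
    "DS6": (200, 1386),
}

col_a = [first for first, _ in timings_ms.values()]
col_b = [second for _, second in timings_ms.values()]

mean_a = mean(col_a)
mean_b = mean(col_b)
ratio = mean_b / mean_a  # how many times larger the second column is on average

print(f"mean of first column:  {mean_a:.1f} ms")
print(f"mean of second column: {mean_b:.1f} ms")
print(f"ratio (second / first): {ratio:.2f}")
```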











