All content in this area was uploaded by Kamran Mahmoudi on Dec 05, 2019
Comparing Apache Spark and Apache Flink on clustering
crystallographic data
Iman Imani1, Ali Khaleghi, Hadi Haedar1, Kamran Mahmoudi1, Javad Rahighi2, Tahereh Sadat Parvini2,
Mohsen Akbari2, Morteza Jafarzadeh Khatibani2, Fatemeh Ahmad Mehrabi2, Ali Khalilzadeh2, Pedram
Navidpour2, Samira Mohammadi2
1Department of Computer Engineering, Faculty of Engineering, Imam Khomeini International University, Qazvin, Iran
2Iranian Light Source Facility, Institute for Research in Fundamental Sciences, Tehran
Abstract
Data clustering is an important task in data analytics. In recent years, many open-source projects, such as
Apache Spark and Apache Flink, have been developed to perform such tasks on large unstructured data. In
this work, we cluster crystallographic data with both tools and compare their performance to find out
which of them is better suited to data clustering in crystallography. The clustering is carried out with
Spark's MLlib and Flink's FlinkML using the K-means clustering algorithm. The results show the
difference in performance and resource usage between Apache Spark and Apache Flink.
Keywords: Crystallography, Clustering, Big Data, Apache Flink, Apache Spark
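
For readers unfamiliar with the K-means algorithm named in the abstract, the following is a minimal single-machine sketch of its standard iteration (Lloyd's algorithm) in plain Python. It is not the MLlib or FlinkML implementation used in this work; the function names, the random initialization, and the toy points are assumptions made for illustration only.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative Lloyd's iteration; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points sampled from the input.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the iteration has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(sorted(centroids))
```

Distributed libraries parallelize the assignment step across partitions of the data and aggregate partial sums for the update step; the sketch keeps everything in memory to show only the logic.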
[Figure or table residue: sample names Andalusite, Forsterite, Karrooite, Magnesite, MgSi Perov; dataset labels DS]
[Version numbers; surrounding text lost in extraction: 1.9, 2.4.3]
CPU: AMD Ryzen 1600X
Cores: 12
Memory: 8 GB
OS: Ubuntu 19.04
Java version: openjdk version "1.8.0_222"
[Values whose surrounding text was lost in extraction: 633922, 304, 3]
[Results table; column labels lost in extraction, both columns in ms]
Dataset   (ms)   (ms)
DS1       200    1408
DS2       200    1418
DS3       300    1438
DS4       100    1410
DS5       100    1405
DS6       200    1386
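
The twelve values listed above for DS1 to DS6 can be summarized with a few lines of arithmetic. The sketch below computes the mean of each column and the ratio between the two means. Because the column labels did not survive, the code uses the neutral names "first" and "second" and makes no claim about which framework produced which column; the dictionary layout and function name are assumptions for illustration.

```python
from statistics import mean

# Values copied from the results listing, in ms.
# Assumption: each tuple is (first column, second column) for one dataset;
# the mapping of columns to frameworks is not recoverable from the text.
timings_ms = {
    "DS1": (200, 1408),
    "DS2": (200, 1418),
    "DS3": (300, 1438),
    "DS4": (100, 1410),
    "DS5": (100, 1405),
    "DS6": (200, 1386),
}


def summarize(timings):
    """Return the per-column means and the ratio of the second mean to the first."""
    first = [a for a, _ in timings.values()]
    second = [b for _, b in timings.values()]
    return {
        "mean_first_ms": mean(first),
        "mean_second_ms": mean(second),
        "ratio_of_means": mean(second) / mean(first),
    }


summary = summarize(timings_ms)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A ratio of means is only a coarse summary; with six datasets it may be more informative to also inspect the per-dataset ratios.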