Conference Paper

Performance Comparison of Bitcoin Prediction in Big Data Environment

... Over a million Bitcoin-related tweets are available to researchers for processing and application in the field of predicting future Bitcoin prices. Processing such a large volume of Bitcoin-related tweets normally consumes a high level of computer resources (CPU, RAM, memory) and time [31]-[34]. Most previous work focuses on how to reduce resource usage and does not consider maximizing prediction accuracy at the same time. ...
... GPU over CPU [31], [33], GPU-based system for SVM [32], Apache Spark [34], Q-Learning [This paper]. ...
... As the dataset a model must learn from grows, Sumarsih et al. [34] compared GPU performance with an Apache Spark cluster, an in-memory data processing engine that uses RAM instead of disk I/O. Their data processing simulation, which used linear regression (LR) to learn Bitcoin trading, ran faster on the Apache Spark cluster. ...
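To make that comparison concrete, the following is a minimal sketch of how a linear-regression model over Bitcoin trading data could be trained on a Spark cluster with MLlib; the file name and column names are assumptions, since the cited work does not specify its schema.

```python
# Minimal PySpark sketch: linear regression over Bitcoin trading data on a Spark cluster.
# The CSV path and column names ("open", "high", "low", "volume", "close") are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("btc-linear-regression").getOrCreate()

# Load historical trading records into a distributed DataFrame (kept in memory by Spark).
df = spark.read.csv("btc_trades.csv", header=True, inferSchema=True)

# Assemble predictor columns into the single feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["open", "high", "low", "volume"], outputCol="features")
train = assembler.transform(df)

# Fit a linear regression model that predicts the closing price.
lr = LinearRegression(featuresCol="features", labelCol="close")
model = lr.fit(train)
print(model.coefficients, model.intercept)
```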
Preprint
Full-text available
Achieving an accurate Bitcoin price prediction based on people's opinions on Twitter usually requires millions of tweets, different text mining techniques (preprocessing, tokenization, stemming, stop word removal), and a machine learning model to perform the prediction. These attempts lead to the employment of a significant amount of computing power, central processing unit (CPU) utilization, random-access memory (RAM) usage, and time. To address this issue, in this paper we consider a classification of tweet attributes that affect price changes and computer resource usage levels while obtaining an accurate price prediction. To classify the tweet attributes with the greatest effect on price movement, we collect all Bitcoin-related tweets posted in a certain period and divide them into four categories based on the following tweet attributes: (i) the number of followers of the tweet poster, (ii) the number of comments on the tweet, (iii) the number of likes, and (iv) the number of retweets. We separately train and test the Q-learning model with each of the four categorized sets of tweets and find the most accurate prediction among them. In particular, we design several reward functions to improve the prediction accuracy of the Q-learning model. We compare our approach with a classic approach in which all Bitcoin-related tweets are used as input data for the model, by analyzing CPU workload, RAM usage, memory, time, and prediction accuracy. The results show that tweets posted by users with the most followers have the most influence on the future price, and their utilization leads to spending 80% less time, 88.8% less CPU consumption, and achieving 12.5% more accurate predictions compared with the classic approach.
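The abstract does not spell out the state and action encoding, so the sketch below only illustrates the general shape of tabular Q-learning with one possible direction-matching reward; the state discretization and reward function are assumptions, not the authors' design.

```python
# Tabular Q-learning sketch for up/down Bitcoin price prediction.
# States, actions, and the reward function are illustrative assumptions.
import numpy as np

n_states = 10          # e.g. discretized tweet-activity buckets (assumption)
actions = [0, 1]       # 0 = predict price goes down, 1 = predict price goes up
alpha, gamma, eps = 0.1, 0.9, 0.1

Q = np.zeros((n_states, len(actions)))

def reward(action, price_went_up):
    # One possible reward: +1 for a correct direction call, -1 otherwise.
    return 1.0 if action == int(price_went_up) else -1.0

def step(state, next_state, price_went_up, rng):
    # Epsilon-greedy action selection.
    if rng.random() < eps:
        action = int(rng.integers(len(actions)))
    else:
        action = int(np.argmax(Q[state]))
    r = reward(action, price_went_up)
    # Standard Q-learning update rule.
    Q[state, action] += alpha * (r + gamma * Q[next_state].max() - Q[state, action])
```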
... They proposed a deep learning-based random sampling model (RMS) for cryptocurrency time series that are non-stationary. Purbarani et al. [9] also applied Pearson correlation to select the most correlated features and found that the OHLC (open, high, low, close) values were the features most correlated with the weighted price of Bitcoin. ...
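A small sketch of that kind of correlation-based feature ranking is shown below; the file name and column names are assumptions, not the cited study's data.

```python
# Sketch: rank candidate features by Pearson correlation with Bitcoin's weighted price.
# The DataFrame columns and file name are illustrative assumptions.
import pandas as pd

df = pd.read_csv("bitcoin_ohlcv.csv")  # hypothetical file with OHLC, volume, weighted price
corr = df[["open", "high", "low", "close", "volume"]].corrwith(df["weighted_price"])
print(corr.sort_values(ascending=False))  # OHLC columns would be expected to rank highest
```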
Article
Full-text available
Bitcoin price prediction based on people's opinions on Twitter usually requires millions of tweets, different text mining techniques, and a machine learning model to perform the prediction. These attempts lead to the employment of a significant amount of computing power, central processing unit (CPU) utilization, random-access memory (RAM) usage, and time. To address this issue, in this paper we consider a classification of tweet attributes that affect price changes and computer resource usage levels while obtaining an accurate price prediction. To classify the tweet attributes with the greatest effect on price movement, we collect all Bitcoin-related tweets posted in a certain period and divide them into four categories based on the following tweet attributes: (i) the number of followers of the tweet poster, (ii) the number of comments on the tweet, (iii) the number of likes, and (iv) the number of retweets. We separately train and test the Q-learning model with each of the four categorized sets of tweets and find the most accurate prediction among them. We compare our approach with a classic approach in which all Bitcoin-related tweets are used as input data for the model, by analyzing CPU workload, RAM usage, memory, time, and prediction accuracy. The results show that tweets posted by users with the most followers have the most influence on the future price, and their utilization leads to spending 80% less time, 88.8% less CPU consumption, and achieving 12.5% more accurate predictions compared with the classic approach.
Article
Bitcoin is the world’s most traded cryptocurrency and highly popular among cryptocurrency investors and miners. However, its volatility makes it a risky investment, which leads to the need for accurate and fast price-prediction models. This article proposes a Bitcoin price-prediction model using a long short-term memory (LSTM) network in a distributed environment. A tensor processing unit (TPU) has been used to provide the distributed environment for the model. The results show that the TPU-based model performed significantly better than a conventional CPU-based model.
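As a rough sketch of the kind of setup described (not the article's exact architecture or hyperparameters), a Keras LSTM price model can be placed under a TPU distribution strategy in TensorFlow as follows; the window length and layer sizes are assumptions.

```python
# Sketch: training a Keras LSTM price model under a TPU distribution strategy.
# Window length, layer sizes, and optimizer settings are illustrative assumptions.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # "" = auto-detect the attached TPU
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

window = 30  # days of price history per training sample (assumption)
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(window, 1)),
        tf.keras.layers.Dense(1),  # next-day price
    ])
    model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, epochs=..., batch_size=...)  # data preparation omitted
```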
Chapter
The development of blockchain has led to the emergence and widespread use of decentralized cryptocurrencies around the globe. As of 2021, the global market capitalization of cryptocurrencies has crossed two trillion dollars. With increasing popularity and adoption, investors have begun to see cryptocurrencies as an alternative to conventional financial assets. However, the volatility associated with cryptocurrencies makes them a highly risky investment. This gives rise to the need for accurate and efficient price prediction models which can help reduce the risks associated with cryptocurrency investments. The proposed model aims at predicting the prices of two popular cryptocurrencies: Bitcoin and Ethereum. A tensor processing unit (TPU) is used to provide a distributed environment for the proposed model. The results show that the distributed TPU-trained model performed significantly better than the conventional CPU-trained model in terms of training time while maintaining a high degree of accuracy.
Article
Full-text available
Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. Because it is a rapidly evolving open source project with an increasing number of contributors from both academia and industry, it is difficult for researchers, especially beginners in this area, to comprehend the full body of development and research behind Apache Spark. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark offers for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics. [As part of the Springer Nature content sharing initiative, a version of this paper can be accessed on ReadCube: http://rdcu.be/k2YR ]
Article
Full-text available
Ant-based document clustering is a clustering method that measures text document similarity based on the shortest path between nodes (trial phase) and determines the optimal clusters from the sequence of document similarities (dividing phase). The processing time of the Ant algorithm's trial phase for building document vectors is very long because of the high-dimensional Document-Term Matrix (DTM). In this paper, we propose a document clustering method that optimizes dimension reduction using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant algorithms. SVDPCA reduces the size of the DTM by converting the term frequencies of the conventional DTM into principal-component scores of a Document-PC Matrix (DPCM). The Ant algorithm then clusters documents using the vector space model built on the dimension-reduced DPCM. Experimental results on 506 news documents in Indonesian demonstrate that the proposed method reduces dimensionality by up to 99.7%, speeds up the execution time of the trial phase, and maintains a best F-measure of 0.88 (88%).
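A small illustration of the dimension-reduction step is given below, using scikit-learn's truncated SVD as a stand-in for the paper's SVDPCA procedure; the toy corpus and component count are assumptions.

```python
# Sketch: reduce a high-dimensional Document-Term Matrix to a low-dimensional
# document-by-component matrix before clustering. Corpus and n_components are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["placeholder news document one", "placeholder news document two", "placeholder news document three"]

dtm = CountVectorizer().fit_transform(docs)   # sparse Document-Term Matrix (term frequencies)
svd = TruncatedSVD(n_components=2)            # keep only the leading components
dpcm = svd.fit_transform(dtm)                 # dense document-by-component scores

print(dtm.shape, "->", dpcm.shape)            # dimensionality drops from |vocabulary| to n_components
```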
Conference Paper
Full-text available
Analytics algorithms on big data sets require tremendous computational capabilities. Spark is a recent development that addresses big data challenges with data and computation distribution and in-memory caching. However, as a CPU-only framework, Spark cannot leverage GPUs and a growing set of GPU libraries to achieve better performance and energy efficiency. We present HeteroSpark, a GPU-accelerated heterogeneous architecture integrated with Spark, which combines the massive compute power of GPUs and the scalability of CPUs and system memory resources for applications that are both data and compute intensive. We make the following contributions in this work: (1) we integrate the GPU accelerator into the current Spark framework to further leverage data parallelism and achieve algorithm acceleration; (2) we provide a plug-n-play design by augmenting the Spark platform so that current Spark applications can choose to enable/disable GPU acceleration; (3) application acceleration is transparent to developers, so existing Spark applications can be easily ported to this heterogeneous platform without code modifications. The evaluation of HeteroSpark demonstrates up to 18x speedup on a number of machine learning applications.
Article
Full-text available
Size is the first, and at times, the only dimension that leaps out at the mention of big data. This paper attempts to offer a broader definition of big data that captures its other unique and defining characteristics. The rapid evolution and adoption of big data by industry has leapfrogged the discourse to popular outlets, forcing the academic press to catch up. Academic journals in numerous disciplines, which will benefit from a relevant discussion of big data, have yet to cover the topic. This paper presents a consolidated description of big data by integrating definitions from practitioners and academics. The paper's primary focus is on the analytic methods used for big data. A particular distinguishing feature of this paper is its focus on analytics related to unstructured data, which constitute 95% of big data. This paper highlights the need to develop appropriate and efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats. This paper also reinforces the need to devise new tools for predictive analytics for structured big data. The statistical methods in practice were devised to infer from sample data. The heterogeneity, noise, and massive size of structured big data call for developing computationally efficient algorithms that may avoid big data pitfalls, such as spurious correlation.
Conference Paper
Full-text available
Automatic heartbeat classification has recently attracted much research interest, and we are interested in automatically determining the type of arrhythmia from an electrocardiogram (ECG) signal. This paper discusses a new extension of GLVQ that employs a fuzzy logic concept as the discriminant function in order to develop a robust algorithm and improve classification performance. The overall classification system comprises three components: data preprocessing, feature extraction, and classification. Data preprocessing concerns how the initial data are prepared; in this case, we cut the signal beat by beat using the R peak as the pivot point, while for feature extraction we used a wavelet algorithm. The ECG signals were obtained from the MIT-BIH arrhythmia database. Our experiments showed that our proposed method, FN-GLVQ, was able to increase the accuracy of the classifier compared with the original GLVQ, which used Euclidean distance. Using 10-fold cross-validation, the algorithms produced average accuracies of 93.36% and 95.52% for GLVQ and FN-GLVQ, respectively.
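A minimal sketch of the described preprocessing and feature-extraction stages (beat segmentation around R peaks plus wavelet coefficients) might look as follows; the sampling rate, window length, peak-detection heuristic, and wavelet family are assumptions, and the GLVQ/FN-GLVQ classifier itself is not shown.

```python
# Sketch: cut an ECG signal into beats around R peaks and extract wavelet features.
# Sampling rate, window length, and wavelet family are illustrative assumptions.
import numpy as np
import pywt
from scipy.signal import find_peaks

def beat_features(ecg, fs=360, half_window=0.3, wavelet="db4", level=4):
    """Return one wavelet-coefficient feature vector per detected beat."""
    half = int(half_window * fs)                        # samples kept on each side of the R peak
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs))  # crude R-peak detection (assumption)
    features = []
    for p in peaks:
        if p - half < 0 or p + half > len(ecg):
            continue                                    # skip beats too close to the signal edges
        beat = ecg[p - half:p + half]
        coeffs = pywt.wavedec(beat, wavelet, level=level)
        features.append(np.concatenate(coeffs))         # flatten all sub-band coefficients
    return np.array(features)
```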
Conference Paper
The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the "majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.
Article
Logistic regression and linear SVM are useful methods for large-scale classification. However, their distributed implementations have not been well studied. Recently, because of the inefficiency of the MapReduce framework on iterative algorithms, Spark, an in-memory cluster-computing platform, has been proposed. It has emerged as a popular framework for large-scale data processing and analytics. In this work, we consider a distributed Newton method for solving logistic regression as well as linear SVM and implement it on Spark. We carefully examine many implementation issues that significantly affect the running time and propose our solutions. After conducting thorough empirical investigations, we release an efficient and easy-to-use tool for the Spark community.
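For context, Spark MLlib already exposes distributed solvers for both model families mentioned here; the sketch below fits them with MLlib's built-in implementations, not the authors' distributed Newton tool, and the libsvm file name is an assumption.

```python
# Sketch: distributed logistic regression and linear SVM with Spark MLlib.
# These are MLlib's built-in solvers, not the distributed Newton method the article describes.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, LinearSVC

spark = SparkSession.builder.appName("large-scale-classification").getOrCreate()

# Assumed input: a libsvm-format file with a binary label and sparse feature vectors.
train = spark.read.format("libsvm").load("train.libsvm")

log_reg_model = LogisticRegression(maxIter=100, regParam=0.01).fit(train)
svm_model = LinearSVC(maxIter=100, regParam=0.01).fit(train)
```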
Article
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
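To illustrate the kind of end-to-end pipeline MLlib's high-level API supports (the input columns and stages are assumptions chosen to match the tweet-classification theme of the citing paper, not an example taken from the MLlib paper itself):

```python
# Sketch: an end-to-end MLlib Pipeline (tokenize -> hash features -> logistic regression).
# Input column names and stage parameters are illustrative assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=2**16)
log_reg = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, log_reg])
# model = pipeline.fit(train_df)          # train_df: DataFrame with "text" and "label" columns
# predictions = model.transform(test_df)  # applies the same stages to held-out data
```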
Conference Paper
Molecular dynamics simulation is the simulation-based modeling of proteins and certain chemical compounds in the pharmaceutical field, and it is used as a means of drug discovery. This paper proposes a cloud computing model for molecular dynamics simulations using Amber and Gromacs. A cloud computing application can serve as a bridge between molecular dynamics applications running on parallel computing resources and multiplatform clients, so that end users can easily run molecular dynamics simulations.