Conference Paper

Performance Comparison of Bitcoin Prediction in Big Data Environment

... Over a million Bitcoin-related tweets are available to researchers for processing and application in the field of predicting future Bitcoin prices. Processing such a large volume of Bitcoin-related tweets normally consumes a high level of computer resources (CPU, RAM, memory) and time [31]-[34]. Most previous work focuses on how to reduce resource usage and does not consider maximizing prediction accuracy at the same time. ...
... GPU over CPU [31], [33], GPU-based system for SVM [32], Apache Spark [34], Q-Learning [This paper]. ...
... As the dataset a model must learn from grows, Sumarsih et al. [34] compared GPU performance with an Apache Spark cluster, an in-memory data processing engine that uses RAM instead of disk I/O. Their data processing simulation, which used linear regression (LR) to learn Bitcoin trading, ran faster on the Apache Spark cluster. ...
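To make that comparison concrete, the following is a minimal sketch of how a linear-regression model over Bitcoin trading data could be trained on a Spark cluster with MLlib; the file name and column names are assumptions, since the cited work does not specify its schema.

```python
# Minimal PySpark sketch: linear regression over Bitcoin trading data on a Spark cluster.
# The CSV path and column names ("open", "high", "low", "volume", "close") are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("btc-linear-regression").getOrCreate()

# Load historical trading records into a distributed DataFrame (kept in memory by Spark).
df = spark.read.csv("btc_trades.csv", header=True, inferSchema=True)

# Assemble predictor columns into the single feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["open", "high", "low", "volume"], outputCol="features")
train = assembler.transform(df)

# Fit a linear regression model that predicts the closing price.
lr = LinearRegression(featuresCol="features", labelCol="close")
model = lr.fit(train)
print(model.coefficients, model.intercept)
```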
Preprint
Full-text available
Achieving an accurate Bitcoin price prediction based on people's opinions on Twitter usually requires millions of tweets, different text mining techniques (preprocessing, tokenization, stemming, stop word removal), and a machine learning model to perform the prediction. These attempts lead to the employment of a significant amount of computing power, central processing unit (CPU) utilization, random-access memory (RAM) usage, and time. To address this issue, in this paper we consider a classification of tweet attributes that affect price changes and computer resource usage levels while obtaining an accurate price prediction. To classify the tweet attributes with the greatest effect on price movement, we collect all Bitcoin-related tweets posted in a certain period and divide them into four categories based on the following tweet attributes: (i) the number of followers of the tweet poster, (ii) the number of comments on the tweet, (iii) the number of likes, and (iv) the number of retweets. We separately train and test the Q-learning model with each of the four categorized sets of tweets and find the most accurate prediction among them. In particular, we design several reward functions to improve the prediction accuracy of the Q-learning model. We compare our approach with a classic approach in which all Bitcoin-related tweets are used as input data for the model, by analyzing CPU workload, RAM usage, memory, time, and prediction accuracy. The results show that tweets posted by users with the most followers have the most influence on the future price, and their utilization leads to spending 80% less time, 88.8% less CPU consumption, and achieving 12.5% more accurate predictions compared with the classic approach.
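The abstract does not spell out the state and action encoding, so the sketch below only illustrates the general shape of tabular Q-learning with one possible direction-matching reward; the state discretization and reward function are assumptions, not the authors' design.

```python
# Tabular Q-learning sketch for up/down Bitcoin price prediction.
# States, actions, and the reward function are illustrative assumptions.
import numpy as np

n_states = 10          # e.g. discretized tweet-activity buckets (assumption)
actions = [0, 1]       # 0 = predict price goes down, 1 = predict price goes up
alpha, gamma, eps = 0.1, 0.9, 0.1

Q = np.zeros((n_states, len(actions)))

def reward(action, price_went_up):
    # One possible reward: +1 for a correct direction call, -1 otherwise.
    return 1.0 if action == int(price_went_up) else -1.0

def step(state, next_state, price_went_up, rng):
    # Epsilon-greedy action selection.
    if rng.random() < eps:
        action = int(rng.integers(len(actions)))
    else:
        action = int(np.argmax(Q[state]))
    r = reward(action, price_went_up)
    # Standard Q-learning update rule.
    Q[state, action] += alpha * (r + gamma * Q[next_state].max() - Q[state, action])
```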
... They proposed a deep learning-based random sampling model (RMS) for cryptocurrency time series that are non-stationary. Purbarani et al. [9] also applied Pearson correlation to select the most correlated features and found that the OHLC (open, high, low, close) values were the features most correlated with the weighted price of Bitcoin. ...
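A small sketch of that kind of correlation-based feature ranking is shown below; the file name and column names are assumptions, not the cited study's data.

```python
# Sketch: rank candidate features by Pearson correlation with Bitcoin's weighted price.
# The DataFrame columns and file name are illustrative assumptions.
import pandas as pd

df = pd.read_csv("bitcoin_ohlcv.csv")  # hypothetical file with OHLC, volume, weighted price
corr = df[["open", "high", "low", "close", "volume"]].corrwith(df["weighted_price"])
print(corr.sort_values(ascending=False))  # OHLC columns would be expected to rank highest
```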
Article
Full-text available
Bitcoin price prediction based on people's opinions on Twitter usually requires millions of tweets, different text mining techniques, and a machine learning model to perform the prediction. These attempts lead to the employment of a significant amount of computing power, central processing unit (CPU) utilization, random-access memory (RAM) usage, and time. To address this issue, in this paper we consider a classification of tweet attributes that affect price changes and computer resource usage levels while obtaining an accurate price prediction. To classify the tweet attributes with the greatest effect on price movement, we collect all Bitcoin-related tweets posted in a certain period and divide them into four categories based on the following tweet attributes: (i) the number of followers of the tweet poster, (ii) the number of comments on the tweet, (iii) the number of likes, and (iv) the number of retweets. We separately train and test the Q-learning model with each of the four categorized sets of tweets and find the most accurate prediction among them. We compare our approach with a classic approach in which all Bitcoin-related tweets are used as input data for the model, by analyzing CPU workload, RAM usage, memory, time, and prediction accuracy. The results show that tweets posted by users with the most followers have the most influence on the future price, and their utilization leads to spending 80% less time, 88.8% less CPU consumption, and achieving 12.5% more accurate predictions compared with the classic approach.
Article
Bitcoin is the world’s most traded cryptocurrency and highly popular among cryptocurrency investors and miners. However, its volatility makes it a risky investment, which leads to the need for accurate and fast price-prediction models. This article proposes a Bitcoin price-prediction model using a long short-term memory (LSTM) network in a distributed environment. A tensor processing unit (TPU) has been used to provide the distributed environment for the model. The results show that the TPU-based model performed significantly better than a conventional CPU-based model.
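As a rough sketch of the kind of setup described (not the article's exact architecture or hyperparameters), a Keras LSTM price model can be placed under a TPU distribution strategy in TensorFlow as follows; the window length and layer sizes are assumptions.

```python
# Sketch: training a Keras LSTM price model under a TPU distribution strategy.
# Window length, layer sizes, and optimizer settings are illustrative assumptions.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # "" = auto-detect the attached TPU
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

window = 30  # days of price history per training sample (assumption)
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(window, 1)),
        tf.keras.layers.Dense(1),  # next-day price
    ])
    model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, epochs=..., batch_size=...)  # data preparation omitted
```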
Chapter
The development of blockchain has led to the emergence and widespread use of decentralized cryptocurrencies around the globe. As of 2021, the global market capitalization of cryptocurrencies has crossed two trillion dollars. With increasing popularity and adoption, investors have begun to see cryptocurrencies as an alternative to conventional financial assets. However, the volatility associated with cryptocurrencies makes them a highly risky investment. This gives rise to the need for accurate and efficient price prediction models which can help reduce the risks associated with cryptocurrency investments. The proposed model aims at predicting the prices of two popular cryptocurrencies: Bitcoin and Ethereum. A tensor processing unit (TPU) is used to provide a distributed environment for the proposed model. The results show that the distributed TPU-trained model performed significantly better than the conventional CPU-trained model in terms of training time while maintaining a high degree of accuracy.
Article
Full-text available
Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. Because it is a rapidly evolving open source project with an increasing number of contributors from both academia and industry, it is difficult for researchers, especially beginners in this area, to comprehend the full body of development and research behind Apache Spark. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark offers for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics. [As part of the Springer Nature content sharing initiative, a version of this paper can be accessed on ReadCube: http://rdcu.be/k2YR ]
Article
Full-text available
Ant-based document clustering is a clustering method that measures text document similarity based on the shortest path between nodes (trial phase) and determines the optimal clusters from the sequence of document similarities (dividing phase). The processing time of the Ant algorithm's trial phase for building document vectors is very long because of the high-dimensional Document-Term Matrix (DTM). In this paper, we propose a document clustering method that optimizes dimension reduction using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant algorithms. SVDPCA reduces the size of the DTM by converting the term frequencies of the conventional DTM into principal-component scores of a Document-PC Matrix (DPCM). The Ant algorithm then clusters documents using the vector space model built on the dimension-reduced DPCM. Experimental results on 506 news documents in Indonesian demonstrate that the proposed method reduces dimensionality by up to 99.7%, speeds up the execution time of the trial phase, and maintains a best F-measure of 0.88 (88%).
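A small illustration of the dimension-reduction step is given below, using scikit-learn's truncated SVD as a stand-in for the paper's SVDPCA procedure; the toy corpus and component count are assumptions.

```python
# Sketch: reduce a high-dimensional Document-Term Matrix to a low-dimensional
# document-by-component matrix before clustering. Corpus and n_components are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["placeholder news document one", "placeholder news document two", "placeholder news document three"]

dtm = CountVectorizer().fit_transform(docs)   # sparse Document-Term Matrix (term frequencies)
svd = TruncatedSVD(n_components=2)            # keep only the leading components
dpcm = svd.fit_transform(dtm)                 # dense document-by-component scores

print(dtm.shape, "->", dpcm.shape)            # dimensionality drops from |vocabulary| to n_components
```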
Conference Paper
Full-text available
Analytics algorithms on big data sets require tremendous computational capabilities. Spark is a recent development that addresses big data challenges with data and computation distribution and in-memory caching. However, as a CPU-only framework, Spark cannot leverage GPUs and a growing set of GPU libraries to achieve better performance and energy efficiency. We present HeteroSpark, a GPU-accelerated heterogeneous architecture integrated with Spark, which combines the massive compute power of GPUs and the scalability of CPUs and system memory resources for applications that are both data and compute intensive. We make the following contributions in this work: (1) we integrate the GPU accelerator into the current Spark framework to further leverage data parallelism and achieve algorithm acceleration; (2) we provide a plug-n-play design by augmenting the Spark platform so that current Spark applications can choose to enable/disable GPU acceleration; (3) application acceleration is transparent to developers, so existing Spark applications can be easily ported to this heterogeneous platform without code modifications. The evaluation of HeteroSpark demonstrates up to 18x speedup on a number of machine learning applications.
Article
Full-text available
Size is the first, and at times, the only dimension that leaps out at the mention of big data. This paper attempts to offer a broader definition of big data that captures its other unique and defining characteristics. The rapid evolution and adoption of big data by industry has leapfrogged the discourse to popular outlets, forcing the academic press to catch up. Academic journals in numerous disciplines, which will benefit from a relevant discussion of big data, have yet to cover the topic. This paper presents a consolidated description of big data by integrating definitions from practitioners and academics. The paper's primary focus is on the analytic methods used for big data. A particular distinguishing feature of this paper is its focus on analytics related to unstructured data, which constitute 95% of big data. This paper highlights the need to develop appropriate and efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats. This paper also reinforces the need to devise new tools for predictive analytics for structured big data. The statistical methods in practice were devised to infer from sample data. The heterogeneity, noise, and massive size of structured big data call for developing computationally efficient algorithms that may avoid big data pitfalls, such as spurious correlation.
Conference Paper
Full-text available
Automatic heartbeat classification has recently attracted much research interest, and we are interested in automatically determining the type of arrhythmia from an electrocardiogram (ECG) signal. This paper discusses a new extension of GLVQ that employs a fuzzy logic concept as the discriminant function in order to develop a robust algorithm and improve classification performance. The overall classification system comprises three components: data preprocessing, feature extraction, and classification. Data preprocessing concerns how the initial data are prepared; in this case, we cut the signal beat by beat using the R peak as the pivot point, while for feature extraction we used a wavelet algorithm. The ECG signals were obtained from the MIT-BIH arrhythmia database. Our experiments showed that our proposed method, FN-GLVQ, was able to increase the accuracy of the classifier compared with the original GLVQ, which used Euclidean distance. Using 10-fold cross-validation, the algorithms produced average accuracies of 93.36% and 95.52% for GLVQ and FN-GLVQ, respectively.
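A minimal sketch of the described preprocessing and feature-extraction stages (beat segmentation around R peaks plus wavelet coefficients) might look as follows; the sampling rate, window length, peak-detection heuristic, and wavelet family are assumptions, and the GLVQ/FN-GLVQ classifier itself is not shown.

```python
# Sketch: cut an ECG signal into beats around R peaks and extract wavelet features.
# Sampling rate, window length, and wavelet family are illustrative assumptions.
import numpy as np
import pywt
from scipy.signal import find_peaks

def beat_features(ecg, fs=360, half_window=0.3, wavelet="db4", level=4):
    """Return one wavelet-coefficient feature vector per detected beat."""
    half = int(half_window * fs)                        # samples kept on each side of the R peak
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs))  # crude R-peak detection (assumption)
    features = []
    for p in peaks:
        if p - half < 0 or p + half > len(ecg):
            continue                                    # skip beats too close to the signal edges
        beat = ecg[p - half:p + half]
        coeffs = pywt.wavedec(beat, wavelet, level=level)
        features.append(np.concatenate(coeffs))         # flatten all sub-band coefficients
    return np.array(features)
```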
Conference Paper
The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the "majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.
Article
Logistic regression and linear SVM are useful methods for large-scale classification. However, their distributed implementations have not been well studied. Recently, because of the inefficiency of the MapReduce framework on iterative algorithms, Spark, an in-memory cluster-computing platform, has been proposed. It has emerged as a popular framework for large-scale data processing and analytics. In this work, we consider a distributed Newton method for solving logistic regression as well as linear SVM and implement it on Spark. We carefully examine many implementation issues that significantly affect the running time and propose our solutions. After conducting thorough empirical investigations, we release an efficient and easy-to-use tool for the Spark community.
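For context, Spark MLlib already exposes distributed solvers for both model families mentioned here; the sketch below fits them with MLlib's built-in implementations, not the authors' distributed Newton tool, and the libsvm file name is an assumption.

```python
# Sketch: distributed logistic regression and linear SVM with Spark MLlib.
# These are MLlib's built-in solvers, not the distributed Newton method the article describes.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, LinearSVC

spark = SparkSession.builder.appName("large-scale-classification").getOrCreate()

# Assumed input: a libsvm-format file with a binary label and sparse feature vectors.
train = spark.read.format("libsvm").load("train.libsvm")

log_reg_model = LogisticRegression(maxIter=100, regParam=0.01).fit(train)
svm_model = LinearSVC(maxIter=100, regParam=0.01).fit(train)
```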
Article
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.
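To illustrate the kind of end-to-end pipeline MLlib's high-level API supports (the input columns and stages are assumptions chosen to match the tweet-classification theme of the citing paper, not an example taken from the MLlib paper itself):

```python
# Sketch: an end-to-end MLlib Pipeline (tokenize -> hash features -> logistic regression).
# Input column names and stage parameters are illustrative assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=2**16)
log_reg = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, log_reg])
# model = pipeline.fit(train_df)          # train_df: DataFrame with "text" and "label" columns
# predictions = model.transform(test_df)  # applies the same stages to held-out data
```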
Conference Paper
Molecular dynamics simulation is the simulation-based modeling of proteins and certain chemical compounds in the pharmaceutical field, and it is used as a means of drug discovery. This paper proposes a cloud computing model for molecular dynamics simulations using Amber and Gromacs. A cloud computing application can serve as a bridge between molecular dynamics applications running on parallel computing resources and multiplatform clients, so that end users can easily run molecular dynamics simulations.