A Study of Cloud-Based Solution for Data Analytics

Urvashi Gupta and Rohit Sharma
Department of Electronics and Communication Engineering, SRM Institute of Science and Technology, Delhi NCR Campus, Ghaziabad, India

Abstract

In this new age of cutting-edge digital technologies, a gigantic amount of information is being produced rapidly from all walks of life. Industries dealing with large and complex datasets face the need to handle the enormous information generated by different sources. This data exists in all forms (structured, unstructured, and semi-structured) and is produced by sources such as IoT devices, sensors, social media, imaging, and geographical data, as well as day-to-day trials and experiments. This plethora of data accumulated from different sources needs to be collected, studied, and analyzed with statistical tools to perform smarter predictions and analysis. With the introduction of big data technologies, cloud computing, and different types of data analytics techniques, it has become easier to combine real-world data with data generated from scientific experiments, extract meaningful insights, and use them in real-world scenarios. Cloud platforms such as Amazon Web Services (AWS) and Google Cloud Platform (GCP) can be used for data analytics in different industries; their services are used for collecting, processing, storing, and mining various types of data. This chapter studies the provisioning and usage of AWS and GCP cloud-based architectures for building a data analytics platform.
1 Introduction
In this era, organizations worldwide have been integrating new cloud technologies to digitize and analyze data and ultimately gain the predictions and insights that drive their businesses. New opportunities in healthcare, e-commerce, and marketing are emerging as the Internet of Things is integrated with big data analysis and cloud computing [1]. All firms nowadays need to be computerized, but big investments in infrastructure, IT resources (strong connections, computers, programs, and memory), and IT employees to administer them are not always feasible. Cloud computing offers a
Web framework for autonomously gathering, using, and managing computational
resources [2]. Through its tightly connected resources and equipped data centers,
cloud computing offers a wide range of services to users and businesses. Available
resources may be dynamically provided to end users to meet their demands. Users or
organizations can use the whole pool of compute resources without worrying about
the source, scalability, or resource constraints of cloud services. Hence, consumers
and organizations do not need to make any initial investments to access cloud
resources; instead, they only need to pay for the services they use [3, 4].
Cell phones and the IoT (Internet of Things) have facilitated the combination of technology and medicine, developing at a fast pace and continually improving the healthcare delivered to people around the world [5]. Data is proving to be a highly efficient and effective tool for improving health. Advancements in data science and data analytics using AI/ML are being applied to achieve adherence and success in healthcare, telecommunications, BFSI, retail trade, and
e-commerce [6]. This real-world data coming from multiple heterogeneous sources
poses significant challenges in their storage, analysis, and study. Considering the
volume (where and how to store the huge data flowing from multiple sources),
variety (data flowing in all forms, structured and unstructured, etc.), and velocity
(the speed required for the data to be stored and processed) of this data [7], it is
important to use new-age components of cloud computing and big data technologies
for processing [8]. Cloud providers offer a catalogue of components for building an end-to-end encrypted data platform for ingesting, storing, and analyzing data, but the question remains which tools and technologies should be provisioned and how the architecture should be built from these components.
The area of cloud computing has seen remarkable expansion in recent years; according to Gartner [9], public cloud spending was anticipated to reach $411 billion by the end of 2020. As more and more businesses migrate to the cloud, it is getting harder
to choose a cloud service provider that will be suitable for respective needs over
the long run from a wide pool of cloud service providers. There are numerous
cloud service providers now, but there is no common standard and their growth
is happening in parallel. Many of these providers are focused on computing power
and offering end customers CPU, storage, database, and networking services. Some
providers prioritize cost cutting, while others prioritize consistent service quality and adaptability. These aspects have made it extremely difficult to select an appropriate service provider based on the needs of the customer or company [10, 11].
Section 2 of this chapter presents the methodology required for data analytics; it discusses the architecture of AWS- and GCP-based platforms for an overall data analytics solution for industry. Section 3 offers a comparative analysis of the features offered by the two major cloud service providers, Amazon Web Services (AWS) and Google Cloud Platform (GCP). Section 4 covers the major challenges faced in data analytics on cloud platforms: handling, understanding, and reconciling the data flowing from structured and unstructured source systems when adopting a cloud-based data analytics framework for improving business services. Section 5 concludes the chapter.
2 Methodology
The cloud platform plays a vital role in data ingestion. In this section, the architecture of a data analytics platform using AWS and GCP is discussed. A standard data analytics platform can be built using multiple services offered by the AWS and GCP cloud platforms. Figure 1 illustrates the high-level architecture of a data analytics system, showing the various stages of data flow from the data sources to the final visualization layer available to the end users of the analytics system.
Fig. 1 Data visualization block diagram
2.1 Amazon Web Services (AWS) Cloud Platform for Data Analytics
With over 200 fully featured services accessible from data centers all over the globe,
Amazon Web Services (AWS) is the most complete and commonly utilized cloud
platform in the world. Millions of customers, including the fastest-growing start-
ups, biggest enterprises, and top government agencies, rely on AWS to save costs,
enhance agility, and speed up innovation. AWS has more services and features than any other cloud provider, including traditional infrastructure technologies such as computation, storage, and databases, as well as emerging technologies such as machine learning and artificial intelligence, data lakes and analytics, and the Internet of Things. This makes migrating existing applications to the cloud quicker, simpler, and less expensive, as well as building practically anything. AWS offers a
variety of big data analytics frameworks that allow you to quickly and easily create
and deploy big data analytics applications.
The Big Data Analytics Framework (Amazon EMR) for running large-scale distributed data processing jobs delivers capabilities like real-time data processing, clickstream analytics, and big data analytics. It has built-in connectors with other AWS services [12], is simple to use, has a scalable cluster, is highly available, and supports open-source frameworks like Apache Hadoop and Spark [13–16].
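As a concrete illustration, a transient EMR cluster that runs a single Spark step and then terminates can be provisioned through the AWS SDK for Python (boto3). This is a minimal sketch; the cluster name, instance types, S3 paths, and the aggregate.py job script are hypothetical placeholders, not values prescribed by this chapter.

```python
import boto3

# Hypothetical example: provision a transient EMR cluster that runs one Spark
# job and terminates afterwards. Names, roles, and sizes are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-batch-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "clickstream-aggregation",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/aggregate.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```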
The Amazon S3 (Large Data Storage Framework) is a widely utilized, scalable, secure, and reliable big data storage platform. Amazon S3 provides elastic management, flexible administration, and robustness; the framework provides consistency, scalability, and in-place query options; it offers integration with the widest range of vendor solutions; and it offers a broad range of security and compliance capabilities, as well as easier and faster data transport [13–15, 17, 18].
The Amazon Redshift (Data Warehousing Framework) is a fully managed
AWS data warehouse platform that allows for quick, simple, and cost-effective
data analysis using conventional SQL and popular business intelligence tools.
Amazon Redshift’s elastic and scalable cluster provides significant expandability
and compatibility with various SQL clients, as well as fast query performance, cost-
effectiveness, simplicity, and a highly secure foundation [15, 17, 19].
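For instance, a SQL aggregation can be submitted to a Redshift cluster through the Redshift Data API, which avoids managing JDBC/ODBC connections. The sketch below assumes a hypothetical cluster, database, user, and sales table.

```python
import time
import boto3

# Hypothetical example: run a SQL aggregation on a Redshift cluster through
# the Redshift Data API. All identifiers are placeholders.
client = boto3.client("redshift-data", region_name="us-east-1")

run = client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="SELECT region, SUM(order_total) AS revenue FROM sales GROUP BY region;",
)

# Poll until the statement finishes, then fetch the result set.
while True:
    desc = client.describe_statement(Id=run["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    for record in client.get_statement_result(Id=run["Id"])["Records"]:
        print(record)
```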
2.1.1 Architecture Study of a Data Analytics System Using AWS
Amazon offers a world-leading analytics and data science platform with extensively configurable cloud-based services for industries. It provides integrated, sustainable, and scalable analysis and reporting methods for meeting multi-domain industrial needs. The various steps and components involved in using data from various sources for analytics and ML modeling are described below. Figure 2 shows the architecture of Amazon Web Services (AWS).
Fig. 2 Architecture of AWS (Amazon Web Services)
2.1.2 Data Ingestion and Processing
Data required for building a data analytics system comes from a variety of sources. The range of data varies from social media data, images, files, documents, photographs, generic health databases, smartphones, wearable devices, electronic health and medical records, and government agencies [20]. While the data in the Data Lake is stored in a source-native format for easy consumption by the data scientists developing machine learning models, consuming applications such as AWS Glue and AWS EMR jobs transform the heterogeneous data arriving from various sources in different formats into a single homogeneous format to make the best use of it. The data needs to be cleaned, enriched, correlated, and aggregated to be made available to different users for varied use cases in data analytics [21, 22].
The transformation layer is responsible for identifying the structure of the data it needs as output, reading the data from the Data Lake (stored in native formats), and transforming it into the relevant structures determined by the use cases. The data transformation layer is built using Glue and/or EMR, depending on the use case. Glue is used for smaller workloads where the time to process data is short and the processing is primarily batch in nature. AWS EMR is used for any real-time processing and/or batch processes that require larger volumes of data, where the runtime is high and Glue would become cost prohibitive. After the transformation phase, the data is stored in S3 in a processed data store with well-defined structures to enable further use by downstream applications [23].
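A minimal sketch of such a transformation job is shown below in plain PySpark, so that it could run on EMR or be adapted for Glue (which wraps Spark with its own GlueContext). The bucket paths and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical transformation job: read source-native JSON from the landing
# zone, clean and aggregate it, and write Parquet to the processed data store.
spark = SparkSession.builder.appName("raw-to-processed").getOrCreate()

raw = spark.read.json("s3://example-data-lake/landing/events/")

processed = (
    raw.dropna(subset=["event_id", "event_ts"])          # cleaning
       .withColumn("event_date", F.to_date("event_ts"))  # enrichment
       .groupBy("event_date", "event_type")              # aggregation
       .agg(F.count("*").alias("event_count"))
)

processed.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-data-lake/processed/daily_event_counts/"
)
```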
2.1.3 Data Preparation
Data Modelling Different departments and users need access to different data and KPIs at multiple levels of aggregation. The KPIs and the aggregations are computed by the transformation tier, and the data is made available in a ready-to-use format in tables. The dimensional model is designed and deployed using AWS Redshift, which provides parallel query processing, helping to reduce analysis time. The model is designed as per the defined requirements and use cases. Ad hoc query and analysis are facilitated using Redshift Spectrum, to contain costs when a high number of ad hoc queries would otherwise be fired at Athena [23, 24].
Master Data and Metadata The Data Catalogue is the source of information about the data residing in the Data Lake; its metadata records the source and structure of the data. AWS provides Glue as a tool to generate the data source catalogue and update it with changes in the sources [20]. The catalogue created using Glue is available for viewing and updating through the console and SDKs. Apart from Glue, cataloguing would also be maintained using the ETL tool in use.
Data Classification, Encryption, and Access Security considerations are critical when dealing with confidential data such as patient medical records, medical history, kin relationships, and related medical data. Data security in data analytics includes aspects like the following:
- Data encryption in transit and at rest [25, 26]
- Access control
- Network isolation
- Audit trail and access logs [27, 28]
Data at rest in S3 would be encrypted and secured using S3 encryption
capabilities. The encryption is done using the keys from AWS KMS. Access to
the lake and other AWS services is restricted using IAM Roles and Users [29, 30].
The following measures are taken to ensure security and limit access control (a policy sketch follows the list):
- IAM accounts with a least-privilege policy, granting only the rights required for performing the required tasks.
- Password policy and MFA enabled for IAM users.
- Security groups and network access lists created to limit access to resources.
- Access control, bucket policies, and IAM policy-based access to data in S3 buckets.
- AWS CloudTrail responsible for maintaining an audit trail and access logs for all users [24].
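As an illustration of the least-privilege principle above, the following sketch creates a hypothetical IAM policy that grants read-only access to a single processed-data prefix; the bucket and policy names are placeholders.

```python
import json
import boto3

# Hypothetical least-privilege policy: read-only access to one S3 prefix.
iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-data-lake",
            "arn:aws:s3:::example-data-lake/processed/*",
        ],
    }],
}

iam.create_policy(
    PolicyName="DataLakeProcessedReadOnly",
    PolicyDocument=json.dumps(policy_document),
)
```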
Data Storage The data acquired from the data sources is persisted and maintained in a Data Lake in AWS S3. The Data Lake architecture is designed keeping in mind the data formats and the various usages of the Data Lake [31]. The Data Lake is tiered based on the processing stage of the data; a sketch of the zone layout follows below. The different zones built within the lake are as follows:
- Landing Zone: Also called the Bronze Zone, this is where all the raw data from the different source systems lands in its as-is format, and where downstream tools pick up data based on their implementation use cases [32].
- Refinery Zone: An intermediate zone, also called the Silver Zone, with some limited processing such as storage optimization and data cleaning. This is where medical data can be discovered, explored, and experimented with for hypothesis validation and tests [33].
- Production Zone: Also called the Gold Zone, this is where clean, well-structured data is saved in the form best suited to inform crucial business choices and promote efficient operations. An operational data store that feeds standard data warehouses and data marts is frequently included [23].
Transformation rules are applied to the data lying in the Lake to create tables/views used by various users for their respective use cases. All the data in the Lake is stored in compressed formats.
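The zone layout can be realized simply as key prefixes within an S3 bucket. The sketch below is illustrative; the bucket, prefixes, and object keys are assumptions, and a real pipeline would transform the data between zones (e.g., with the Glue/EMR job shown earlier) rather than copy it verbatim.

```python
import boto3

# Hypothetical zone layout within a single Data Lake bucket; the prefixes
# stand in for the Bronze/Silver/Gold tiers described above.
s3 = boto3.client("s3")
BUCKET = "example-data-lake"
ZONES = {"bronze": "landing/", "silver": "refinery/", "gold": "production/"}

# Land a raw file in the Bronze zone in its as-is format.
s3.upload_file("export.csv", BUCKET, ZONES["bronze"] + "crm/2023/06/export.csv")

# Promote an object from Bronze to Silver (real pipelines would clean and
# optimize the content in between).
s3.copy_object(
    Bucket=BUCKET,
    CopySource={"Bucket": BUCKET, "Key": ZONES["bronze"] + "crm/2023/06/export.csv"},
    Key=ZONES["silver"] + "crm/2023/06/export.csv",
)
```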
Data Archival, Backup, and Recovery The platform employs the S3 Glacier
Instant Retrieval storage class for archive material that requires quick access,
such as photographs, audio-visual assets, or genomics data. This is an archive
storage class that provides the lowest cost storage with millisecond retrieval.
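Such archival can be automated with an S3 lifecycle rule; in the hypothetical sketch below, objects under an archive/ prefix transition to Glacier Instant Retrieval after 90 days. The bucket name, prefix, and retention period are assumptions.

```python
import boto3

# Hypothetical lifecycle rule: move objects under archive/ to the
# S3 Glacier Instant Retrieval storage class after 90 days.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-glacier-ir",
            "Filter": {"Prefix": "archive/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER_IR"}],
        }]
    },
)
```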
Data Visualization The dashboards built in Amazon QuickSight can contain important Key Performance Indicators (KPIs) for monitoring variances in the data and their significance against the expected calculations.
2.1.4 AI/ML Workbench
ML Model Evaluation Evaluation of a machine learning model is a major part of the overall AI/ML strategy development process. It is necessary to estimate how well the chosen model represents the captured datasets and how well this data can be used for machine learning model training and output generation in the future [34]. The overall objective is to enhance the efficiency of the model, which requires tuning model parameters to increase accuracy and monitoring the confusion matrix to increase the number of true positives and true negatives.
ML Model Training As mentioned before, training the machine learning model is a crucial step, as it helps in getting the correct outcome. The data is initially split into three sections: training data, validation data, and testing data. The training data is used by the machine learning algorithm to learn how to process the information fed to it and to fit the parameters of the classification algorithm. The validation dataset, a set of data held out from training, is used for cross-validating the skill of the machine learning model on unseen data and for tuning the parameters of the model classifier [34].
The training dataset can be studied using the AWS Athena tool, which can be used to analyze data with interactive SQL queries. Within the overall dataset, the training set is used for building the model, while the testing (or validation) dataset is used to validate the model built. The training dataset and the testing dataset are mutually exclusive so that correlation issues are not created. The validation and testing datasets can be inclusive of each other and can form a single dataset [35].
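A minimal sketch of the three-way split described above, using scikit-learn; the 60/20/20 ratios and the random data are assumptions for illustration, not recommendations from the chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix X and binary labels y.
X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)

# First carve out the held-out test set (20% of the data)...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ...then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```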
ML Model Deployment After the model has been trained and evaluated, it is important to create value from the machine learning code sitting in notebooks. Model deployment can be done by containerizing the application using Docker. This clean, containerized code can be put behind an Application Programming Interface (API) that interfaces with the machine learning models, followed by creation of a front-end application for end users. Amazon SageMaker is an effective cloud computing service that can be used for creating machine learning models on refined datasets and for generating insights and foresights for data analytics [35].
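A hedged sketch of such a deployment with the SageMaker Python SDK follows; the model artifact path, IAM role, inference script, and instance type are hypothetical placeholders. SageMaker containerizes the model and exposes it behind an HTTPS endpoint, matching the containerize-then-serve flow described above.

```python
from sagemaker.sklearn import SKLearnModel

# Hypothetical deployment of a trained scikit-learn model packaged in S3.
model = SKLearnModel(
    model_data="s3://example-bucket/models/churn-model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    entry_point="inference.py",  # script providing model_fn/predict_fn hooks
    framework_version="1.2-1",
)

# Deploy behind a managed HTTPS endpoint.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

print(predictor.predict([[0.4, 0.1, 0.7]]))  # example inference call
predictor.delete_endpoint()                  # clean up when done
```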
Amazon Web Services (AWS) has announced the full launch of Amazon HealthLake, a HIPAA-eligible service for healthcare and life sciences organizations for large-scale ingestion, storage, query, and analysis of their health data. HealthLake enables healthcare organizations to use machine learning to store, transform, and query health data in the cloud to fully understand the health status of patients and populations. This service is part of AWS for Health, which provides a suite of cloud-based services for healthcare, biopharmaceutical, and genomics users to personalize patient care, rapidly innovate, and bring new therapies to market.
2.2 Google Cloud Platform (GCP) for Data Analytics
Google offers a large suite of cloud computing services under the name Google Cloud Platform. GCP includes a wide variety of services for computation, application development, storage, and big data analytics. The GCP suite of services is ever evolving, and Google keeps introducing cloud services based on user demand and to maintain a competitive advantage over AWS and MS Azure. Some standard GCP frameworks and services for big data analytics, big data storage, and big data warehousing are as follows:
- Framework for Big Data Analytics (service name: Google Cloud Dataproc): a fully scalable and automated cloud-based Apache Hadoop and Spark service for speedy, simplified, and economical cluster management operations. It is based on open-source frameworks and offers major features and benefits such as fast cluster scaling and cost-effectiveness [15, 16, 36].
- Framework for Big Data Storage (service name: Google Cloud Storage): an online object storage framework that supports several tasks, such as real-time data processing, data archiving (cold and real-time), and data analytics. It offers high availability at low pricing, along with streamlined data transition and enhanced security for critical resources [15, 18, 36].
- Framework for Big Data Warehousing (service name: Google BigQuery): a highly scalable, fast, and low-price data warehouse for big data analytics. It provides a simple way to set up the infrastructure, which can be scaled up seamlessly and can generate quick insights into the data [15, 36].
2.2.1 Architecture Study of a Data Analytics System Using GCP
Google provides several cloud-based services to help researchers realize the value of data by providing comprehensive insights for IT industry and life sciences solutions. Researchers may use tools like BigQuery, Cloud Machine Learning Engine, and TensorFlow to apply analytics and artificial intelligence to data. Users may utilize these technologies to automatically detect patterns, anticipate clinical outcomes, and evaluate enormous volumes of data efficiently [37, 38]. From machine learning to data analytics, Google Cloud supports the broad range of datasets used in any data analytics system. Figure 3 shows the architecture of Google Cloud Platform (GCP) for data analytics. The various steps and components involved in using data from various sources for analytics and machine learning modelling are described below.
Fig. 3 Architecture of GCP (Google Cloud Platform) for data analytics
2.2.2 Data Ingestion and Processing
The data ingestion methods for Google Cloud Platform (GCP), with descriptions and use cases, are listed in Table 1.
Unstructured and semi-structured data storage is handled by flowing the data into Google Cloud's object storage, Cloud Storage (which provides easy transition between storage classes for any workload, with multiple redundancy options), before sending the data for transformation [36]. Cloud Scheduler is a fully managed, enterprise-grade task scheduler. It enables the scheduling of nearly any process, including batch, data processing, and cloud infrastructure operations. To save time and effort, it can automate everything, including retries in case of failure. Cloud Scheduler may also be used as a single pane of glass to manage all automation jobs in one place [39].
Cloud Pub/Sub [39] is used for messaging and ingestion for event-driven systems and streaming analytics; it enables scalable, in-order message delivery with pull and push modes. Auto-scaling and auto-provisioning from 0 to hundreds of GB per second are supported. It also provides individual quotas and pricing for publishers and subscribers, as well as global message routing to make multi-region systems simpler to operate.
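A minimal publisher sketch using the Pub/Sub client library is shown below; the project ID, topic name, and message payload are hypothetical.

```python
from google.cloud import pubsub_v1

# Hypothetical publisher: stream JSON events into a Pub/Sub topic for
# downstream processing (e.g., by Dataflow).
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "clickstream-events")

future = publisher.publish(
    topic_path,
    data=b'{"user_id": "u123", "action": "page_view"}',
    source="web",  # message attributes are passed as keyword arguments
)
print("Published message ID:", future.result())
```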
Cloud Dataproc [39] is a fully managed and scalable service that runs Apache Spark, Apache Flink, Presto, and more than 30 other open-source tools and frameworks.
Table 1 Data ingestion using GCP

| Method | Description | Use cases |
| --- | --- | --- |
| Data pull from DB | An event-driven script running on Cloud Functions, with timing provided by Cloud Scheduler | Databases that provide an export rather than direct access; data extracted from SaaS applications onto a staging server for pickup |
| Data pull from API | Batch scripts running on GKE or event-driven scripts running on Cloud Functions | SaaS or business applications that expose an API |
| Direct DB | One-time data transfer using Cloud Data Fusion, with subsequent updates through the Debezium connector [39] | Database (backend) connection URLs available; direct DB connections to generic databases |
| Transfer services to Cloud Storage [39] | Based on gsutil commands | High-volume data from smartphones, wearable devices, and search engines |
| On-premises transfer jobs | Install Docker agents on-premises and initiate transfer jobs from Google Cloud | Usually for terabytes of data or more at once |
Table 2 Factors for choosing Dataproc or Dataflow

| Factors | Dataproc | Dataflow |
| --- | --- | --- |
| DevOps requirement | Provisioning clusters manually | Serverless, i.e., automatic provisioning of clusters |
| SDK based on | Apache Spark and Hadoop | Apache Beam |
| Use cases | For the data science/ML ecosystem, largely batch | For batch and stream processing |
| Nature of data | Suitable for small volumes of large chunks of data | Suitable for large volumes of small chunks of data |
Cloud Dataflow [19] is a serverless, fast, and cost-effective unified stream and batch data processing solution based on the Apache Beam SDK. Table 2 depicts the various situations in which the appropriate tool can be selected.
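The following is a minimal Apache Beam pipeline sketch; the same code can run locally on the DirectRunner or on Dataflow by switching the runner option. The bucket paths and the CSV layout are assumptions.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Minimal batch pipeline: count records per user from a CSV file.
options = PipelineOptions(runner="DirectRunner")  # use "DataflowRunner" on GCP

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/events.csv")
        | "ParseUserId" >> beam.Map(lambda line: line.split(",")[0])
        | "CountPerUser" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda user, n: f"{user},{n}")
        | "Write" >> beam.io.WriteToText("gs://example-bucket/processed/counts")
    )
```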
Cloud Composer [39] is a fully managed workflow orchestration solution based on Apache Airflow. It is used to create, schedule, and monitor pipelines in hybrid and multi-cloud settings. Because it is based on Apache Airflow, which runs on Python, it is free of lock-in and simple to use.
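A minimal Airflow DAG of the kind Cloud Composer executes might look like the sketch below; the DAG ID, schedule, and task commands are placeholders for real ingestion and transformation steps.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily pipeline: ingest, transform, then load into the warehouse.
with DAG(
    dag_id="daily_analytics_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'pull data from sources'")
    transform = BashOperator(task_id="transform", bash_command="echo 'run Dataflow job'")
    load = BashOperator(task_id="load", bash_command="echo 'load into BigQuery'")

    ingest >> transform >> load  # define task ordering
```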
2.2.3 Data Preparation
Data Modelling BigQuery [36] is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for corporate flexibility. It analyzes petabytes of data at breakneck rates using ANSI SQL, with no operational overhead (a pay-as-you-use model, with cost mostly determined by the amount of data queried). It democratizes insights by providing a reliable and secure platform that can be integrated with a variety of reporting tools, including Excel and Looker, and most BI tools such as Power BI.
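An ad hoc query against such a model can be issued with the BigQuery client library, as in the sketch below; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

# Hypothetical ad hoc aggregation over a sales table in BigQuery.
client = bigquery.Client(project="example-project")

query = """
    SELECT region, SUM(order_total) AS revenue
    FROM `example-project.sales_mart.orders`
    GROUP BY region
    ORDER BY revenue DESC
"""

for row in client.query(query).result():
    print(row["region"], row["revenue"])
```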
Master Data and Metadata The Data Catalogue is a scalable data discovery and metadata management solution that uses a simple but effective faceted-search interface to find usable data. It is also used to automatically synchronize technical information and generate schematized tags for business metadata. It uses Cloud Data Loss Prevention integration to automatically flag sensitive data and allows users to gain access and grow without having to set up or maintain infrastructure [40].
For data analytics products, Dataproc Metastore is a fully managed, highly available (within a region), self-healing, serverless Apache Hive Metastore (HMS) on Google Cloud [36]. It facilitates interoperability across data processing programs in the open-source data ecosystem, supports HMS, and serves as a vital component for managing the metadata of relational entities. Its uses include the following:
- A centralized metadata store that can be shared among numerous ephemeral Dataproc clusters using open-source engines like Apache Hive, Apache Spark, and Presto.
- Interoperability between cloud-native services like Dataproc and other open source-based partner offerings on Google Cloud, thanks to a single view of tables [41].
Data Classification, Encryption, and Access Data Loss Prevention (DLP) on the
Cloud aids in the understanding, management, and protection of sensitive data.
It can quickly categorize and redact sensitive data in text-based information and
images, including content stored in Google Cloud storage repositories. Google
Cloud [36], without needing any action from the customer, encrypts all client
material kept at rest using one or more encryption algorithms.
For storage, data is separated into chunks, and each chunk is encrypted with
its own data encryption key. These data encryption keys are encrypted (or
“wrapped”) with key encryption keys that are only kept and used inside Google’s
central Key Management Service (GCP KMS) [39].
IAM provides technologies that make managing resource permissions simple and
automated. For data platform administrators, a complete audit trail of permission
authorization, removal, and delegation is instantly displayed.
Data Storage Cloud Storage [18] is the Google Cloud object storage service. It comes with several useful capabilities out of the box, such as object versioning and fine-grained permissions (by object or bucket), which may simplify development and save operating costs. Several Google services are built on top of Google Cloud Storage.
Data Archival, Backup, and Recovery Google Nearline [18] is a public cloud storage service for archiving, backup, and disaster recovery. It is one of the four cloud storage classes of the Google Cloud Storage platform.
Data Visualization Looker [15] is used to serve real-time dashboards (updated every few seconds) for more in-depth, consistent analysis. It provides insights easily (users can slice and dice data dimensions within the visualization layer) and acts as a highly customizable semantic layer for building custom apps that deliver data experiences unique to the business. Data Studio is a free offering from Google that provides community-based connectors to various databases. It is a customizable data transformation and visualization layer, and it is less real-time (updated every few tens of minutes).
AI/ML Workbench There are three options for ML modelling; in each case, model results are stored back in BigQuery (a deployment sketch for the second option follows this list):
- BigQuery [36] can be used to create and execute machine learning models in BigQuery using SQL queries. This ML workbench can be surfaced through the Looker cloud service.
- Build custom ML models using Python libraries in Jupyter Notebooks and deploy the models on Vertex AI. Vertex AI can be used to manage several stages in the ML workflow applicable to the business use cases, such as building a dataset and uploading data, training an ML model on the data, evaluating the model's accuracy, and tuning hyperparameters (custom training only). It can also be used to save and upload models to Vertex AI. The trained model is deployed to an endpoint to serve predictions and receive prediction queries.
- Cloud AutoML makes it possible to train models without writing code on image, tabular, text, and video datasets. It automates data pre-processing, feature extraction, selection, and engineering, algorithm selection, and hyperparameter tweaking. Vertex AI [39] unifies the AutoML and AI Platform APIs, client libraries, and user interface. Both AutoML training and bespoke training are accessible with Vertex AI.
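A hedged sketch of the second option, uploading a custom-trained model to Vertex AI and deploying it to an endpoint, is shown below; the project, region, artifact URI, and serving container are illustrative assumptions.

```python
from google.cloud import aiplatform

# Hypothetical flow: register a custom-trained model and deploy it.
aiplatform.init(project="example-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://example-bucket/models/churn/",  # saved model artifacts
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

endpoint = model.deploy(machine_type="n1-standard-4")

# Send a prediction request to the deployed endpoint.
prediction = endpoint.predict(instances=[[0.4, 0.1, 0.7]])
print(prediction.predictions)
```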
The Google Cloud Healthcare and Life Sciences team has announced a collection of APIs and datastores for healthcare data analysis and machine learning applications, data-level healthcare system integration, and safe storage and retrieval of many forms of electronic protected health information (ePHI). The Google Healthcare API, which focuses on healthcare apps, provides developers a robust toolbox for converting ideas into practical solutions. Google's machine learning offers accurate recognition and can be trained to execute a range of high-accuracy prediction tasks.
3 Comparative Analysis of Services Required from AWS and GCP
Table 3 shows the various stages required for building a cloud-based big data analytics platform and the corresponding services required from the AWS and GCP product suites.

Table 3 Services required from AWS and GCP

| Sr No. | Stage | Requirement | GCP services | AWS services |
| --- | --- | --- | --- | --- |
| A | Data Ingestion and Processing | Integration with existing source systems (internal and external); workflow management and scheduling system to plan data transformation; data acquisition or data delivery jobs; audit and error logs for auditing and troubleshooting; GUI for checking errors, scheduling, and restarting ETL jobs | Dataflow and Dataproc (using the Apache Hadoop framework) | AWS Glue and AWS EMR (using the Apache Hadoop framework) |
| B | Data Preparation: Data Modeling | Develop data models based on data flowing in from various data sources | BigQuery | AWS Redshift |
| | Data Preparation: Master Data and Metadata | Ensure data lineage; data quality, integrity, compliance, and auditing; data governance framework | Cloud Data Catalogue | AWS MDM |
| | Data Preparation: Data Classification, Encryption, and Access | Proper encryption keys for customer datasets; individual user-level, role-based, and group-based access (e.g., cardiology, medicine, cancer); access for the applications which intend to use the data present in the data platform | DLP API, KMS, IAM | AWS KMS and IAM |
| | Data Preparation: Data Storage | Where and how data is being stored; how data is going to be accessed by different user groups; methodology for reconciliation of data from various source systems | Google Cloud Storage (landing zone), BigQuery (Data Lake) | AWS S3 and AWS Redshift |
| C | Data Archival, Backup, and Recovery | Define data backup, archival, and recovery policies based on industry requirements | GCS | AWS Glacier |
| D | Data Visualization | Visualizations and advanced analytics dashboards | Looker | AWS QuickSight |
| E | AI/ML Workbench | Experimentation and model building; model evaluation, deployment, and visualization | Vertex AI | AWS SageMaker |
Other than Amazon Web Services (AWS) and Google Cloud Platform (GCP), big data analytics cloud platform providers such as IBM Cloud [42] and Microsoft Azure [16, 43–46] work in a similar manner to provide data analytics for predictive and prescriptive insights.
4 Challenges
Data collection is one of the primary challenges for any cloud-based big data application for business analytics or IT analytics: the data sources may be incomplete and contain interferences and errors, and consent needs to be taken from organizations or individuals before considering any analytics application that uses personally identifiable data [47]. It is also very important to integrate data generated from external and internal systems in order to arrive at meaningful analytics, and it is difficult to process such heterogeneous data, which varies widely [48]. When integrating data in numerous formats, ensuring accuracy and consistency in decision-making becomes a concern. Processing such varied data while keeping infrastructure costs in check requires a lot of thought at the time of inception, which is time-consuming and extensive [49]. The volume of data poses another challenge: live feeds of data across the globe related to organizations or individuals strain storage and demand highly efficient computational capabilities with high input/output speeds for efficient storage management [50].
Data analysis of such multi-dimensional data also poses a significant challenge if it is not interpreted with appropriate graphical representations; often more than two or three charts are required to derive significant inferences from the datasets across visualizations. Confidential data is very sensitive, and thus data security is a crucial part of building the cloud data platform for data analytics [48]. It is necessary to have proper controls, authentication mechanisms, and encryption of data in place to enhance the security of data available in the cloud.
5 Conclusion
This chapter reviewed the cloud computing services used to build the architecture of a cloud-enabled data platform for business analytics
using two popular cloud service providers, AWS and GCP. During the study, it was concluded that both cloud computing service providers offer their own sets of services for accessing and collating data from the different input data sources used across industries. These services can be easily procured and employed for studying and analyzing the data and then deriving valuable inputs from it, which are used in devising the machine learning model. While AWS is more stable and reliable and offers a more global reach owing to early adoption by industry, GCP is a container-based model with flexible pricing models and enhanced computing capabilities. The mix of services, or the evaluation of a particular platform, can be decided based on the actual advanced analytics use cases to be implemented.
As discussed earlier, both AWS and GCP have come up with targeted solutions for different industries, and the competition will grow over time. The ultimate objective of using a cloud platform for data analytics in industries is to achieve the creative vision necessary to advance IT analytics to the next level of innovation, thus enabling it to both drive and support substantial improvements in quality measures and outcomes.
References
1. Sestino, A., Prete, M. I., Piper, L., & Guido, G. (2020). Internet of Things and Big Data as
enablers for business digitalization strategies. Technovation, 98, 102173, ISSN 0166-4972.
https://doi.org/10.1016/j.technovation.2020.102173
2. Srinivasan, A. Cloud computing. Pearson India ISBN: 9789332537439. Cloud computing:
Concepts, technology & architecture. Prentice Hall Service Technology Series (1st ed.), ISBN-
10: 9780133387520
3. Kavis, J. Architecting the cloud: Design decisions for cloud computing service models (SaaS,
PaaS, and IaaS) (1st ed.). Wiley, ISBN-10: 1118617614.
4. Alreshidi, E. (2019). Comparative review of well known cloud service providers. Science
International (Lahore), 31(8), 65–170, ISSN-1013-5316.
5. Stankovic, J. A. (2016). Research directions for cyber physical systems in wireless and mobile
healthcare. ACM Transactions on Cyber-Physical Systems, 1(1), 1–12.
6. Kune, R., Konugurthi, P. K., Agarwal, A., Chillarige, R. R., & Buyya, R. (2016). The anatomy
of big data computing. Software: Practice and Experience, 46, 79–105. https://doi.org/10.1002/
spe.2374
7. Rizwan, A., Zoha, A., Zhang, R., et al. (2018). A review on the role of Nano communication
in future healthcare systems: A big data analytics perspective. IEEE Access, 6, 41903–41920.
8. Khan, S., Shakil, K. A., & Alam, M. (2018). Cloud-based big data analytics—A survey of
current research and future directions. ©Springer Nature Singapore Pte Ltd. Aggarwal, B.,
et al. (Eds.), Big data analytics. Advances in Intelligent Systems and Computing 654. https://
doi.org/10.1007/978-981-10-6620-7_57
9. https://www.gartner.com/. Accessed 31 Jan 2020.
10. Rajendran, V. V., & Swamynathan, S. (2017). Parameters for comparing cloud service providers: A comparative analysis. IEEE Xplore. https://doi.org/10.1109/CESYS.2016.7889826
11. Dutta, P., & Dutta, P. (2019). Comparative study of cloud services offered by Amazon, Microsoft, and Google. International Journal of Trends in Scientific Research and Development (IJTSRD), 3(3), 981–985.
11. Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud computing: State-of-the-art and research challenges. Journal of Internet Services and Applications, 1(1), 7–18.
12. Practical Amazon EC2, SQS, Kinesis, and S3 (eBook). Springer. https://doi.org/10.1007/978-1-4842-2841-8
13. Pradhananga, Y., Karande, S., & Karande, C. High performance analytics of big
data with dynamic and optimized Hadoop cluster. IEEE. https://doi.org/10.1109/
ICACCCT.2016.7831733
14. Dawelbeit, O., & McCrindle, R. A novel cloud based elastic framework for big data prepro-
cessing. In IEEE Conference Publications. https://doi.org/10.1109/CEEC.2014.6958549
15. Gonzales, J. U., & Krishnan, S. P. T. Building your next big thing with Google Cloud Platform. Springer. https://doi.org/10.1007/978-1-4842-1004-8
16. Singh, M. P., Hoque, M. A., & Tarkoma, S. A survey of systems for massive stream analytics. arXiv:1605.09021v2
17. Ambeth Kumar, V. D., Ashok Kumar, V. D., Divakar, H., & Gokul, R. Cloud enabled media
streaming using Amazon Web Services. IEEE. https://doi.org/10.1109/ICSTM.2017.8089150
18. Subia, S. (2018). Data storage. Springer. https://doi.org/10.1007/978-3-319-21569-3_7
19. Nakhimovsky, A., & Myers, T. Google, Amazon, and beyond: Creating and consuming Web services. Springer. ISBN 9781590591314.
20. Mohanty, H., Bhuyan, P., & Chenthati, D. Chapter 2: Big data architecture. In Big data: A primer. Springer. ISBN 9788132224938.
21. Begam, S. S., Selvachandran, G., Ngan, T. T., & Sharma, R. (2020). Similarity measure of
lattice ordered multi-fuzzy soft sets based on set theoretic approach and its application in
decision making. Mathematics, 8, 1255.
22. Thanh, V.,Rohit, S., Raghvendra, K., Le Hoang, S., Thai, P. B., Dieu, T. B., Ishaani, P., Manash,
S., & Tuong, L. (2020). Crime rate detection using social media of different crime locations and
Twitter part-of-speech tagger with Brown clustering. Journal of Intelligent & Fuzzy Systems,
38, 4287–4299.
23. The Old Bailey and OCR: Benchmarking AWS, Azure, and GCP with 180,000 Page Images
DocEng ‘20: In Proceedings of the ACM Symposium on Document Engineering, September
2020. Article No.: 19, pp. 1–4. https://doi.org/10.1145/3395027.3419595
24. Ta, V.-D., Liu, C.-M., & Nkabinde, G. W. (2016). Big data stream computing in healthcare
real-time analytics. In 2016 IEEE International Conference on Cloud Computing and Big Data
Analysis (ICCCBDA), pp. 37–42, https://doi.org/10.1109/ICCCBDA.2016.7529531
25. Saraswat, M., & Tripathi, R. C. (2020). Cloud computing: Comparison and analysis of
cloud service providers-AWS, Microsoft and Google. In 2020 9th International Conference
System Modeling and Advancement in Research Trends (SMART), pp. 281–285. https://doi.org/
10.1109/SMART50582.2020.9337100
26. Daniels, M., Rose, J., & Farkas, C. (2018). Protecting patients’ data: An efficient method
for health data privacy. In Proceedings of the 13th International Conference on Availability,
Reliability and Security. ACM, p. 9.
27. He, Z., Cai, Z., Sun, Y., et al. (2017). Customized privacy preserving for inherent data and
latent data. Personal and Ubiquitous Computing, 21(1), 43–54.
28. Mahmoud, M. M. E., Rodrigues, J. J. P. C., Ahmed, S. H., et al. (2018). Enabling technologies
on cloud of things for smart healthcare. IEEE Access, 6, 31950–31967.
29. Nguyen, P. T., Ha, D. H., Avand, M., Jaafari, A., Nguyen, H. D., Al-Ansari, N., Van Phong, T.,
Sharma, R., Kumar, R., Le, H. V., Ho, L. S., Prakash, I., & Pham, B. T. (2020). Soft computing
ensemble models based on logistic regression for groundwater potential mapping. Applied
Sciences, 10, 2469.
30. Jha, S., et al. (2019). Deep learning approach for software maintainability metrics prediction.
IEEE Access, 7, 61840–61855.
31. Emara, K. (2017). Safety-aware location privacy in VANET: Evaluation and comparison. IEEE
Transactions on Vehicular Technology, 66(12), 10718–10731.
32. Sharma, R., Kumar, R., Sharma, D. K., Son, L. H., Priyadarshini, I., Pham, B. T., Bui, D. T.,
& Rai, S. (2019). Inferring air pollution from air quality index by different geographical areas:
Case study in India. Air Quality, Atmosphere & Health, 12, 1347–1357.
33. Sharma, R., Kumar, R., Singh, P. K., Raboaca, M. S., & Felseghi, R.-A. (2020). A systematic
study on the analysis of the emission of CO, CO2 and HC for four-wheelers and its impact on
the sustainable ecosystem. Sustainability, 12, 6707.
34. Xiao, W., Miao, Y., Fortino, G., Wu, D., Chen, M., & Hwang, K. (2022). Collaborative cloud-
edge service cognition framework for DNN configuration toward smart IIoT. IEEE Transac-
tions on Industrial Informatics, 18(10), 7038–7047. https://doi.org/10.1109/TII.2021.3105399
35. Erhan, L., Ndubuaku, M.U., Mauro, M.D., Song, W., Chen, M., Fortino, G., Bagdasar, O.,
& Liotta, A. (2020). Smart anomaly detection in sensor systems: A multi-perspective review.
arXiv: Learning.
36. Krishnan, S. P. T., & Ugia Gonzalez, J. L. Google BigQuery. Springer. https://doi.org/10.1007/978-1-4842-1004-8_10
37. Dansana, D., Kumar, R., Das Adhikari, J., Mohapatra, M., Sharma, R., Priyadarshini, I., &
Le, D.-N. (2020). Global forecasting confirmed and fatal cases of COVID-19 outbreak using
autoregressive integrated moving average model. Frontiers in Public Health, 8, 580327. https:/
/doi.org/10.3389/fpubh.2020.580327
38. Malik, P. K., Sharma, R., Singh, R., Gehlot, A., Satapathy, S. C., Alnumay, W. S., Pelusi, D.,
Ghosh, U., & Nayak, J. (2021). Industrial Internet of Things and its applications in industry 4.0:
State of the art. Computer Communications, 166, 125–139, ISSN 0140-3664. https://doi.org/
10.1016/j.comcom.2020.11.016
39. Gupta, Y. K., & Mittal, T. (2020). Comparative study of Apache Pig & Apache Cassandra
in Hadoop Distributed Environment. In 2020 4th International Conference on Electronics,
Communication and Aerospace Technology (ICECA), pp. 1562–1567. https://doi.org/10.1109/
ICECA49313.2020.9297532
40. Sharma, R., Kumar, R., Satapathy, S. C., Al-Ansari, N., Singh, K. K., Mahapatra, R. P.,
Agarwal, A. K., Le, H. V., & Pham, B. T. (2020). Analysis of water pollution using different
physicochemical parameters: A Study of Yamuna River. Frontiers in Environmental Science,
8, 581591. https://doi.org/10.3389/fenvs.2020.581591
41. Dansana, D., Kumar, R., Parida, A., Sharma, R., Adhikari, J. D., et al. (2021). Using
susceptible-exposed-infectious-recovered model to forecast coronavirus outbreak. Computers,
Materials & Continua, 67(2), 1595–1612.
42. Patil, A., Rangarao, D., Seipp, H., Lasota, M., dos Santos, R. M., Markovic, R., Casey, S.,
Bollers, S., Gucer, V., Lin, A., Richardson, C., Rios, R., VanAlstine, R., & Medlin, T. Cloud
Object Storage as a Service IBM Redbooks. https://www.redbooks.ibm.com/redbooks/pdfs/
sg248385.pdf
43. Klein, S. IoT solutions in Microsoft's Azure IoT Suite. Springer. ISBN 9781484221426.
44. Copeland, M., Soh, J., Puca, A., Manning, M., & Gollob, D. Microsoft Azure. Springer. ISBN 9781484210444.
45. Moemeka, E. Azure in the enterprise. Springer. ISBN 9781484230862.
46. Reagan, R. Web applications on Azure. Springer. ISBN 9781484229750.
47. Morshed, M. G., & Yuan, L. (2017). Big data in cloud computing: An analysis of issues and
challenges. International Journal of Advanced Studies in Computer Science and Engineering,
6(4), 345–350.
48. Abouelmehdi, K., Beni-Hssane, A., Khaloufi, H., & Saadi, M. (2017). Big data security and
privacy in healthcare: A review. Procedia Computer Science, 113, 73–80, ISSN 1877-0509.
https://doi.org/10.1016/j.procs.2017.08.292
49. Google Cloud Platform Products. https://cloud.google.com/gcp/
50. Fortino, G., Messina, F., Rosaci, D., & Sarné, G. M. L. (2020). Using blockchain in a
reputation-based model for grouping agents in the Internet of Things. IEEE Transactions on
Engineering Management, 67(4), 1231–1243. https://doi.org/10.1109/TEM.2019.2918162