
Big Data: The V’s of the Game Changer Paradigm



Ripon Patgiri, and Arif Ahmed
Department of Computer Science & Engineering
National Institute of Technology Silchar
Assam, India-788010
{ripon, arif}
Abstract-Big Data is the most prominent paradigm nowadays.
Big Data began to gain ground slowly from 2003 and is
expected to dominate the IT industry at least until 2030.
Furthermore, Big Data has won the technological war and has
easily captured the entire market since 2009. Big Data is
spreading rapidly around the world in every domain. Big Data,
a massive amount of data, is able to generate billions in
revenue. The secret behind these billions in revenue is the
ever-growing volume.
This paper presents a redefinition of the volume of Big Data.
The volume is redefined by engaging three other V’s, namely,
voluminosity, vacuum, and vitality. Furthermore, this paper
adds two new V’s to the Big Data paradigm, namely,
vendee and vase. This paper explores all the V’s of Big Data.
There is a lot of controversy and confusion regarding the V’s
of Big Data. This paper clears up the confusion about the
V family of Big Data.
Keywords-Big Data, V family of Big Data, All V’s of Big Data,
Trends of Big Data, Redefinition of Big Data, Redefinition of
Volume
Big Data technology, which we call a game-changer
technology, is the most popular buzzword today. Big Data
is a game-changer paradigm in any field and leaves almost
no area untouched, for instance, earth science, genomics,
oceanology, aeronautics, physics, and almost all fields
where massive data are generated. Many researchers
would like to use low-cost commodity hardware to
perform high-performance computation. Nevertheless,
data-intensive and task-intensive computation differ.
Although the aim and objective of both kinds of computation
may be the same, Big Data refers to data-intensive tasks.
Big Data has been able to engage thousands of researchers,
attract an audience of millions, and generate billions
in revenue within a few years. What is the mystery behind
this magic? The answer is the ever-growing volume. Furthermore,
Big Data technology has always dealt with gigantic data
sizes to store and process. For instance, the most popular
Big Data technology, Hadoop, can scale to petabytes of data.
Notwithstanding, Hadoop has a scalability limitation,
and many researchers have already started working on
the issue of infinite scalability. Dr. Hadoop [4],
Xu et al. [5], and DROP [6] are a few examples of this
journey. However, there are many research challenges in
Big Data technology, and it is still not mature enough to
serve an infinite scale of data. Notably, Big Data
technology can scale up to petabytes of data. Many
technologies have been introduced to solve the dilemma of
Big Data, for instance, the Hadoop stack
(MapReduce, Cassandra, Hive, Mahout, HBase), Google
File System, BigTable, CephFS, and NoSQL (Giraph, Pregel,
Mizan, MongoDB, CouchDB, Berkeley DB, DynamoDB,
MemcachedDB, etc.).
Data are growing everywhere, for instance, in e-health and
wearable technology. Huge volumes of data are collected
in the form of sensor data, weather data, video
surveillance data, road traffic data, e-health data, earthquake
data, oil and natural gas data, atmospheric data, and many more.
According to IDC [27], the digital universe will reach
44 zettabytes in 2020. Further, the size of data doubles
every year. In addition, data from embedded systems will
contribute 10% of the digital universe in 2020 [27].
On the other hand, volume is synonymous with Big Data.
Big Data is concerned mostly with volume. However,
there is a lot of controversy and confusion about the V’s of
Big Data. This paper exposes the whole V family of Big Data.
The V family of Big Data is abbreviated as $V_3^{11} + C$,
where $V_3$ denotes the voluminosity, vacuum, and vitality of
volume, $V^{11}$ denotes all other V’s of Big Data, and $C$
denotes complexity.
A. Motivation
Volume is the major part of Big Data, and one can
definitely state that if we remove volume from Big Data, then
Big Data is never big enough. It becomes a small
set of data that fits well within a conventional system to
store and process.
$\text{All V’s of Big Data} - \text{Volume} \neq \text{Big Data}$
However, there are many V’s in the Big Data paradigm. The
3V’s of Big Data were defined by Doug Laney [1] in
2001 to describe data management in three dimensions.
The 3V’s were Volume, Velocity, and Variety. Nowadays,
many V’s have been added to Big Data, that is, 6V’s and
1C. The 6V’s are Veracity [16], [21], [18], [23], [20],
Value [16], [21], [18], [23], [20], Validity [16],
Variability [2], [18], [21], [23]/Volatility [16], Virtual [21], [2],
and Visualization [23]/Visibility, and the C is Complexity [24].
There is a lot of controversy and confusion among the V’s of
Big Data. Many blogs and websites have published lists of
V’s that differ from each other. Some confusing
questions are listed below:
Controversy. How many V’s are there?
Information. What are the widely accepted V’s of Big Data?
Confusion. Which are the correct V’s of Big Data?
The most widely accepted V’s are volume, velocity, variety,
veracity, and value. However, the other V’s are also important
for the Big Data paradigm.
2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International
Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems
978-1-5090-4297-5/16 $31.00 © 2016 IEEE
DOI 10.1109/HPCC-SmartCity-DSS.2016.8
B. Contribution
This article presents unique contributions to the Big Data
architecture, which are listed below.
$V_3$. This article presents three more V’s to define the
characteristics of volume in the 3V’s of Big Data defined
by Doug Laney [1]. The three V’s of volume are
voluminosity, vacuum, and vitality. Correspondingly,
we emphasize the necessity versus the demands of the volume
of Big Data.
Confusions. There is much confusion about the V’s of
Big Data, for instance, whether variability or volatility
is the correct term. This article clears up the confusion
with proper justification.
Two more V’s. This paper adds two more V’s to the
Big Data paradigm, namely, vendee and vase.
Insight. The article presents in-depth insight into the 9V+C
of Big Data.
C. Organization
This article is organized as follows. Section II
introduces the bigness of Big Data. Section III illustrates
the volume of Big Data, and the volume is redefined in
Section XIV. Section IV illustrates velocity in terms of growth
and transmission. In addition, Section V demonstrates
variety in terms of structured, unstructured, and
semi-structured data. Sections VI, VII, X, and VIII discuss
veracity, validity, virtual, and value, respectively.
Sections IX and XI resolve the confusion between
visualization and visibility, and between variability and
volatility, respectively. Section XV provides the authors’
future vision for these new V’s. Finally, Section XVI concludes
the article.
Cloud Computing and Big Data are associated with each
other. Big Data processing involves handling and processing
petabytes of data or beyond. Big Data helps
the user to use utility computing to run distributed
queries over gigantic data sizes and returns the
intended result. The integration of cloud computing and
Big Data is appealing, and much research work has been
conducted on it. Cloud computing is an enabler of Big Data
technology. Moreover, Big Data as a Service is the most
prominent research field nowadays. In addition, the Internet
of Things (IoT) is a new paradigm of computing in which
all devices are intelligent and connected to each other
through a sophisticated networking infrastructure. IoT has
been very popular in both research and academia in recent years.
The applications of IoT range from home appliances to
military equipment and sensor devices. It allows
devices with low computing capacity to connect to a central
data center. Millions of such smart devices create huge
amounts of data daily. As a consequence, the data are growing
as fast as the IoT itself. The features of IoT have impacted many
applications and, of course, the day-to-day life of people.
Computing outside the desktop environment
can be realized in the IoT paradigm [34]. In 2008, the
US National Intelligence Council reported that IoT would be
a technology with potential impact on US interests out to
2025 [32]. In 2011, the number of interconnected devices
exceeded the total population of the world [34]. A survey
reported in 2012 estimated the number of IoT devices in the
world at 9 billion, and the number was expected to
reach around 21 billion by 2020 [33]. The growing number
of IoT devices will be the main source of data in the future,
and it is growing as shown in Figure 4. The data collected
from IoT devices are used for data analysis and complex
analysis using Big Data. However, Big Data was defined by
Doug Laney in three dimensions [1] in 2001.
Figure 1. Growth of new websites [12]. (Axes: year, 1991 to 2015; number of websites.)
Figure 2. Shipment of smart phones [30]. (Axes: year, 2009 to 2020; shipment in millions of units.)
Figure 3. Big Data market forecast [26].
On the other hand, revenue is the key objective of a
technology from the perspective of the IT industry. As shown
in Figure 3, the revenue is growing linearly, which is a
good indication for Big Data technology. The revenue of Big
Data technology has an impact on researchers and is able to
draw the attention of an audience of millions. As per the
report [35], Big Data software revenue will exceed hardware
revenue by more than $7 billion by 2020. The same report [35]
expects a compound annual growth rate (CAGR) of 14% in
Big Data investment by 2020. On the other hand, Hadoop
technology alone is forecast to grow at a CAGR of 58%,
reaching a worth of $16 billion by 2020 [36]. These billions
of dollars are able to attract more audiences. Therefore,
Big Data has a close relation with $ and GB. Big Data deals
with the conversion of GB to $. Consequently, there is a
strong quest to define Big Data more accurately. The
characteristics and dilemmas of Big Data are addressed using
these V’s. Let us explore the most confusing and
controversial terminologies of Big Data.
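The compound annual growth rates quoted above can be made concrete with a short calculation. The following is a minimal sketch, not taken from the reports [35], [36]: the function names, the five-year horizon, and the normalized starting value of 1.0 are assumptions chosen only for illustration.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values over `years` years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value: float, rate: float, years: int) -> float:
    """Value reached after compounding `rate` once per year for `years` years."""
    return start_value * (1.0 + rate) ** years


if __name__ == "__main__":
    # Hypothetical normalized starting market size of 1.0 (not from the source).
    for rate in (0.14, 0.58):
        grown = project(1.0, rate, 5)
        print(f"CAGR {rate:.0%}: x{grown:.2f} after 5 years")
```

Under these assumptions, a 14% CAGR roughly doubles a quantity in five years, whereas a 58% CAGR multiplies it almost tenfold over the same period.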
Volume is a huge set of data to be stored and
processed [1]. The volume grows exponentially and does
not have any bound. The volume in Big Data is a large
set of data that is very perplexing to manage. Many
technologies have arisen to manage these huge volumes of
data, but can the technology process beyond exabytes?
Figure 4. Number of connected objects worldwide [39]. (x-axis: year, 2012 to 2020.)
The volume can span zettabytes, yottabytes, or beyond in
the future. The technology must be able to cope with the
growing size of data. The data are collected manually and
automatically in databases and data warehouses. These
data are to be managed, processed, and served. Every
bit of those data is very important to the clients as well as
to the companies. It is as difficult to store and manage 1 MB
as it is 1 TB, because the service providers have to ensure
fault tolerance, high availability, disaster management, and
security. At any time a machine can become faulty or crash,
a link may fail, and natural calamities cannot be ruled out.
Moreover, security plays a vital role wherever any
kind of data is involved. It is very difficult to provide
service in an unsecured network environment. Security can be
broken at any time, and therefore a new security system is
invented, which in turn will be broken. This phenomenon will
continue. That is why a single bit of data is as important to
a security system as 1 TB of data or beyond. Now, let us come
back to the volume of Big Data, where the data are an
unmanageably huge set. The volume is the state of being a
gigantic amount of data; it is termed big size and understood
as unmanageable data beyond the processing capacity of a
traditional system.
It is really very hard to store and process in a conventional
system. However, we further emphasize on volume in section
Figure 5. Relation among Volume, Velocity and Variety. (Label in figure: exponential growth.)
The size of data is growing exponentially, and this velocity
contributes to a bigger database [1]. Data creates more data.
The data always increases, even if we use compression
technology. Velocity is always defined with respect to
volume in Big Data. Velocity in Big Data concerns
mainly two things, namely, the speed of growth and the speed
of transfer. These two velocity requirements differ from each other.
A. Growth
In 2009, Facebook Inc. announced that the company
had 1 petabyte of data and, more interestingly, Google has
15 exabytes of data at present. The reasons for data
growth are listed below:
Users. The number of internet users is growing daily, and
each user creates a large amount of data. There were
1000 million internet users in 2005 and 3000
million in 2015 [3]. The number of internet users is growing
linearly. This implies exponential growth of data volume.
IoT. The emergence of the Internet of Things (IoT) is a
prominent contributor to Big Data growth. For instance,
sensor devices, surveillance cameras, RFID, etc. are
prominent data generators for Big Data. Moreover, the
number of connected objects is increasing, as shown in
Figure 4. In addition, the number of smart phones
is increasing, as shown in Figure 2.
Cloud Computing. The evolution of cloud computing
generates huge data to store, process, and manage.
Website. Every day many new websites are launched.
As shown in Figure 1, the number of websites is
rising exponentially. The growth rate of new websites
was highest in 2011 and 2014. The newly spawned
websites generate huge data, and thus volume is
a prominent field of research.
Scientific Data. Scientific data are naturally huge
in size, for instance, seismic, ocean, and weather data.
B. Transmission
Where there is Big Data, transferring the data becomes a
prominent issue, that is, when a massive amount of data is on
the move. The transmission of a small volume does not pose any
problem except for security concerns. The movement of a large
volume creates problems even if there is a fibre-optic medium
to transfer the data. If the distance increases, then
latency becomes a prominent barrier to transfer [8].
Bandwidth is also a barrier in the field of Big Data. Users
always require minimal latency with the highest
bandwidth, which is not possible due to the cost involved.
Therefore, Big Data technology also concerns about $ per
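The transmission barrier described above can be estimated with an idealized model in which transfer time is roughly the latency plus the data size divided by the bandwidth. The sketch below is illustrative only: the function name, the 10 Gbit/s link speed, and the 0.1 s latency are assumptions, and real transfers are further slowed by protocol overhead and congestion.

```python
def transfer_time_seconds(size_bytes: float, bandwidth_bits_per_s: float,
                          latency_s: float = 0.0) -> float:
    """Idealized time to move `size_bytes` over one link: latency + size / bandwidth."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + (size_bytes * 8) / bandwidth_bits_per_s


PETABYTE = 10 ** 15          # decimal petabyte, in bytes
TEN_GBPS = 10 * 10 ** 9      # assumed link speed, in bits per second

if __name__ == "__main__":
    seconds = transfer_time_seconds(PETABYTE, TEN_GBPS, latency_s=0.1)
    print(f"1 PB over 10 Gbit/s: {seconds / 86400:.1f} days")
```

Even under these optimistic assumptions, moving one petabyte takes on the order of nine days, which illustrates why bandwidth rather than latency dominates bulk transfers.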
Variety [1], [16], [21], [23] refers to the assorted forms of
data, which include structured, semi-structured, and
unstructured data. Structured data comprises data in tabular
form, semi-structured data includes log data and XML data,
and unstructured data includes videos, images, and scientific
information. A big database comprises a variety of data
[1]. Therefore, the volume grows at an exponential pace
with a variety of data, as shown in Figure 5. At the storage
level, everything is a file, so in that sense there is no
variety: the data is concerned with files only. Yet there are
varieties within structured, semi-structured, and
unstructured data. For instance, there are
.docx, .doc, .odt, .text, etc. for textual data. There is a
variety of data formats, and these varieties of data make up
the high volume of Big Data. Unstructured data has more
varieties than the other two categories. Unstructured
data contributes 90% of Big Data [13]. Unstructured
data is increasing daily and is becoming a massive source of
variety in Big Data. A massive number of varieties of data
causes a big issue, with either low volume or high volume.
Veracity is accuracy, truthfulness, and meaningfulness
[16], [21], [18], [23], [20]. Big Data with huge volume
becomes problematic when we would like to perform some
operation on the data. The question is: how do we know
that the operation is successful? Is it accurate? Veracity
is also a large problem in Big Data, as it is nearly impossible
to spell-check such a huge quantity of information. Any data
is worthless and meaningless if it is not accurate [23].
Inaccurate data may lead to a wrong direction or decision.
For instance, when we recommend products to
users, accuracy matters a lot, because revenue
is affected by inaccurate results.
On the other hand, even though a process can perform a
task accurately, the data may not be valid [16]. For
example, very old data in e-commerce becomes obsolete
and can be truncated. However, some data never becomes
obsolete, for instance, the transaction log records of a bank.
Sometimes, very old data is not a valid input for a
process. Validity may differ from time to time.
Validity refers to data that has worth. Correct
data may not be valid for certain processing. In a huge
volume of data, not all of the data are valid. It depends on
time as well.
$\text{Volume} - \text{Validity} = \text{Worthlessness}$ [16]
However, Big Data is concerned mainly with extracting
value from enormous stored data [16], [21], [18], [23],
[20]. Big Data extracts value from data, reveals hidden
truth in the data, uncovers useful messages in the data, and
creates value from data. Data in itself has no value.
Massive sets of data are processed to give them worth, for
instance, through Big Data Mining. Big Data Mining is nothing
but a bigger dimension of Data Mining. The dumped data
are mined to search for hidden jewels. Therefore, Big
Data is a platform that gives worth to unworthy data.
$\text{Big Data} = \text{Data} + \text{Value}$ [16]
Visibility is the state of being able to see or be
seen. Disparate data have to be stitched together to be shown,
whether the data sources are disparate or the data itself. There
is no point in archiving data if the data cannot be seen or
shown. Visibility defines the data to be viewed from
a point of reference. The data may be visible from an
administrator’s side or another side, but the data are made
visible. On the other hand, visualization [23] is the process
of showing the hidden data of Big Data. This term is more
precise for describing Big Data, because visualization is the
act of making visible. Visualization is the key process for
enhancing the performance of data and business
processes/decisions. In this investigation, we found that the
key to business success is analysis using the company’s own
data, and that the visualization process shows the growth rate
of the company’s revenues. Thus, we exclude the term
“visibility” from the Big Data “V” family. The visualization of
a large volume of data is a dilemma in Big Data [23]. For
instance, how would we view the followers of Barack Obama on
Twitter? Conventional technology is never able to answer this
query, because Barack Obama has 75,705,896 followers [14]. In
addition, the visualization of a Big Graph is a very big issue
and a very complex process.
The term virtual is found in the articles [21], [2]. Data
management is itself a virtual process, and the term
delineates the management of data. Virtual is a process to
manage data effectively and efficiently as per the demand of
users. If we recall the conventional operating system, then
virtual is the management of resources, for instance, demand
paging. This also applies to Big Data. Moreover, Big Data
analytics visualizes the required data, which is a purely
virtual process. Cloud Computing makes a tremendous
contribution to the growth of Big Data, and
Cloud Computing evolves on the basis of virtualization.
Therefore, the term “virtual” is accurate for describing the
characteristics of Big Data.
Variability [2], [23] and volatility [16] are terms that are
in the same boat, yet they conflict with each other.
Volatility is the nature of changing abruptly: sudden change,
instability, and change made unintentionally or anonymously.
The volatile nature is superfluous in Big Data. On the other
hand, variability is the nature of changing, shifting,
mutation, and modification with good intention. Therefore,
“variability” carries more weight and is more accurate for
describing the characteristics of Big Data. Changes in the
data may happen due to time, modification by users, or data
becoming obsolete. For instance, Katy Perry is the most
followed person on Twitter [14], but after some time this may
change. Moreover, a video on YouTube may suddenly become
popular, with its access rate increasing exponentially. After
some time, the access rate of this video falls. This is the
nature of variability.
Complexity is a purely computational term.
There is very high complexity in the case of a bulk volume
of data [24]. A large volume of data is stored, processed,
and managed to serve its intended clients. Handling a
huge volume of data is always associated with high complexity.
Time flies and waits for no one! Therefore, it is the right
time to take a fresh look at the volume of the 3V’s of Big
Data. In general science, volume is defined as a cube, namely,
by height, width, and length. As shown in Figure 6, the volume
is 3D in Big Data. Another 3V’s are added to Big Data. These
3V’s define the characteristics of volume in Big Data.
Volume is a cog in the machine of Big Data. Big Data
and volume have analogous meanings, and thus volume
is an integral part of Big Data. Further, volume is the
real backbone of Big Data. However, the technology
must support the volume of Big Data to store, process, and
manage it. It is certain that the size of the volume in the
Big Data paradigm is not going to decrease in the near future.
That is why this is the time to experiment with the behavior
of volume. Further, a 360-degree investigation of volume is
always called for, since volume is the main dilemma of Big
Data to conquer. From the hardware perspective, the floppy
disk gave way to the DVD, and right now the DVD is
almost obsolete. Moreover, the SSD is emerging to replace
the HDD. Different sorts of RAM have been introduced. For
instance, RAMCloud [8], [9] has been introduced to
overcome the latency issue. Such technology transitions
occur only because of the growth of volume in Big Data. There
are even many issues in the data warehouse due to rising
volume [11]. On the other hand, the software perspective
is also promoting technology transitions by innovating
new technology and devising new algorithms. For instance,
erasure coding creates more empty storage space [25].
Figure 6. Re-definition of Volume of 3V’s of Big Data.
A. Voluminosity
Voluminosity is the greatness of volume. Voluminosity
in volume states that there is a very large set of
data collected so far and yet to be collected. The data bank
becomes unmanageable if the data is extensive in size, and
hence this is the time to do research on how to store, process,
and manage the mammoth size of data such that these data
yield revenue. Of course, there is a significant gap between
the volume collected so far and the volume yet to be
collected. Surprisingly, Google and the NSA claim that they
have data warehouses of 15 and 10 exabytes, respectively
[37], [38]. These data are not collected to be dumped; they
are collected to generate revenue. After a few years, many
companies will surpass exabytes, and this is the ground truth.
Let us explore the current trends of voluminosity. Every
second, 7,238 Tweets are sent, 54,708 Google searches are
made, 124,563 YouTube videos are viewed, 2,497,667 emails are
sent, 2,158 Skype calls are made, 35,651 GB of Internet
traffic flows, 726 Instagram photos are uploaded, and 1,127
Tumblr posts are made [28]. Moreover, Facebook had 1.09 billion
daily active users on average for March 2016,
989 million mobile daily active users on average
for March 2016, 1.65 billion monthly active users
as of March 31, 2016, and 1.51 billion mobile monthly
active users as of March 31, 2016 [29]. Another example of
voluminosity is shown in Figure 2, where millions
of smart phones are shipped. These smart phones do not merely
create data; putting it another way, smart phones create
billions of photos. Moreover, the number of connected objects
will increase, and these objects generate a tremendous amount
of data. Voluminous data is acquired, stored, processed, and
managed, and this is an onerous task. However, technological
advances can partially solve the problem. Even so, real-time
processing of Big Data is strenuous with the current
state-of-the-art technology. By the same token, when data
reach voluminosity, they become more arduous to process in
real time.
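To give a sense of scale, the per-second figures quoted from [28] can be extrapolated to a full day. This is a rough sketch under a loud assumption: it treats each rate as constant over 24 hours, which real traffic is not. The dictionary keys and the helper name `per_day` are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Per-second rates as quoted in the text from [28].
PER_SECOND = {
    "tweets": 7_238,
    "google_searches": 54_708,
    "youtube_views": 124_563,
    "emails": 2_497_667,
    "internet_traffic_gb": 35_651,
}


def per_day(rate_per_second: int) -> int:
    """Scale a per-second rate to a full day, assuming the rate stays constant."""
    return rate_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    for name, rate in PER_SECOND.items():
        print(f"{name}: {per_day(rate):,} per day")
    # Decimal units: 1 EB = 10**9 GB.
    traffic_eb = per_day(PER_SECOND["internet_traffic_gb"]) / 10 ** 9
    print(f"internet traffic: about {traffic_eb:.2f} EB per day")
```

Under the constant-rate assumption, the quoted Internet traffic alone amounts to roughly three exabytes per day.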
B. Vacuum
Big Data without volume is just a small set of data
that can be processed in the conventional way. Vacuum
in volume states that there is a strong requirement for empty
space to store a large volume of data. Furthermore, vacuum
also refers to the creation of room to store, process, and
manage a tremendous data set alongside the existing data sets.
This is the time to think about how much storage space is
available for incoming data rather than how much data we have
stored. The process of creating storage space for incoming
data is as challenging as managing these vast sets
of data. Vacuum is concerned with creating empty space,
either by augmenting storage devices or by other techniques
that reduce the size of data. For example, storage space can
be created by applying erasure codes to unused data. Moreover,
we can apply compression algorithms to infrequently used
data as well as to frequently used data [10]. The article [10]
shows that processing uncompressed data is faster
than processing compressed data. De-duplication likewise
improves the storage performance of the system. Its opposite,
replication, is the mechanism of duplicating a data block
into multiple blocks to maximize parallelism.
De-duplication and replication can also be combined to enhance
the performance of a storage system [25]. However, big
problems always come along with a humongous set of data.
The key research question is how to enhance a storage
system to process a gigantic amount of data. Exabytes of
data are not far away. Let us step into the future
today by concentrating on vacuum.
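The way erasure coding creates empty space relative to replication can be illustrated by comparing raw storage overheads. The sketch below is an assumption-laden illustration rather than the scheme of [25]: the three-copy replication factor and the (6, 3) data/parity layout are hypothetical parameters, and the function names are introduced only for this example.

```python
def replication_raw_bytes(logical_bytes: float, copies: int = 3) -> float:
    """Raw storage consumed when every block is stored `copies` times."""
    return logical_bytes * copies


def erasure_raw_bytes(logical_bytes: float, data_blocks: int,
                      parity_blocks: int) -> float:
    """Raw storage for an erasure code with the given data and parity block counts."""
    return logical_bytes * (data_blocks + parity_blocks) / data_blocks


if __name__ == "__main__":
    logical = 1.0  # one unit of user data, e.g. 1 PB
    replicated = replication_raw_bytes(logical, copies=3)
    coded = erasure_raw_bytes(logical, data_blocks=6, parity_blocks=3)
    print(f"3x replication: {replicated:.2f} units raw")
    print(f"(6, 3) erasure code: {coded:.2f} units raw")
    print(f"space freed by switching: {replicated - coded:.2f} units")
```

With these assumed parameters, re-encoding rarely used data from three replicas to the erasure-coded layout halves its raw footprint, which is the kind of vacuum the text describes.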
C. Vitality
Vitality is the power to endure, survive, and grow. The
vitality of a volume states that a massive amount
of the data is actively served and the remainder is not.
Further, vitality concerns the survival of data in a Big Data
storage environment, that is, the reliability of data. In a
large data bank, some data are actively used and some are not.
However, the company generates revenue from the actively
used data only, and the rest are stored in the hope of future
use. Let us assume a farmhouse holding exabytes of data
with no vitality. Consequently, the risk becomes higher,
and anything can happen to those data. Anything can happen
anywhere. In other words, fire, earthquake,
flood, war, and terrorism are the prominent causes of data
loss. This is not a big issue for a small amount of data.
More issues arise when the volume increases
and reaches its limit. Then, vitality is the most prominent
question. For instance, how do we rely on a system when
there is no disaster management system? If there is none,
then how do we implement one for a colossal volume of data
(beyond exabytes)? It is literally a billion-dollar question.
Undoubtedly, the disaster management system is always a
prominent research issue. Apart from the disaster management
system, fault-tolerance systems also play a vital role in Big
Data. However, ensuring vitality is a very big deal with a
tremendous data size. Moreover, vitality covers reliability,
flexibility, dependability, and security. Therefore, vitality
is an integral component of the volume of Big Data.
A. Vendee
One more new V in Big Data is “Vendee”, which defines
the size of the client base that Big Data has to deal with.
Vendee is the most significant component in defining
Big Data, since the 9V’s exist only to conform to the
clients’ requirements. Moreover, data and
clients/users are ubiquitous. In short, business is a
client-driven process. Consider Facebook, which has
1.01 billion users per month. The access log of those
billions of users becomes very large to handle and store.
Not just that, the actions performed by these users are
enormous in data size. Furthermore, there are 100 billion
Google searches per month [12]. The users/clients aggrandize
the companies. The crystal-clear fact is that Big Data
involves big user/customer management. Unquestionably,
“Vendee” is very easy to define and to interpret as a
component of Big Data, and the 9V’s are under the umbrella of
“Vendee”. The trillions of users to deal with, the exponential
growth of users, and the day-to-day requirements of the
clients are also a crucial
part of Big Data. Without “Vendee”, does Big Data has a
B. Vase
According to the Merriam-Webster dictionary, a vase is a
container that is used for holding flowers or for decoration.
Big Data is high in volume and is stored in the data center.
The data center requires a farmhouse, land, huge electric
power, thousands of hardware units, thousands of personnel,
varieties of hardware, and many more small products to
enable Big Data. In the Big Data paradigm, the flower refers
to Big Data and the vase refers to the underlying requirements
that enable Big Data. The Google data centers alone consume
1.1% to 1.5% of global electricity [22]. What happens
to the companies that deal with gigantic sizes of
data? An exabyte-scale data warehouse requires not only
a large amount of hardware for storage, but also electric
power, manpower, a disaster recovery system, land, a
farmhouse, etc. [38]. Big Data cannot do without the vase.
The vase is the root of Big Data.
As shown in Figures 1, 2, 3, and 4, we can
easily guess the future of Big Data without doing
research on it. Big Data will dominate the market at least
until 2030 and has proven to be a crackerjack
technology. New technologies are introduced day by
day, and they will generate data by any means. Therefore, data
will continue to grow. It will never stop. Thus, the
volume of data increases and will create a dilemma for
the industry. This is the time to do research on the
$V_3^{11} + C$ of Big Data to develop new technology, since it
is among the most current demands of Big Data, like IoT. We
can create the future, but not always. In the technological
sense, we mostly predict the future first and then create it.
Nevertheless, the volume is growing exponentially;
there is no need to imagine it, as it is a clearly “visible
future” from today. Voluminosity, vacuum, and vitality will be
the most prominent research areas in the near future, as
will vendee and vase. Right now, most researchers have
begun the journey in this direction.
The volume of Big Data takes the lion’s share of the revenue,
and without it Big Data does not exist. We
have demonstrated the 9V’s of Big Data and engaged more new
V’s to enhance Big Data technology. The V family
($V_3^{11} + C$) is worthy in the sense of future technology
and of stepping towards the future right now. Finally, we
conclude that $V_3^{11} + C$ is a milestone in developing
the future technology of Big Data, and a new technology cannot
be imagined without these V’s of Big Data.
[1] Doug Laney, “3D Data Management: Controlling Data
Volume, Velocity, and Variety”, Gartner, file No.
949. 6 February 2001,
Data-Volume-Velocity- and-Variety.pdf
[2] C.L. Philip Chen, and Chun-Yang Zhang, “Data-intensive
applications, challenges, techniques and technologies: A survey
on Big Data”, Information Sciences, 275 (2014), pages 314-347,
[3] Number of internet users worldwide from 2005 to 2015
(in millions), [online], Retrieved on 11 June 2016 from internet-
[4] Dipayan Dev, and Ripon Patgiri, “Dr. Hadoop: an infinite scalable
metadata management for Hadoop - How the baby elephant
becomes immortal”, Frontiers of Information Technology &
Electronic Engineering, January 2016, 17(1), pp 15-31, DOI:
[5] Quanqing Xu, Rajesh Vellore Arumugam, Khai Leong Yong,
and Sridhar Mahadevan, “Efficient and Scalable Metadata
Management in EB-Scale File Systems”, IEEE Transactions on
Parallel and Distributed Systems, 25(11), pages 2840 - 2850,
2014, DOI:
[6] Quanqing Xu, Rajesh Vellore Arumugam, Khai Leong Yong,
and Sridhar Mahadevan, “DROP: Facilitating distributed metadata
management in EB-scale storage systems”, In 2013 IEEE
29th Symposium on Mass Storage Systems and Technologies
(MSST), 6-10 May 2013, pages 1-10, DOI:
[7] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: simplified
data processing on large clusters”, In OSDI’04: Sixth Sympo-
sium on Operating System Design and Implementation, San
Francisco, CA, December, 2004.
[8] John Ousterhout, Parag Agrawal, David Erickson, Christos
Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra,
Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel
Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan
Stutsman, “The case for RAMCloud”, Communications of
the ACM, 54(7), July 2011, DOI:
[9] John Ousterhout, Arjun Gopalan, Ashish Gupta, Ankita Ke-
jriwal, Collin Lee, Behnam Montazeri, Diego Ongaro, Seo
Jin Park, Henry Qin, Mendel Rosenblum, Stephen Rumble,
Ryan Stutsman, and Stephen Yang, “The RAMCloud Storage
System”, ACM Transactions on Computer Systems (TOCS),
33(3), September 2015, Article No. 7, DOI:
[10] Adnan Haider, Xi Yang, Ning Liu, Xian-He Sun, and Shuib-
ing He, “IC-Data: Improving Compressed Data Processing in
Hadoop”, 2015 IEEE 22nd International Conference on High
Performance Computing (HiPC), 16-19 Dec. 2015, pages 356
- 365, DOI:
[11] Alfredo Cuzzocrea, “Warehousing and Protecting Big
Data: State-Of-The-Art-Analysis, Methodologies, Future Challenges”,
In ICC ’16: Proceedings of the International Conference
on Internet of things and Cloud Computing, Article
number 14, 2016, DOI:
[12] “Internet Live Stats”, [online], Retrieved on 11 June
2016 from
[13] Mark van Rijmenam, “Why The 3V’s Are Not Sufficient To
Describe Big Data”, [Online], Accessed on 20 June 2016, from
[14] Twitter Counter, “Twitter Top 100 Most Followers”, [online],
Accessed on 27 June 2016 from
[15] Gema Bello-Orgaz, Jason J. Jung, and David Camacho,
“Social big data: Recent achievements and new challenges”,
Information Fusion, 28(March), pages 45-59, 2016 DOI: http:
[16] M Ali-ud-din Khan, M F Uddin, and N Gupta, “Seven Vs of
Big Data: Understanding Big Data to extract Value”, In 2014
Zone 1 Conference of the American Society for Engineering
Education (ASEE Zone 1), pages 3-5, April, 2014, DOI: http:
[17] Cheikh Kacfah Emani, Nadine Cullot, and Christophe
Nicolle, “Understandable Big Data: A survey”, Computer
Science Review, 17, pages 70-81, 2015, DOI:
[18] Yuri Demchenko, Cees de Laat, and Peter Membrey, “Defin-
ing Architecture Components of the Big Data Ecosystem”, In
2014 International Conference on Collaboration Technologies
and Systems (CTS), pages 104 - 112, 2014, DOI: http://dx.doi.
[19] Jianzheng Liu, Jie Li, Weifeng Li and Jiansheng Wu, “Re-
thinking big data: A review on the data quality and usage
issues”, ISPRS Journal of Photogrammetry and Remote Sens-
ing, In Press, DOI:
[20] Xiaolong Jin, Benjamin W. Wah, Xueqi Cheng, and Yuanzhuo
Wang, “Significance and Challenges of Big Data Research”,
Big Data Research, 2(2), pages 59-64, 2015, DOI: http://dx.
[21], “The 7 pillars of Big Data”, A White
Paper of Landmark Solutions, Retrieved on 10, June, 2016
Whitepapers/The 7 pillars of Big Data Whitepaper.pdf
[22] Inhabitat, “INFOGRAPHIC: How Much Energy Does
Google Use?”, [online], Accessed on 28 June 2016
[23] Eileen McNulty, “Understanding Big Data: The Seven Vs”,
[online] Retrieved 10, June, 2016 from
seven-vs-big- data/
[24] Monica Bulger, Greg Taylor, and Ralph Schroeder, “Engaging
Complexity: Challenges and Opportunities of Big Data”,
London: NEMODE, 2014.
[25] Mingyuan Xia, Mohit Saxena, Mario Blaum, and David A.
Pease, “A tale of two erasure codes in HDFS”, In Proceedings
of the 13th USENIX Conference on File and Storage Tech-
nologies, Pages 213-226, 2015.
[26] Louis Columbus, “Roundup Of Analytics, Big
Data & Business Intelligence Forecasts And Market
Estimates, 2015”, [online], Retrieved on 11 June 2016
roundup-of-analytics-big-data-business-intelligence- forecasts-
[27] “The Digital Universe of Opportunities: Rich Data and the
Increasing Value of the Internet of Things”, [online], Retrieved
on 11 June 2016 from
[28] ILS, “Internet Live Stats”, Retrieved on 12 June 2016,
at 2:31AM from
[29] Facebook, “Facebook Newsroom”, [online], Retrieved on 12
June 2016 from
[30] IDC, “Global smartphone shipments forecast from 2010
to 2020 (in million units)”, [online], Retrieved on 12 June
2016 from
[31] Rajkumar Buyya, Chee Shin Yeo, and Srikumar Venugopal,
“Market-oriented cloud computing: Vision, hype, and reality
for delivering it services as computing utilities”, In 10th IEEE
International Conference on High Performance Computing and
Communications, pages 5 - 13, 2008, DOI:
[32] SRI Consulting Business Intelligence, “Disruptive Civil Tech-
nologies: Six Technologies with Potential Impacts on US
Interests out to 2025”, [online], Retrieved on 12 June 2016
[33] Nathan Eddy, “Gartner: 21 Billion IoT Devices To Invade
By 2020”, [online], Retrieved on 12 June 2016 from http:
21-billion-iot-devices-to- invade-by-2020/d/d-id/1323081
[34] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “In-
ternet of Things (IoT): A vision, architectural elements, and
future directions”, Future Gener. Comput. Syst., 29(7), pages
1645-1660, 2013, DOI:
[35] “Big Data Market 2016 to 2030 - Company Profiles and
Strategies of 150 Big Data Ecosystem Players”, Retrieved
on 12 June 2016 from
[36] “Hadoop Market Forecast 2017-2022”, [online], Retrieved on
12 June 2016 from
[37] Follow the Data, “Data Size Estimates”, Retrieved on 10
June 2016 from
[38] James Zetlen, “Google’s datacenters on punch cards”, [online],
Retrieved on 10 June 2016 from https://what-if.xkcd.
[39] Charles McLellan, “The internet of things and big data:
Unlocking the power”, [online], Retrieved on 10 June 2016
... The research then was developed by Patgiri and Ahmed (2016), which finds the V family ( ). Which are (1) Variety, (2) Veracity is the accuracy data; (3) Validity is the data that have worthiness, (4) Value is the sets of data that give worthiness, (5) Visibility / (6) Visualization is data ability to be shown, (7) Virtual is data efficiently and effectively to be managed in a virtual process, (8) Variability / (9) Volatility is the sudden changes of data, (10) Vendee defines the client size is associated with the big data, and (11) Vase is the receptacle that holds the data. ...
... Complexity is the pure form of a computational term. The term volume is redefined into a) Voluminosity is the size of the volume, b) Vacuum is data without volume, and c) Vitality is the activity of the data (Patgiri & Ahmed, 2016). ...
Technology always plays a significant role for humans in simplifying and optimizing tools and ideas to solve problems. When the first society was created began with foraging and livestock, the population increased. The Industrial Revolution 4.0 significantly impacted agriculture at the country level in a tolerable and understandable way. This paper aims to assess the implementation of preparation IR. 4.0 in Indonesia achieved its objectives or could be optimized. The previous research was in general and mainly focused on technological factors. . Indonesia is the fourth most populous country globally, and many of its citizens depend on their livelihood from the agriculture sector (Badan Pusat Statistik / Central Bureau of Statistic, 2021). Based on the national policy of Making Indonesia 4.0 and programs implemented in The Ministry of Agriculture, This research employs both quantitative and qualitative methods. A systematic review was conducted to check published articles regarding the topic from Food and Agricultural Organization (FAO) and Indonesian Governmental Institutions. To fully understand the topic, four research sources were used to find relevant data published between 2000 and 2021. The inclusion criteria were related to the IR 4.0 impact on the agriculture sector, original research, and published in English. Thirty articles have met the criteria. The findings have shown that IR 4.0 has revolutionized agriculture to become Agriculture 4.0, and various innovations are taken to increase the productivity of food production. Applications also built a stepping stone but yet worked for one purpose only.
... Random data sets. But it is growing exponentially over time [2]. Big and complex data big data is a data but a large amount. ...
Full-text available
Big Data is a collection of technologies developed to store, analyze and manage this data. It is a macro tool. Today it is used in fields as diverse as medicine, agriculture, gambling and environmental protection. Machine learning, forecasting Companies use big data to streamline their marketing campaigns and techniques. Modeling and other advanced analytics applications enable big data organizations to generate valuable insights. Companies use it in machine learning. Programs cannot balance large data in any particular database. When datasets are large in size, velocity, and volume, the following three distinct dimensions constitute "big data." Examples of big data analytics include stock markets, social media platforms, and jet engines. But it is growing exponentially with time. Traditional data management tools cannot store or process it efficiently. It is a large scale technology developed by Story, Analysis and Manoj, Macro-Tool. Find patterns of exploding confusion in information about smart design, however, implying that various big data structures are structured, whether this includes the amount of information, the speed with which it is generated and collected, or the variety or scope of related data points. In the past, data was collected only from spreadsheets and databases of large, disparate information that grew at an ever-increasing rate. Very large, complex data refers to large amounts of data for sets that are impossible to analyze with traditional data processing applications. Is software configuration used to handle the problem? Apache is an advanced application for storing and processing large, complex data sets and analyzing big data. Analytical techniques against very large, heterogeneous databases containing varying amounts of data, big data analytics help businesses gain insights from today's massive data sources. Defined as software tools for processing and extraction. 
People, companies and more and more machines are now developing technologies to analyze more complex and large databases that cannot be handled by traditional management tools. It is designed to discover patterns of chaos that explode in information in order to design smart solutions.
... Gartner et al. [7] further define big data as "high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making". Based on the definition by Gartner, [8] highlights the main characteristics of big data namely Volume (data size), Velocity (data speed), Variety (various data formats), Value (benefits), and Veracity (integrity and trustworthiness). ...
Full-text available
The telecommunication industry is the leading industry in big data trends as this industry has the most capable infrastructure for big data. However, the adoption of big data in telecommunication services also raises important security and privacy challenges due to the high volume, velocity, and variety of big data characteristics. To address the issue, this study shed light on the security and privacy challenges of big data adoption in the telecommunication industry. This study focuses on investigating the security and privacy challenges for data users in telecommunication services from the lens of the TOE framework and examines the mitigation strategies to address the privacy and security challenges. This study is conducted using qualitative methodology. From the perspectives of data users (telecommunication providers), it could be concluded that data management, data privacy, data compliance, and regulatory orchestration challenges are the most pressing concerns in big data adoption. This study offers contributions in presenting a thematic classification of security and privacy challenges and their mitigation strategies for big data adoption in the telecommunication industry. The thematic classification highlights potential gaps for future research in the big data security domain. This study is significant in that it provides empirical evidence for the perspectives of telecommunication data users in addressing privacy and security issues that are related to big data adoption.
... Veracity indicates that the data may be dirty, whereas variety means that the data type may be structured, semi structured and unstructured. Finally, value indicates that big data holds potential value [21]. From the explanation of the V's characteristics, the big data acquires data from a variety of sources and a variety of formats. ...
Full-text available
Big data is increasingly being promoted as a game changer for the future of science, as the volume of data has exploded in recent years. Big data characterized, among others, the data comes from multiple sources, multi-format, comply to 5-V’s in nature (value, volume, velocity, variety, and veracity). Big data also constitutes structured data, semi-structured data, and unstructured-data. These characteristics of big data formed “big data ecosystem” that have various active nodes involved. Regardless such complex characteristics of big data, the studies show that there exists inherent structure that can be very useful to provide meaningful solutions for various problems. One of the problems is anticipating proper action to students’ achievement. It is common practice that lecturer treat his/her class with “one-size-fts-all” policy and strategy. Whilst, the degree of students’ understanding, due to several factors, may not the same. Furthermore, it is often too late to take action to rescue the student’s achievement in trouble. This study attempted to gather all possible features involved from multiple data sources: national education databases, reports, webpages and so forth. The multiple data sources comprise data on undergraduate students from 13 provinces in Indonesia, including students’ academic histories, demographic profles and socioeconomic backgrounds and institutional information (i.e. level of accreditation, programmes of study, type of university, geographical location). Gathered data is furthermore preprocessed using various techniques to overcome missing value, data categorisation, data consistency, data quality assurance, to produce relatively clean and sound big dataset. Principal component analysis (PCA) is employed in order to reduce dimensions of big dataset and furthermore use K-Means methods to reveal clusters (inherent structure) that may occur in that big dataset. There are 7 clusters suggested by K-Means analysis: 1. 
very low-risk students, 2. low-risk students, 3. moderate-risk students, 4. fuctuating-risk students, 5. high risk students, 6. very high-risk students and, 7. fail students. Among the clusters unreveal, (1) a gap between public universities and private universities across the three regions in Indonesia, (2) a gap between STEM and non-STEM programmes of study, (3) a gap between rural versus urban, (4) a gap of accreditation status, (5) a gap of quality human resources distribution, etc. Further study, we will use the characteristics of each cluster to predict students’ achievement based on students’ profles, and provide solutions and interventions strategies for students to improve their likely success.
... Big Data have recognizable feature called V's of Big Data [74], shown in Figure 2.5. Name is derived from initial letters of the big Data features. ...
Full-text available
Cloud computing is facing some serious latency issues due to huge volumes of data that need to be transferred from the place where data is generated to the cloud. For some types of applications, this is not acceptable. One of the possible solutions to this problem is the idea to bring cloud services closer to the edge of the network, where data originates. This idea is called edge computing, and it is advertised that it dramatically reduces the network latency as a bridge that links the users and the clouds, and as such, it makes the foundation for future interconnected applications. Edge computing is a relatively new area of research and still faces many challenges like geo-organization and a clear separation of concerns, but also remote configuration, well defined native applications model, and limited node capacity. Because of these issues, edge computing is hard to be offered as a service for future real-time user-centric applications. This thesis presents the dynamic organization of geo-distributed edge nodes into micro data-centers and forming micro-clouds to cover any arbitrary area and expand capacity, availability, and reliability. We use a cloud organization as an influence with adaptations for a different environment with a clear separation of concerns, and native applications model that can leverage the newly formed system. We argue that the presented model can be integrated into existing solutions or used as a base for the development of future systems. Furthermore, we give a clear separation of concerns for the proposed model. With the separation of concerns setup, edge-native applications model, and a unified node organization, we are moving towards the idea of edge computing as a service, like any other utility in cloud computing. The first chapter of this thesis, gives motivation and problem are that this thesis is trying to resolve. It also presents research questions, hypotheses and goals based on these questions. 
The second chapter gives an introduction to the area of distributed systems, narrowing it down only the parts that are important for further understanding of the other chapters and the rest of the thesis in general. The third chapter shows related work from different areas that are connected or that influenced this thesis. This chapter also shows what the current state of the art in industry and academia is, and describes the position of this thesis compared to the related research as well. The fourth chapter proposes a model that is influenced by cloud computing architectural organizations but adapted for a different environment. We present how we can separate the geographic area into micro data-centers that are zonally organized to serve the local population, and form them dynamically. This chapter also gives formal models for all protocols used for the creation of such a system with separation of concerns, applications models, and presents limitations of this thesis. The fifth presents an implemented framework that is based on the model described in chapter three. We describe the architecture, and in detail every operation a framework can do, with all existing limitations. The sixth chapter presents the usability of the proposed model, with possible applications that could be implemented based on the model. We also present one example of COVID-19 area traffic control in the city of Milan, Italy. The seventh and the last chapter concludes this thesis and presents future work that should be done. Key words: distributed systems, cloud computing, multi cloud, microservices, software as a service, edge computing, micro clouds, big data, infrastructure as code.
Full-text available
Our society has an insatiable appetite for data. Much of the data is collected to monitor the activities of people, e.g., for discovering the purchasing behaviour of customers, observing the users of apps, managing the performance of personnel, and conforming to regulations and laws, etc. Although monitoring practices are ubiquitous, monitoring as a general concept has received little analytical attention. We explore: (i) the nature of monitoring facilitated by software; (ii) the structure of monitoring processes; and (iii) the classification of monitoring systems. We propose an abstract definition of monitoring as a theoretical tool to analyse, document, and compare disparate monitoring applications. For us, monitoring is simply the systematic collection of data about the behaviour of people and objects. We then extend this concept with mechanisms for detecting events that require interventions and changes in behaviour, and describe five types of monitoring. We argue for the development of a general theory of monitoring. ****Online first open access version:
Full-text available
In today's technology age, digitalization is an important issue within the framework of the globalizing world structure, the internet's gaining momentum, and becoming a part of life also changes daily life practices. For this reason, many individuals, institutions, and organizations have to develop and transform themselves in order to keep up with the structure of the changing world. Journalism practices are some of the structures that need to adapt to the new digital world by improving themselves within the framework of this change and transformation. For this reason, in the context of this study, the perception of journalism and journalism practices, which is one of the structures that have transformed in the light of the changing world balances and perceptions, will be examined; the formation of people to become the data of the digital world and the concept of digital journalism will be examined by emphasizing the concept of big data, which is the main formation of this data. It is examined by the method of literature review through the technological determinism approach.
Full-text available
O século XX foi marcado pelo desenvolvimento das Tecnologias da Informação e Comunicações, grande parte financiadas com recursos públicos. Dentre elas, a internet originou-se nos EUA como uma resposta geopolítica ao avanço tecnológico soviético. Em meados do fim da Guerra Fria, a ascensão do neoliberalismo econômico fez com que a iniciativa privada ganhasse influência no setor, abrindo espaço para a desregulamentação financeira e privatização de tecnologias. Após a abertura comercial da internet, a pressão competitiva pelos mercados da “nova economia” proporcionou o surgimento de empresas inovadoras e o avanço global da internet, fazendo-nos entrar na era do Big Data, uma época de produção de dados em massa. Nos últimos anos, um oligopólio constituiu-se nos EUA para explorar economicamente esses recursos digitais, que se dilatou durante o governo Obama. Após a expansão econômica, a era Trump suscitou problemáticas sobre o poder político das chamadas Big Techs, marcando o início de tentativas de contenção pelo chefe do Executivo e Legislativo. Entre as iniciativas, Trump buscou minar o poder corporativo através de ordens executivas, enquanto o Congresso abriu uma investigação antitruste destinada a quebrar os monopólios digitais para “proteger o mercado de transgressões”. No entanto, essas tentativas de contenção são, de fato, iniciativas voltadas apenas para proteção do “livre mercado”? A hipótese da dissertação parte da premissa de que as tentativas de conter as big techs na era Trump não buscam apenas a defesa do “livre mercado”, mas sim, supõe-se que sejam iniciativas para reparar as consequências desse livre mercado que levou à oligopolização e centralização de poder estrutural nas empresas. Assim, o objetivo do trabalho é analisar as relações entre as big techs e a política norte-americana nos governos Obama (período de expansão) e na era Trump (tentativas de contenção). 
É utilizado o método hipotético-dedutivo, congregando a análise qualitativa de documentos oficiais e literatura, além da utilização de indicadores quantitativos. ABSTRACT: The 20th century was marked by the development of Information and Communication Technologies, largely financed with public resources. Among them, the internet originated in the US as a geopolitical response to Soviet technological advancement. In the middle of the end of the Cold War, the rise of economic neoliberalism made the private sector gain influence in the sector, opening space for financial deregulation and privatization of technologies. After the commercial opening of the internet, the competitive pressure by the markets of the “new economy” provided the emergence of innovative companies and the global advancement of the internet, making us enter the era of Big Data, an era of mass data production. In recent years, an oligopoly was formed in the US to economically exploit these digital resources, which expanded during the Obama administration. After the economic expansion, the Trump era raised questions about the political power of the so-called Big Techs, marking the beginning of containment attempts by the head of the Executive and Legislative. Among the initiatives, Trump sought to undermine corporate power through executive orders, while Congress opened an antitrust investigation aimed at breaking digital monopolies to "protect the market from transgressions". However, are these attempts at containment, in fact, initiatives aimed only at protecting the “free market”? The hypothesis of the dissertation starts from the premise that the attempts to contain big techs in the Trump era do not seek only the defense of the “free market”, but rather, it is assumed that they are initiatives to repair the consequences of this free market that led to oligopolization and centralization of structural power in companies. 
Thus, the objective of the work is to analyze the relationship between big techs and US policy in the Obama administrations (expansion period) and in the Trump era (containment attempts). The hypothetical-deductive method is used, bringing together the qualitative analysis of official documents and literature, in addition to the use of quantitative indicators.
Full-text available
The recent explosive publications of big data studies have well documented the rise of big data and its ongoing prevalence. Different types of "big data" have emerged and have greatly enriched spatial information sciences and related fields in terms of breadth and granularity. Studies that were difficult to conduct in the past time due to data availability can now be carried out. However, big data brings lots of "big errors" in data quality and data usage, which cannot be used as a substitute for sound research design and solid theories. We indicated and summarized the problems faced by current big data studies with regard to data collection, processing and analysis: inauthentic data collection, information incompleteness and noise of big data, unrepresentativeness, consistency and reliability, and ethical issues. Cases of empirical studies are provided as evidences for each problem. We propose that big data research should closely follow good scientific practice to provide reliable and scientific "stories", as well as explore and develop techniques and methods to mitigate or rectify those 'big-errors' brought by big data. © 2015 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).
Full-text available
In this Exa Byte scale era, the data increases at an exponential rate. This is in turn generating a massive amount of metadata in the file system. Hadoop is the most widely used framework to deal with Big Data. But due to this growth of huge amount of metadata, the efficiency of Hadoop is questioned numerous time by many researchers. Therefore, it is essential to create an efficient and scalable metadata management for Hadoop. Hash-based mapping and subtree partitioning are suitable in distributed metadata management schemes. Subtree partitioning does not uniformly distribute workload among the metadata servers, and metadata need to be migrated to keep the load roughly balanced. Hash-based mapping suffers from a constraint on the locality of metadata, though it uniformly distributes the load among NameNodes, which is the metadata server of Hadoop. In this paper, we present a circular metadata management mechanism named Dynamic Circular Metadata Splitting (DCMS). DCMS preserves metadata locality using consistent hashing as well as locality-preserving hashing, keeps replicated metadata for excellent reliability, as well as dynamically distributes metadata among the NameNodes to keep load balancing. NameNode is a centralized heart of the Hadoop which keeps the directory tree of all files and thus failure of which causes the Single Point of Failure (SPOF). DCMS removes Hadoop's SPOF as well as provides an efficient and scalable metadata management. The new framework is named 'Dr.Hadoop' after the name of the authors.
Full-text available
In recent years, the rapid development of Internet, Internet of Things, and Cloud Computing have led to the explosive growth of data in almost every industry and business area. Big data has rapidly developed into a hot topic that attracts extensive attention from academia, industry, and governments around the world. In this position paper, we first briefly introduce the concept of big data, including its definition, features, and value. We then identify from different perspectives the significance and opportunities that big data brings to us. Next, we present representative big data initiatives all over the world. We describe the grand challenges (namely, data complexity, computational complexity, and system complexity), as well as possible solutions to address these challenges. Finally, we conclude the paper by presenting several suggestions on carrying out big data projects.
Conference Paper
This paper proposes a comprehensive critical survey on the issues of warehousing and protecting big data, which are recognized as critical challenges of emerging big data research. Indeed, both are critical aspects to be considered in order to build truly, high-performance and highly-flexible big data management systems. We report on state-of-the-art approaches, methodologies and trends, and finally conclude by providing open problems and challenging research directions to be considered by future efforts.
RAMCloud is a storage system that provides low-latency access to large-scale datasets. To achieve low latency, RAMCloud stores all data in DRAM at all times. To support large capacities (1PB or more), it aggregates the memories of thousands of servers into a single coherent key-value store. RAMCloud ensures the durability of DRAM-based data by keeping backup copies on secondary storage. It uses a uniform log-structured mechanism to manage both DRAM and secondary storage, which results in high performance and efficient memory usage. RAMCloud uses a polling-based approach to communication, bypassing the kernel to communicate directly with NICs; with this approach, client applications can read small objects from any RAMCloud storage server in less than 5μs, durable writes of small objects take about 13.5μs. RAMCloud does not keep multiple copies of data online; instead, it provides high availability by recovering from crashes very quickly (1 to 2 seconds). RAMCloud's crash recovery mechanism harnesses the resources of the entire cluster working concurrently so that recovery performance scales with cluster size.
This survey presents the concept of Big Data. Firstly, a definition and the features of Big Data are given. Secondly, the different steps of Big Data processing and the main problems encountered in big data management are described. Next, a general overview of an architecture for handling Big Data is depicted. Then, the problem of merging a Big Data architecture into an already existing information system is discussed. Finally, this survey tackles semantics (reasoning, coreference resolution, entity linking, information extraction, consolidation, paraphrase resolution, ontology alignment) in the Big Data context.
Efficient and scalable distributed metadata management is critically important to overall system performance in large-scale distributed file systems, especially in the EB-scale era. Hash-based mapping and subtree partitioning are state-of-the-art distributed metadata management schemes. Hash-based mapping evenly distributes workload among metadata servers, but it eliminates all hierarchical locality of metadata. Subtree partitioning does not uniformly distribute workload among metadata servers, and metadata needs to be migrated to keep the load roughly balanced. Distributed metadata management is relatively difficult since it has to guarantee metadata consistency. Meanwhile, scaling metadata performance is more complicated than scaling raw I/O performance, and the complexity rises further with distributed metadata. The primary goal is therefore to improve metadata management scalability while paying attention to metadata consistency. In this paper, we present a ring-based metadata management mechanism named Dynamic Ring Online Partitioning (DROP). It can preserve metadata locality using locality-preserving hashing, maintain metadata consistency, and dynamically distribute metadata across the metadata server cluster to keep the load balanced. Performance evaluation through extensive trace-driven simulations and a prototype implementation demonstrates the efficiency and scalability of DROP.
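The trade-off described in this abstract, between even load and hierarchical locality, can be made concrete with a small sketch. It is not the DROP algorithm: `hash_placement` stands for plain hash-based mapping, and `ring_placement` uses order-preserving range partitioning over path strings as a simple stand-in for locality-preserving hashing. Both function names, the server count, and the boundary keys are assumptions chosen for illustration.

```python
import hashlib
from bisect import bisect_left


def hash_placement(path, n_servers):
    """Hash-based mapping: spreads load evenly but scatters a directory's files."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_servers


def ring_placement(path, boundaries):
    """Order-preserving placement: paths that sort together land together.

    `boundaries` is a sorted list of upper-bound keys, one per server;
    the last server also takes any path beyond the final boundary.
    """
    idx = bisect_left(boundaries, path)
    return min(idx, len(boundaries) - 1)


# Hypothetical boundary keys for three metadata servers.
BOUNDARIES = ["/home/b", "/home/m", "\uffff"]

paths = [
    "/home/alice/a.txt",
    "/home/alice/b.txt",
    "/home/carol/x.txt",
    "/var/log/syslog",
]

# Map each path to a server under both schemes.
hashed = {p: hash_placement(p, 3) for p in paths}
ranged = {p: ring_placement(p, BOUNDARIES) for p in paths}
```

Under `ring_placement`, both files in `/home/alice` fall below the first boundary and share a server, so a directory listing touches one server. Under `hash_placement`, the assignment depends only on the digest, so files from the same directory may end up on different servers. A dynamic scheme would additionally move the boundaries as load shifts, which this sketch does not attempt.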