
A formal definition of Big Data based on its essential features



Purpose – The purpose of this paper is to identify and describe the most prominent research areas connected with “Big Data” and propose a thorough definition of the term. Design/methodology/approach – The authors have analysed a sizeable corpus of industry and academic articles linked with Big Data to find commonalities among the topics they treated. The authors have also compiled a survey of existing definitions with a view to generating a more solid one that encompasses most of the work happening in the field. Findings – The main themes of Big Data are: information, technology, methods and impact. The authors propose a new definition for the term that reads as follows: “Big Data is the Information asset characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.” Practical implications – The formal definition that is proposed can enable a more coherent development of the concept of Big Data, as it relies solely on the essential strands of the current state of the art and is consistent with the most popular definitions currently used. Originality/value – This is among the first structured attempts at building a convincing definition of Big Data. It also contains an original exploration of the topic in connection with library management.
De Mauro, Greco, Grimaldi (2016), A Formal Definition of Big Data Based on its Essential Features
Preproof version. Published in Library Review, Vol. 65 Iss: 3, pp. 122-135, DOI: 10.1108/LR-06-2015-0061
A Formal Definition of Big Data
Based on its Essential Features
Andrea De Mauro1, a), Marco Greco2, b) and Michele Grimaldi2, c)
1Department of Enterprise Engineering, University of Rome Tor Vergata, Via del Politecnico 1,
00133 Roma, Italy
2Department of Civil and Mechanical Engineering, University of Cassino and Southern Lazio,
Via Di Biasio 43, 03043 Cassino (FR), Italy
a) Corresponding author:
Structured Abstract
Purpose This article identifies and describes the most prominent research areas connected with ‘Big
Data’ and proposes a thorough definition of the term.
Design/Methodology/Approach We have analyzed a sizeable corpus of industry and academic
articles linked with Big Data in order to find commonalities among the topics they treated. We have
also compiled a survey of existing definitions with a view to generating a more solid one that
encompasses most of the work happening in the field.
Findings We’ve found that the main themes of Big Data are: Information, Technology, Methods and
Impact. We propose a new definition for the term that reads as follows: “Big Data is the Information
asset characterized by such a High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value.”
Practical implications The formal definition we propose can enable a more coherent development of
the concept of Big Data, as it relies solely on the essential strands of the current state of the art and is
consistent with the most popular definitions currently used.
Originality/value This is among the first structured attempts at building a convincing definition of
Big Data. It also contains an original exploration of the topic in connection with library management.
Keywords: Big Data; Analytics; Information Management; Data Processing; Business Intelligence.
Article Classification: Literature Review
At the time of writing, the term ‘Big Data’ is nearly omnipresent within articles and reports issued by
Information Technology practitioners and researchers. The pervasive nature of digital technologies
and the broad range of data-reliant applications have made this expression widespread also across
other disciplines including sociology, medicine, biology, economics, management and information
science. However, the degree of popularity of this phenomenon has not been accompanied by a
rational development of an accepted vocabulary. In fact, the term ‘Big Data’ itself has been used with
several inconsistent meanings and lacks a formal definition.
With this article we want to bring clarity to the concept of Big Data so that the main
connected “themes” are identified and a formal and widely acceptable definition is proposed. As we do
so, we also explore the connections between this topic and Library Science. In order to pursue the first
objective we analyse the most significant occurrences of the term ‘Big Data’ in both academic and
business literature, identifying the most debated topics. Then we propose a thorough definition of
Big Data by discussing existing proposals, identifying their key features and synthesizing an
expression that formally represents the essence of the phenomenon. We believe that doing so enables
a more structured development of literature in this field.
Theoretical background: the Four Themes of Big Data
The aim of this section is to describe the essential features of Big Data by means of a broad yet non-
comprehensive review of related literature. We have analyzed the abstracts of a large set of scientific
papers and assessed the nature of the most recurring words, looking for usage patterns and making
reasonable assumptions on their mutual interrelation. By doing so we recognized a subset of key
topics that can be used to depict the multifaceted nature of the phenomenon under examination. A
thorough, systematic literature review goes beyond the scope of this work and is left for future work.
We decided to use Elsevier's Scopus, a major reference database holding more than 50 million
literature entries from around 5,000 publishers. In the month of May 2014 we exported a list of 1,581
conference papers and journal articles that contained the full term ‘Big Data’ in the title or within the
list of keywords provided by the authors. We have cleaned up this initial list by removing those entries
for which the full abstract was not available, leaving us with a corpus of 1,437 records. By counting
the number of times each word appeared in those abstracts we identified the most
frequent ones. Figure 1 contains the "word cloud" of the most prevalent words in the corpus we
considered in our analysis: in this visualization, the font size of each term is directly proportional to
its relative presence within the input text. Afterwards, we have collectively reviewed the list of the
most frequent words included in the Big Data-related literature and made assumptions on their
mutual interconnection. We have then grouped together the words that were more clearly connected
with each other, from a conceptual viewpoint. By doing so we were able to recognize the existence of
four fundamental “Themes”, i.e. prevalent concepts that represent the essential components of the
subject. We have described the four themes with the following titles: 1. Information, 2. Technology,
3. Methods, 4. Impact. As a last step, we have verified that the majority of papers on Big Data
included in the list we retrieved deal with one or more of the themes we identified. In the following
paragraphs we will proceed with a description of each of the four themes and report a list of the most
representative related works.
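The counting step described above can be illustrated with a minimal sketch. It is not the tool used for the study: the record layout, the `top_words` helper and the stop-word list are assumptions introduced here for illustration only, and the three sample records are invented.

```python
import re
from collections import Counter

# Hypothetical stop-word list: the text does not say which words were excluded.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "with"}


def top_words(records, n=10):
    """Count word occurrences across all abstracts and return the n most frequent."""
    counts = Counter()
    for record in records:
        abstract = record.get("abstract")
        if not abstract:
            # Entries without a full abstract are dropped, as in the text.
            continue
        words = re.findall(r"[a-z]+", abstract.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Invented sample records standing in for the exported list of papers.
records = [
    {"title": "Paper A", "abstract": "Big data analytics for data storage"},
    {"title": "Paper B", "abstract": None},
    {"title": "Paper C", "abstract": "Data mining methods and data privacy"},
]

print(top_words(records, 3))
```

The resulting frequency list is the kind of input from which a word cloud can be drawn; grouping the words into themes remains a matter of human judgement, as stated above.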
Fig. 1 Static tag cloud visualization (or “word cloud”) of key words appearing in the abstracts of papers
related to Big Data, created through the online tool ManyEyes (Viegas et al., 2007), as appeared in (De
Mauro et al., 2015).
The fuel of Big Data: Information
The first reason behind the quick expansion of Big Data is the extent to which data is
created, shared and utilized in recent times. Digitization, i.e. the transformation of analog signals into
digital ones, reached massive popularity in the early 1990s. At that time the first commercial
Optical Character Recognition (OCR) tools were launched and paved the way for the first
"mass digitization" projects, i.e. the conversion of entire traditional libraries into machine-readable
files (Coyle, 2006). A noteworthy example of mass digitization was the Google Books Library Project
(2015), begun in 2004, whose objective was to fully digitize more than 15 million printed books held in
several college libraries, including Harvard, Stanford and Oxford.
Once signals have been converted into a digital format they can be organized into more
structured datasets. This further step, which Mayer-Schönberger and Cukier (2013) called “datafication”,
offers a unique, macro-level viewpoint for studying relevant trends and
patterns that would have been impossible to observe if all the data had stayed in an analog format. In the case of the
aforementioned Google mass digitization project, datafication started when the massive amount of
textual strings was converted into sequences of contiguous words (n-grams) for which it was possible
to track the level of occurrence over the centuries. In this way, researchers were able to find insights
on disparate fields, such as linguistics, etymology, sociology and historical epidemiology by utilizing
Google Books' datasets (Michel et al., 2011).
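As an illustration of this datafication step, the following minimal sketch splits text into n-grams and counts their occurrences per publication year. It assumes whitespace tokenization and an invented two-book corpus; the function names are hypothetical and do not reflect how the Google Books datasets were actually built.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the sequences of n contiguous words found in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts_by_year(books, n):
    """Map each publication year to a Counter of the n-grams seen in that year."""
    counts = defaultdict(Counter)
    for year, text in books:
        counts[year].update(ngrams(text, n))
    return counts


# Invented sample corpus: (publication year, text) pairs.
books = [
    (1850, "the steam engine and the steam ship"),
    (1950, "the jet engine"),
]

counts = ngram_counts_by_year(books, 2)
for year in sorted(counts):
    print(year, counts[year]["the steam"])
```

Tracking one n-gram across the years in `counts` gives the kind of occurrence curve on which the studies cited above rely.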
The Data-Information-Knowledge-Wisdom hierarchy offers an alternative view according to
which information appears as data that is structured in a way to be useful and relevant to a specific
purpose (Rowley, 2007). The identification of such purpose is a common element across all Big Data
applications we have reviewed. In this perspective, information becomes a knowledge asset that can
create value for firms (Cricelli & Grimaldi, 2008). Hence, we can conclude that information, not data,
is the fundamental fuel of the current Big Data phenomenon.
According to Prescott (2013) “library catalogues may be seen as representing an early
encounter with Big Data”. Indeed, library catalogues are also characterized by a certain level of
“heterogeneity” due to human mistakes and to the development of different standards for cataloguing
over time. Big Data methods can be used to identify the various processes of cataloguing library
assets over time and to find new inconsistencies in the data. For instance, different layers of data can
be identified in the British Library catalogue because of its progressive reshaping: Big Data
techniques can enable something like an “archaeology” of data in library catalogues.
We notice a strong connection between Big Data repositories and Digital Libraries (DLs).
According to Candela et al. (2008) DLs are “organizations which might be virtual, that comprehensively
collect, manage and preserve for the long term rich digital content, and offer to its user communities
specialized functionality on that content, of measurable quality and according to codified policies”. As
noticed by Jansen (2013) the heterogeneity of content that can be found in a digital library, ranging
from digitized versions of printed books to born-digital content, requires advanced data management
A peculiar element of complexity comes from the fact that digital content can lie on different
levels of syntactic and semantic abstraction. Big Data techniques offer enough flexibility to cope with
such intrinsically heterogeneous information assets. When the magnitude of a Digital Library in terms
of Volume, Velocity and Variety of content, user base or any other aspect requires “specialized
technologies or approaches” we enter the domain of Very Large DLs (VLDLs) (Candela et al., 2012). It
is interesting to notice that the definition of VLDLs is consistent with the one we propose for Big Data:
this suggests that Big Data technology and methods are strongly needed for the continued development
of library and information management as a discipline.
Another prominent reason for the growing availability of information is the proliferation of
personal devices connected to the Internet and equipped with digital sensors (such as cameras, audio
recorders and GPS locators). Such sensors make digitization possible while network connection lets
data be collected, transformed and, ultimately, organized as information. It was estimated that at
some point between 2008 and 2009 the number of connected devices surpassed the number of
human beings (Evans, 2011), and Gartner forecast that by 2020 there will be 26 billion devices on
earth, more than three per living person (2014). The scenario in which artificial objects equipped with
unique identifiers interact with each other to achieve common goals, without any human interaction,
goes under the name of Internet of Things, IoT (Atzori et al., 2010; Estrin et al., 2002) and represents
a promising source of Information in the age of Big Data. An increasing amount of Data will also be
generated by the collaboration among firms by means of internet-based tools (Michelino et al., 2008).
An important feature of the data that gets produced and utilized nowadays is its expanding
variety in form. Traditional alphanumerical tables are being overtaken by the growing
availability of less structured data sources, like video, pictures and text produced by humans
(Russom, 2011). The multiplicity of data types and their co-existence is one of the major challenges
associated with the handling of Big Data today (Manyika et al., 2011).
A necessary prerequisite for using Big Data: Technology
Another theme we found in Big Data literature relates to the specific technological issues that come
hand in hand with the utilization of extensive amounts of data. Dealing with Big Data at the right
speed implies computational and storage requirements that an average Information Technology
system might not be able to grant.
Hadoop is an open source framework that was specifically designed to deal with Big Data in a
satisfactory manner. The primary components of Hadoop are HDFS and MapReduce: both are
modelled on systems originally developed at Google (Ghemawat et al., 2003; Dean & Ghemawat,
2008), and Hadoop later became a standalone Apache project. HDFS (Hadoop Distributed File System)
stores data across multiple, remotely located machines, enabling them to
co-operate seamlessly towards a common computational goal (Shvachko et al., 2010). MapReduce is a
programming model intended to efficiently split operations across separate logical units (Dean &
Ghemawat, 2008).
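The MapReduce programming model can be sketched in a few lines. The example below is a single-process illustration written in plain Python, not the Hadoop API: the function names are hypothetical, and in a real cluster the map and reduce calls would run on separate machines, with the framework taking care of the grouping step.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def map_reduce(documents):
    """Run the three phases in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


result = map_reduce(["big data", "big libraries", "data data"])
print(result)
```

Each call to `map_phase` depends on one document only, and each call to `reduce_phase` on one key only: this independence is what allows the work to be split across separate logical units.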
Hadoop and MapReduce have proven to be very effective when exploring and managing
metadata of large libraries: Powell et al. (2012) propose an implementation for extraction and
matching of author names based on Big Data technology. Another example of a Big Data
implementation designed to serve very large digital libraries is PuntStore (Wang et al., 2013): this
plug-in system supports the integration of several storage and index engines in order to maximize
efficiency in making use of large collections of digital items.
Besides the complexity arising from the processing of Big Data, another fundamental
technological issue, due to the dispersed nature of machines, is its transmission. Communication
networks need to sustain bigger and faster data transfers and systems require specific benchmarking
techniques for evaluating their overall performance (Xiong et al., 2013).
An additional technological requirement linked to the usage of Big Data is the capacity to
store a greater amount of data on smaller devices. Moore's law states that the number of transistors
that can be placed on a silicon chip tends to double every 18 to 24 months, which implies that
memory storage capacity grows exponentially (2006). However, data also grows exponentially (Hilbert
& López, 2011), and the issue of storing extensive amounts of data persists as a critical technological
challenge in the age of Big Data.
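The arithmetic behind a fixed doubling period can be made explicit with a short sketch. The helper name `growth_factor` is introduced here for illustration, and the calculation assumes an idealized, perfectly regular doubling: a quantity that doubles every T months grows by a factor of 2^(12y/T) over y years.

```python
def growth_factor(years, doubling_months):
    """Factor by which a quantity grows if it doubles every `doubling_months` months."""
    return 2 ** (years * 12 / doubling_months)


# Ten years of growth at the two ends of the 18 to 24 month range.
for months in (18, 24):
    print(months, round(growth_factor(10, months), 1))
```

Under these assumptions the same decade yields roughly a hundredfold increase with an 18 month doubling period and a 32-fold increase with a 24 month one, which shows how sensitive long-term capacity estimates are to the doubling period.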
Techniques for processing Big Data: Methods
Huge amounts of data need to be processed by means of more complex methods than the usual
statistical procedures. Unfortunately, specific competence in the potential and limitations of
these techniques is not readily available in the job market at present.
Big Data Analytical Methods have been singled out by Manyika et al. (2011) and Chen et al. (2012).
They compiled a list of the most common procedures, which includes: Cluster analysis, Genetic
algorithms, Natural Language Processing, Machine learning, Neural Networks, Predictive modelling,
Regression Models, Social Network Analysis, Sentiment Analysis, Signal Processing and Data
According to Chen et al. (2012), given the current, structured data-centric business
environment, companies should invest in interdisciplinary Business Intelligence and Analytics
education, so as to cover “critical analytical and IT skills, business and domain knowledge, and
communication skills”. At the same time, a cultural change should accompany this process, involving
the company's entire population and urging its members to “efficiently manage data properly and
incorporate them into decision making processes” (Buhl et al., 2013).
New professional skills could derive from such innovative education, which would help in
training experts to assimilate various disciplines (Mayer-Schönberger & Cukier, 2013). These data
scientists can be seen as hybrid specialists, able to manage both technological knowledge and academic
research (Davenport & Patil, 2012). There is a gap in the education for this professional profile
(Manyika et al., 2011) and new productive learning subjects and methods are required for teaching
future data specialists.
In addition, it must be noted that Big Data development has turned decision
making from a static process into a dynamic one; indeed, the analysis of the relationships among the
many events derived from data has replaced the pursuit of traditional, logical connections.
It is reasonable to presume that the application of Big Data in companies,
research and university institutions could modify both decision-making rules (McAfee et al.,
2012) and the scientific method (Anderson, 2007).
Learning about the strengths and weaknesses of Big Data Methods represents an
undeniable resource for public and private institutions when carrying out strategic decision-making
processes (Boyd & Crawford, 2012). Clearly, the insights into future possibilities offered
by Big Data applications should be carefully verified, handling their high degree of complexity with
the utmost awareness.
Big Data touches our lives pervasively: Impact
Utilization and management of Big Data are impacting many fields of activities of our society. Big Data
applications have shown a consistent level of adaptability to the different requirements arising from
disparate scientific domains and industrial organizations. Problems originating in very distant areas
were sometimes solved by making use of the same techniques and data types. An example of this is
the application of correlation analysis to Google Search logs that produced insights applied to a range
of domains, from epidemiology to economics (Ginsberg et al., 2009; Askitas and Zimmermann, 2009;
Guzman, 2011).
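The kind of correlation analysis mentioned above can be sketched as follows. The Pearson coefficient is computed directly from its definition; the two weekly series are invented for illustration and are not taken from the cited studies, whose actual models are more elaborate.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: weekly share of searches for a symptom-related query
# and weekly number of reported cases over the same five weeks.
weekly_query_share = [0.8, 1.1, 1.9, 3.2, 2.4]
weekly_reported_cases = [120, 150, 260, 410, 330]

r = pearson(weekly_query_share, weekly_reported_cases)
print(round(r, 3))
```

A coefficient close to 1 indicates that the two series move together; it says nothing about why they do, which is the limitation behind the debate on correlation and causality recalled in this article.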
Big Data is also a source of concern as its rapid growth preceded the establishment of
exhaustive guidelines to protect private information (Boyd & Crawford, 2012). For example, it is
necessary to prevent any possible identification of individuals from personal data by means of anonymization
algorithms aimed at defending individual privacy.
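One simple way to reason about re-identification risk is the k-anonymity criterion, which is used here only as an illustrative example: the works cited above do not prescribe a specific algorithm. A table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The field names and records below are invented.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Check that each combination of quasi-identifier values occurs at least k times."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Invented library-loan records with generalized age and postcode values.
records = [
    {"age_band": "30-39", "postcode": "001**", "loans": 4},
    {"age_band": "30-39", "postcode": "001**", "loans": 1},
    {"age_band": "40-49", "postcode": "030**", "loans": 7},
]

print(is_k_anonymous(records, ["age_band", "postcode"], 2))
```

In this sample the third record is the only one in its group, so the table fails the check for k = 2 and that user could be singled out; further generalization or suppression of that record would be needed before release.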
Furthermore, the accessibility of information should be properly and impartially regulated in
order to avoid anticompetitive business practices (Manovich, 2012) that could reinforce dominant
positions in the market. The creation of a new digital divide among companies, driven by their
different level of access to data is a potential impediment to the progress of innovation (Boyd &
Crawford, 2012).
Big Data also impacts companies in depth, as they are forced to reconsider their organization
and all of their business processes in light of the availability of new information that could be
transformed into a competitive advantage in a data-driven market (Pearson and Wegener, 2013;
McAfee et al., 2012).
Fig. 2 Big Data themes and related topics in existing literature.
A thorough definition for Big Data
Emerging disciplines often experience a lack of agreement regarding the definition of core concepts.
Indeed, the level of consensus shown by a scientific community when defining a concept is a proxy of
the development of a discipline (Ronda-Pupo & Guerras-Martin, 2012). The quick and chaotic
evolution of Big Data literature has impeded the development of a universally and formally accepted
definition for Big Data. In fact, although several authors proposed their own definitions for Big Data
(Beyer & Laney, 2012; Dijcks, 2013; Dumbill, 2013; Mayer-Schönberger & Cukier, 2013; Schroeck et
al., 2012; Shneiderman, 2008; Suthaharan, 2014; Ward & Barker, 2013), none of these proposals has
prevented subsequent works from modifying or ignoring previous definitions and suggesting new ones (Ward &
Barker, 2013). Such lack of agreement and homogeneity, although justified by the relative youth
of Big Data as a concept, limits the proper development of the discipline.
The next sub-section reviews a non-exhaustive list of existing definitions of Big Data, and ties
them to the four themes of research previously identified in the theoretical background. The in-depth
analysis of such definitions and of their shared characteristics is meant to recognize the critical
factors that should be considered in a consensual definition of Big Data. Such a definition
should therefore be less exposed to critiques than existing ones, being based on the most crucial aspects that
have been associated so far with Big Data.
Existing definitions: a survey
The absence of a consensual definition of Big Data has often led scholars to adopt “implicit”
definitions through anecdotes, success stories, characteristics, technological features, trends or its
impact on society, firms and business processes. The existing definitions for Big Data provide very
different perspectives, denoting the chaotic state of the art. Big Data is considered in turn as a term
describing a social phenomenon, information assets, data sets, storage technologies, analytical
techniques, processes and infrastructures.
We observed that the definitions provided so far might be classified according to four groups,
depending on where the focus has been put in describing the phenomenon: I. Attributes of Data,
II. Technology needs, III. Overcoming of Thresholds, IV. Social Impact. Table I shows the reviewed
definitions, and describes their focus according to the four themes discussed in the theoretical background.
I group: Attributes of Data
The Big Data definitions pertaining to the first group list its characteristics. One of the most
popular definitions for Big Data falls within this group (Laney, 2001). Noticeably, Laney’s “3 Vs”,
underpinning the expected 3-dimensional increase in data Volume, Velocity and Variety, did not
mention Big Data explicitly. His contribution was associated with Big Data several years later (Beyer
& Laney, 2012; Zikopoulos & Eaton, 2011; Zaslavsky et al., 2013). Several authors extended the “3
Vs” model, adding other features of Big Data, such as Veracity (Schroeck et al., 2012), Value (Dijcks,
2013), Complexity and Unstructuredness (Intel, 2012; Suthaharan, 2013).
II group: Technological Needs
The definitions falling within the second group emphasize the technological needs behind the
processing of large amounts of data. Indeed, according to Microsoft, Big Data describes a process in
which “serious computing power” is applied to “seriously massive and often highly complex sets of
information” (Microsoft Research, 2013). Similarly, the National Institute of Standards and Technology
emphasizes the need for a “scalable architecture for efficient storage, manipulation, and analysis” when
defining Big Data (NBD-PWG NIST, 2014).
III group: Thresholds
Some definitions consider Big Data in terms of crossing thresholds. Dumbill (2013) proposes that data
is Big when the processing capacity of conventional database systems is exceeded and alternative
approaches are needed to process it. Fisher et al. (2012) suggest that the concept of “big” in terms of size
is linked with Moore’s Law, and consequently with the capacity of commercial storage solutions.
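The threshold view can be expressed as a simple comparison between a workload and the capacity of the system at hand. The sketch below is purely illustrative: the two dimensions, the class names and all figures are assumptions, since the definitions reviewed here give no numeric thresholds.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    volume_tb: float        # amount of data to be stored, in terabytes
    ingest_mb_per_s: float  # rate at which new data arrives


@dataclass
class SystemCapacity:
    max_volume_tb: float
    max_ingest_mb_per_s: float


def exceeds_capacity(workload, capacity):
    """Return True when the workload crosses at least one limit of the system."""
    return (
        workload.volume_tb > capacity.max_volume_tb
        or workload.ingest_mb_per_s > capacity.max_ingest_mb_per_s
    )


# Hypothetical figures for a conventional database server and two workloads.
conventional_db = SystemCapacity(max_volume_tb=10, max_ingest_mb_per_s=50)
catalogue = Workload(volume_tb=0.5, ingest_mb_per_s=1)
sensor_feed = Workload(volume_tb=2, ingest_mb_per_s=400)

print(exceeds_capacity(catalogue, conventional_db))
print(exceeds_capacity(sensor_feed, conventional_db))
```

The sketch makes one property of threshold-based definitions visible: whether a dataset counts as "big" depends on the reference system, so the same workload may cross the threshold today and fall below it once commercial capacity has grown.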
IV group: Social Impact
Finally, several definitions highlight the effect of Big Data advancement on society. Boyd and Crawford
(2012, p. 663) define big data as a cultural, technological, and scholarly phenomenon based on three
elements: Technology (i.e. the maximization of computational power and algorithmic accuracy),
Analysis (i.e. the identification of patterns on large data sets) and Mythology (i.e. the belief that large
data sets offer a superior form of intelligence, carrying an aura of truth, accuracy and objectivity).
Mayer-Schönberger and Cukier (2013) describe Big Data in terms of three main shifts in the way of
analysing information that improve our understanding of society and our organization of it. Such
shifts include: More data (i.e. all available data are used instead of a sample), More messy (i.e. even
incomplete or inaccurate data may be used instead of limiting the analysis to complete data), Correlation (i.e.
correlation becomes more important, overtaking causality as the privileged means of making decisions).
A Proposed Definition
The conjoint analysis of the existing definitions for Big Data and of the main research themes in
literature allows us to conclude that the nucleus of the concept of Big Data includes the following:
‘Volume’, ‘Velocity’ and ‘Variety’, to describe the characteristics of Information;
‘Technology’ and ‘Analytical Methods’, to describe the requirements needed to make proper use of such Information;
‘Value’, to describe the transformation of information into insights that may create economic value for companies and society.
On the whole, we argue that the definition for Big Data should refer to its nature of
‘Information asset’, a clearly identifiable entity not dependent on the field of application. Therefore, the
following consensual definition is proposed:
“Big Data is the Information asset characterized by such a High Volume, Velocity and Variety to
require specific Technology and Analytical Methods for its transformation into Value.”
Such a definition is compatible with the usage of terms such as “Big Data Technology” and
“Big Data Methods” when referring directly to the specific technology and methods cited in the main
The concept of Big Data is as popular as its meaning is nebulous. With this article we have clarified
the essential characteristics of Big Data, namely: Information, Technology, Methods and Impact. For
each of them we have offered an exploration of the main research areas, bringing meaningful
examples across multiple domains.
We have provided special focus on the impact of Big Data on information and library science:
we noticed how the collection and organization of information resources have changed approach and
methodology with the advent of Big Data, and how this is affecting the way libraries, especially digital
ones, are being managed. As already noticed in other sectors, researchers and professionals operating
in the domain of libraries will have to upgrade their own digital and analytical skills in order to
keep pace with the upcoming data-driven innovations in the field.
We have also surveyed the most popular existing definitions of Big Data and suggested a new
one that is congruent with its most prominent features: “Big Data is the Information asset
characterized by such a High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value”. A consistent utilization of this definition will
contribute to the creation of an accepted terminology that will support a coherent scientific
development of the topic.
Finally, some limitations of the present study should be considered: first, the number
of papers included in the initial list is limited in comparison with the extensive size of related
literature currently available to researchers. Second, the approach we have adopted for identifying
themes might have missed emerging or less evident topics as it is mainly based on human judgement.
Third, our survey of definitions includes 15 popular entries while many more attempts have been
presented to date and might have included other important elements to consider.
Future work includes: a more granular literature review based on quantitative methods that is
capable of identifying sub-topics for each of the themes we discussed; a summary of current best
practices and trends by industry across the four themes of Big Data identified in this article.
Anderson, C. (2007). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.
Wired, 16(7).
Askitas, N., & Zimmermann, K. F. (2009). Google Econometrics and Unemployment Forecasting.
Applied Economics Quarterly, 55(2), 107-120.
Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of Things: A survey. Computer Networks,
54(15), 2787-2805.
Beyer, M. A., & Laney, D. (2012). The Importance of “Big Data”: A Definition. Stamford, CT.
Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural,
technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662
Buhl, H. U., Röglinger, M., Moser, F., & Heidemann, J. (2013). Big Data. Business & Information
Systems Engineering, 5(2), 65-69.
Candela, L., Castelli, D., Ferro, N., Ioannidis, Y., Koutrika, G., Meghini, C., … Schuldt, H. (2007). The
DELOS Digital Library Reference Model. Foundations for Digital Libraries.
Candela, L., Manghi, P., & Ioannidis, Y. (2012). Fourth workshop on very large digital libraries: on the
marriage between very large digital libraries and very large data archives. ACM SIGMOD Record,
40(4), 61-64.
Chen, H., Chiang, R., & Storey, V. (2012). Business Intelligence and Analytics: From Big Data to Big
Impact. MIS Quarterly, 36(4), 1165-1188.
Coyle, K. (2006). Mass Digitization of Books. Journal of Academic Librarianship, 32(6), 641-645.
Cricelli, L., & Grimaldi, M. (2008). A dynamic view of knowledge and information: a stock and flow
based methodology. International Journal of Management and Decision Making, 9(6), 686-698.
Davenport, T. H., & Patil, D. J. (2012). Data Scientist: The Sexiest Job Of the 21st Century. Harvard
Business Review, 90(10), 70-76.
De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a
review of key research topics. In International Conference on Integrated Information (IC-ININFO
2014) AIP Conf. Proc. 1644 (pp. 97-104). Madrid, Spain: AIP Publishing LLC.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters.
Communications of the ACM, 51(1), 1-13.
Dijcks, J. (2013). Oracle: Big data for the enterprise. Oracle White Paper. Redwood Shores, CA: Oracle
Dumbill, E. (2013). Making Sense of Big Data. Big Data, 1(1), 1-2.
Estrin, D., Culler, D., Pister, K., & Sukhatme, G. (2002). Connecting the physical world with pervasive
networks. IEEE Pervasive Computing, 1(1), 59-69.
Evans, D. (2011). The Internet of Things - How the Next Evolution of the Internet is Changing
Everything. San Jose, CA: Cisco Systems.
Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012). Interactions with Big Data Analytics.
Gartner. (2014). Gartner Says the Internet of Things Will Transform the Data Center.
Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating
Systems Review, 37(5), 29-43.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009).
Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012-1014.
Google. (2015). Google Books Library Project - An enhanced card catalog of the world’s books.
Retrieved July 20, 2015, from
Guzman, G. (2011). Internet Search Behavior as an Economic Forecasting Tool: The Case of Inflation
Expectations. Journal of Economic and Social Measurement, 36(3), 119-167.
Hilbert, M., & López, P. (2011). The world’s technological capacity to store, communicate, and
compute information. Science, 332(6025), 60-65.
Intel IT Center. (2012). Big Data Analytics. Intel’s IT Manager Survey on How Organizations Are Using
Big Data. Intel IT Center. Santa Clara, CA: Intel Corporation.
Jansen, W., Barbera, R., Drescher, M., Fresa, A., Hemmje, M., Ioannidis, Y., … Stanchev, P. (2013). e-
Infrastructures for digital libraries... the future. Lecture Notes in Computer Science, 8092, 480
Laney, D. (2001). 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group
Research Note, (February), 1-4.
Manovich, L. (2012). Trending: The Promises and the Challenges of Big Social Data. In M. K. Gold
(Ed.), Debates in the Digital Humanities (pp. 460-475). Minneapolis, MN: University of Minnesota
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big
data: The next frontier for innovation, competition, and productivity. New York, NY: McKinsey
Global Institute.
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live,
Work and Think. London: John Murray.
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D., & Barton, D. (2012). Big data: the
management revolution. Harvard Business Review, 90(10), 61-67.
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., … Aiden, E. L. (2011).
Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176-182.
Michelino, F., Bianco, F., & Caputo, M. (2008). Internet and supply chain management: adoption
modalities for Italian firms. Management Research News, 31(5), 359-374.
Microsoft Research. (2013). The Big Bang: How the Big Data Explosion Is Changing the World.
Moore, G. E. (2006). Cramming more components onto integrated circuits, Reprinted from Electronics,
volume 38, number 8, April 19, 1965, pp.114 ff. IEEE Solid-State Circuits Newsletter, 11(5), 33
NBD-PWG NIST. (2014). NIST Big Data Public Working Group. Draft of Big Data Definition.
Pearson, T., & Wegener, R. (2013). Big Data: The organizational challenge, Bain & Company.
Powell, J. (2012). “At scale” author name matching with Hadoop/MapReduce. Library Hi Tech News,
29(4), 6-12.
Prescott, A. (2013). Bibliographic records as humanities big data. In Proceedings - 2013 IEEE
International Conference on Big Data, Big Data 2013 (pp. 55-58). IEEE.
Ronda-Pupo, G. A., & Guerras-Martin, L. A. (2012). Dynamics of the evolution of the strategy concept
1962-2008: A co-word analysis. Strategic Management Journal, 33(2), 162-188.
Rowley, J. (2007). The wisdom hierarchy: representations of the DIKW hierarchy. Journal of
Information Science, 33(2), 163-180.
Russom, P. (2011). Big data analytics. TDWI Best Practices Report, Fourth Quarter, pp. 1-35.
Schroeck, M., Shockley, R., Smart, J., Romero-Morales, D., & Tufano, P. (2012). Analytics: The real-
world use of big data. New York, NY: IBM Institute for Business Value, Said Business School.
Shneiderman, B. (2008). Extreme visualization: squeezing a billion records into a million pixels. In
Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 3-12).
Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop distributed file system. In
IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST2010 (pp. 1-10). IEEE.
Suthaharan, S. (2014). Big data classification: Problems and challenges in network intrusion
prediction with machine learning. ACM SIGMETRICS Performance Evaluation Review.
Viegas, F. B., Wattenberg, M., Van Ham, F., Kriss, J., & McKeon, M. (2007). Many Eyes: A site for
visualization at internet scale. IEEE Transactions on Visualization and Computer Graphics, 13(6),
1121-1128.
Wang, J., Zhang, Y., Gao, Y., & Xing, C. (2013). A New Plug-in System Supporting Very Large Digital
Library. In S. R. Urs, J.-C. Na, & G. Buchanan (Eds.), 15th International Conference on Asia-
Pacific Digital Libraries, ICADL 2013, Bangalore, India, December 9-11, 2013 (Lecture No, Vol.
8279, pp. 45-52). Bangalore: Springer International Publishing.
Ward, J. S., & Barker, A. (2013). Undefined By Data: A Survey of Big Data Definitions.
arXiv:1309.5821 [cs.DB].
Xiong, W., Yu, Z., Bei, Z., Zhao, J., Zhang, F., Zou, Y., … Xu, C. (2013). A characterization of big data
benchmarks. In Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013
(pp. 118-125).
Zaslavsky, A., Perera, C., & Georgakopoulos, D. (2013). Sensing as a service and big data.
arXiv:1301.0159 [cs.CY].
Zikopoulos, P., & Eaton, C. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and
Streaming Data. McGraw-Hill Osborne Media.
About the authors
Andrea De Mauro has over nine years of international experience in Data
Analytics and Project Management, working for a leading FMCG multinational
company. He holds a Master’s Degree in Electrical and Computer Engineering
from the University of Illinois at Chicago, a Master’s Degree in ICT with honours
from the Polytechnic of Turin and a diploma in Innovation from Alta Scuola
Politecnica at Milan. He is currently pursuing his PhD in Business and Economic
Engineering at “Tor Vergata” University of Rome, researching Analytics, Data
Visualisation and Decision Making. Andrea is the corresponding author and can
be contacted at
Marco Greco is an Assistant Professor at the Department of Civil and Mechanical
Engineering at the University of Cassino and Southern Lazio. Graduated with
honours in Business Engineering at the University of Rome “Tor Vergata”, he
gained the PhD degree in Business and Economic Engineering at the “Tor
Vergata” University of Rome. His research interests cover three disciplines: open
innovation, strategic management and negotiation models.
Michele Grimaldi is an Assistant Professor at the Department of Civil and
Mechanical Engineering at the University of Cassino and Southern Lazio.
Graduated with honours in Business Engineering at the University of Rome “Tor
Vergata”, he gained the PhD degree in Business and Economic Engineering at the
“Tor Vergata” University of Rome. He is a Tenured Professor of “Knowledge
Management” in the second-level MS in Business Engineering organized by the
“Tor Vergata” University of Rome and of “Intangible Assets” in the Executive
Master Business Administration organized by the “Tor Vergata” University of
Rome. He has published more than 80 papers in international journals and
conference proceedings.
TABLE I. Survey of existing definitions of Big Data. The first column indicates the conceptual focus of the
definition, namely: I. Attributes of Data, II. Technological Needs, III. Exceeding of Thresholds, IV. Social Impact.
The last four columns flag whether the definition alludes to any of the four Big Data themes identified in this
study, namely: I - Information, T - Technology, M - Methods, P - Impact.
The four characteristics defining big data are Volume, Velocity, Variety and Value.
High volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Complex, unstructured, or large amounts of
Big data is a combination of Volume, Variety, Velocity and Veracity that creates an opportunity for organizations to gain competitive advantage in today’s digitized
Can be defined using three data characteristics: Cardinality, Continuity and Complexity.
The storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.
The process of applying serious computing power, the latest in machine learning and artificial intelligence, to seriously massive and often highly complex sets of information.
Extensive datasets, primarily in the characteristics of volume, velocity and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis.
A dataset that is too big to fit on a screen.
Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
Data that cannot be handled and processed in a straightforward manner.
The data sets and analytical techniques in applications that are so large and complex that they require advanced and unique data storage, management, analysis, and visualization
Data that exceeds the processing capacity of conventional database systems.
A cultural, technological, and scholarly phenomenon that rests on the interplay of Technology, Analysis and Mythology.
Phenomenon that brings three key shifts in the way we analyze information that transform how we understand and organize society: 1. More data, 2. Messier (incomplete) data, 3. Correlation overtakes causality.