All content in this area was uploaded by Ying Huang on Jul 28, 2015
MetaData: BigData Research Evolving Across
Disciplines, Players, and Topics
Alan L. Porter
School of Public Policy,
Georgia Institute of Technology,
Atlanta, GA 30332, USA &
Search Technology, Inc.,
Atlanta, GA 30092, USA
Email: alan.porter@isye.gatech.edu
Jannik Schuehle
Department of Economics and Management,
Karlsruhe Institute of Technology
76131 Karlsruhe, Germany
Email: jannik.schuehle@student.kit.edu
Ying Huang
School of Management and Economics,
Beijing Institute of Technology,
Beijing, 100081, China
Email: huangying_work@126.com
Jan Youtie
Enterprise Innovation Institute,
Georgia Institute of Technology,
Atlanta, GA 30332, USA
Email: jy5@mail.gatech.edu
Abstract — We present a meta-analysis of BigData research
activity since 2009. Our purpose here is to present “tech
mining” (bibliometric and text analyses of research publication
abstract record sets) to provide a research landscape of who is
doing what, where, and when. Our larger purpose is to help
Forecast Innovation Pathways for big data & analytics over
the coming decade. We download 7006 research publication
abstracts from Web of Science resulting from a search
algorithm devised to recall a high percentage of core BigData
research and a moderate percentage of peripherally related
research (fair recall). We find interesting engagement of
different disciplines in BigData over time. On a national level,
the USA and China dominate these fundamental research
publications to a striking degree. Mapping topics presents
interesting evidence on what topics are emerging in this
dynamic field.
Keywords-Big Data; Tech Mining; Multidisciplinarity;
Bibliometrics
I. INTRODUCTION
Big Data (“BD”) needs no introduction to this audience.
However, reflecting on BD can be intriguingly recursive.
Our title word “MetaData” has overtones of “big,” but really
focuses on the sort of “data about the subject” which we
analyze using abstract records of scientific publications.
Analyses of metadata have blossomed as the field of
bibliometrics. An extension of that - “tech mining” -
combines counting of R&D activity categories with text
analytics to probe topical content and relationships [1, 2].
We are undertaking a tech mining study [see
Acknowledgements] whose focal area is “Big Data &
Analytics.” This paper reflects an essential step toward
understanding the “lay of the land” in BD. Our longer term
aims entail Forecasting Innovation Pathways (“FIP”) [3] for
BD. Our general approach is to do empirical analyses to
understand key players, research networks, R&D trajectories,
and target applications to help anticipate future
developments and attendant issues [4]. So, here we bring to
bear, shall we say, “large data” analytics to offer a broad
perspective on the development of the BD research arena.
BD research poses real challenges to those who aspire to
profile it [5]. Data are fundamental to many interests,
including science as it archives, draws upon prior work, and
advances frontiers. How to manage and utilize BD draws
wide attention. Data collection, storage, and networking
are rich domains. Big Data are increasingly vital across
many science and engineering domains, including biological,
biomedical, and physical sciences [6]. BD analytics pose a
wealth of exciting challenges as well [7]. Extracting value
from BD involves complex interplay of multiple tools - e.g.,
statistics, visualization, and decision support [8]. Those
tools perform diverse functions -- e.g., aggregating,
manipulating, and analyzing multiple data types.
Accordingly, they draw upon diverse fields, such as
computer science, applied math and statistics, and economics
[9]. Such cross-disciplinarity is of special interest to us in
tracking BD research knowledge diffusion patterns [10], but
it makes generation of a good search strategy difficult.
Tech mining draws upon content analysis. BD poses
essential challenges in the variety of content involved and its
intersection with analytics and human decision processes
[11]. Currently, we witness particular enthusiasm for the
potential of BD to enhance decision processes [12, 13].
Shortly, we will take note of the explosive nature of growth
in BD research.
Tech mining to generate research landscapes and
trajectories starts with a search to identify suitable data. We
rely firstly upon search within global R&D databases. Park
and Leydesdorff [5] are pioneers in doing such research on
BD. Our approach begins with conceptualizing and
operationalizing a search for BD research. Section II
describes our data search model and development of our
extraction strategy. Section III shares results of the search in
depicting BD research over time, by research area, by
discipline, and by organization. We conclude with discussion
of the implications of these “meta-findings” on BD research.
II. DATA AND METHODS
Drawing on multiple resources, we explored a variety of
search terms and approaches to obtain publication (and
patent) data pertaining to “Big Data & Analytics.” Our
roots lie in preliminary research in 2014 by one of us (Huang)
as the Lab for Knowledge Management and Data Analysis at
Beijing Institute of Technology initiated a research program
on BD (reflecting on a topic in which they also actively
participate). This has provided some familiarity with BD
practices and literature. In 2015 we have been motivated to
align a new FIP study of ‘Big Data & Analytics’ with a
Technology Assessment being conducted by the US
Government Accountability Office (GAO) on “21st Century
Data.” We seek to advance our tech mining and FIP
methodology, sharing as suitable with GAO to boost our
tools’ utility to inform policy-making.
We initiated our BD data search with a conceptual
framework to guide identification and retrieval of BD
research data. A key notion is that we seek to capture a
very high percentage of core BD research (high recall),
enriched by moderate sampling of peripherally related
research (moderate precision). Toward these ends we have
progressed through a series of preliminary searches, reviews,
and refinements - i.e., a highly iterative approach, which
continues.
The gist of the search development:
(1) We began with basic searches on the single phrase,
“big data,” in multiple databases: Web of Science (WoS),
INSPEC, EI Compendex, Derwent Innovation Index (DII),
and US National Science Foundation (NSF) Awards. Using
VantagePoint desktop text analysis software
[www.theVantagePoint.com], we applied its Natural
Language Processing (NLP) routine to extract, then meld,
title and abstract NLP phrases with keywords.
(2) We reviewed those phrases along with candidates
from [10], and from literature review. We also gathered
and analyzed articles to elicit terminology from the set of
emerging journals on BD (see footnote 1). We thus generated
an expanded query.
Footnote 1: The journals include: Journal of Big Data, Big Data,
International Journal of Big Data Intelligence, Big Data & Society,
Big Data Research, Data Science and Engineering, Open Journal
of Big Data, and the American Journal of Big Data Research.
We also considered the Data Science Journal, Journal of Data
Science, and EPJ Data Science.
(3) We ran a 2nd round WoS search using 6 search terms:
big data, mega data, MapReduce, Hadoop, semi-structured
data, and unstructured data. We downloaded this search (some
2145 records) into HistCite software to examine those
publications’ cited references. We identified their highly
cited references that were not captured by our search.
Again, these provide a set of terms to review for possible
enrichment of our search strategy to capture more such
research (our presumption is that papers heavily cited by a
set of “big data” papers are related). [We plan to take our
current WoS data back to HistCite to retrieve such references
that we have missed.]
(4) We determined to use a two-prong approach to enrich
the Boolean term-based search further by 1) expanding the
term set, and 2) adding a set of topics (technique-focused) to
be used to retrieve records only if they co-occur in those
records with a set of “contingent” terms (themselves relating
to big data interests).
(5) A 3rd round WoS search for 2005-2015 yielding
9323 records was investigated in some depth. That pointed
us to the main “call for papers” topics of the 2015 IEEE Big
Data Conference and Congress – a source of additional
candidate search terms.
(6) A 4th round of searching generated 19962 abstract
records from WoS. Two further rounds of search term
refinement and test retrievals led to a 6th round. Some
essential advances:
• BD’s explosive recent growth generates temporal
sensitivities in search queries; in essence, terms may
retrieve with high precision in recent years, but poor
precision in earlier years -- hence our current data
are limited to 2009-2015.
• We explored segmenting our contingency group into
one keying on data size and one on techniques &
processing associated with BD, but we found the
latter ineffective and reassigned terms.
• We settled on a 3-group search algorithm:
A. Group A - terms to apply as direct search terms
(retrieving any records containing one or more)
B. Group B - contingency terms
C. Group C - terms that need to co-occur with a
Group B term for us to retrieve that record
• We used an assessment protocol that sampled ~20 early
and late year articles, tallying % relevant to BD based on
titles, enriched by reviewing problematic record
abstracts [along with guidance from a BD researcher].
• On many phrases, we also assessed alternative
formulations:
o Wild-cards -- e.g., to take singular and
plural versions, “mine and mining,” etc.
o Proximity -- comparing coverage and
relevance of two terms locked in sequence
vs. some distancing and flexibility in
ordering (e.g., data near/2 quality)
• In the process, terms migrated among Groups A, B,
and C, and the “outtakes” to be left out entirely.
• We also considered exclusion terms - i.e., if these
terms also appear in search set records, we would
remove those records as not relevant to BD. This
remains an option for future refinement. In all, we
examined on the order of 100 candidate terms &
phrases, with many variants. Results reflect our
judgment. We intend to ask colleagues from
various industries involved with Big Data (e.g.,
software, analysis, healthcare) to review our search
strategy and its results (e.g., to review lists of key
terms and key players). We anticipate that will lead
to enrichment.
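The assessment protocol described above can be sketched as a simple per-period precision tally for one candidate term. This is only an illustration: the relevance labels below are hypothetical, not data from the study, and the function name `precision_by_period` is our own. It shows how a term might look precise in late years yet imprecise in early years.

```python
from collections import defaultdict

def precision_by_period(judgments):
    """Return the share of sampled records judged relevant, per period.

    `judgments` is an iterable of (period, is_relevant) pairs, where
    `period` is a label such as "early" or "late" and `is_relevant`
    is a reviewer's yes/no call based on the record title (or the
    abstract, for problematic cases).
    """
    relevant = defaultdict(int)
    total = defaultdict(int)
    for period, is_relevant in judgments:
        total[period] += 1
        if is_relevant:
            relevant[period] += 1
    return {period: relevant[period] / total[period] for period in total}

# Hypothetical labels for one candidate term: 10 early and 10 late titles.
sample = (
    [("early", True)] * 3 + [("early", False)] * 7
    + [("late", True)] * 9 + [("late", False)] * 1
)
shares = precision_by_period(sample)
print(shares)
```

A term with a profile like this hypothetical one would be a candidate for a contingent group rather than for direct retrieval.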
This paper reports on analyses of our 6th round BD research
publication records downloaded on April 1, 2015. We
searched and retrieved publications dating 2009-2015 from
Web of Knowledge, covering Science Citation Index
Expanded, Social Sciences Citation Index, Arts &
Humanities Citation Index, Conference Proceedings Citation
Index - Science, and Conference Proceedings Citation Index
- Social Science & Humanities, which we label “WoS.” The
search algorithm included:
• Group A: (big data or bigdata or MapReduce or
Hadoop or hbase or bigdata or Nosql or newsql),
yielding 4730 records [of which “big data” or
bigdata account for 3051].
• Group B: (big near/1 data or huge near/1 data) or
"massive data" or petabyte or Exabyte or Zettabyte
or "data lake" or "massive information" or "huge
information" or "big information" or
"semi-structured data" or "semistructured data" or
"unstructured data" or "streaming data". We
excluded records from Group B that appear also in
Group A. This leaves 2624 unique records of the
5808 associated with the 14 Group B terms. Note
that these are not to be used to retrieve BD records
per se, but to be required to co-occur with a Group C
term.
• Group C: (algorithm* or analy* or architectur* or
automat* or "cloud computing" or "data mining" or
"data mine" or (data near/2 mine) or (data near/2
mining) or "data extract*" or "data visualiz*" or
design* or detect* or "distributed comput*" or
extract* or intelligence or "location based social
network*" or "machine learning" or monitor* or
"parallel comput*" or privacy or scalab* or semantic
or "text mining"). These 24 terms generate
5,160,147 records, reducing to 2254 that co-occur
with Group B. We separately assessed and
incorporated “velocity” and “volume” in Group C,
adding 22 records.
The total retrieved is 7006 abstract records containing well-
defined data fields (e.g., authors, Web of Science Categories,
publication year, keywords).
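The retrieval rule (keep a record if it matches Group A, or if it matches both Group B and Group C) can be sketched as below. This is a minimal illustration on plain text strings, not the actual WoS query engine. The term lists are abbreviated, the helper names are our own, and the reading of near/k as "at most k intervening words, in either order" is an assumption. The final lines check that the reported counts add up to the 7006 total.

```python
import re

# Abbreviated term lists for illustration only; the full Group A, B,
# and C lists are given in the text above.
GROUP_A = ["big data", "bigdata", "mapreduce", "hadoop", "hbase", "nosql", "newsql"]
GROUP_B = ["massive data", "petabyte", "data lake", "unstructured data", "streaming data"]
GROUP_C = ["algorithm*", "analy*", "cloud computing", "machine learning", "privacy", "scalab*"]

def term_to_regex(term):
    """Compile a search term; a trailing * on a word matches any word ending."""
    parts = []
    for word in term.lower().split():
        if word.endswith("*"):
            parts.append(re.escape(word[:-1]) + r"\w*")
        else:
            parts.append(re.escape(word))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")

def matches_any(text, terms):
    """True if any term in `terms` occurs in `text` (case-insensitive)."""
    text = text.lower()
    return any(term_to_regex(term).search(text) for term in terms)

def near(text, first, second, k):
    """True if the two words occur with at most k words between them.

    Assumption: this mimics a near/k proximity operator, order-free.
    """
    words = re.findall(r"\w+", text.lower())
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) - 1 <= k for i in pos_first for j in pos_second)

def is_retrieved(text):
    """Apply the 3-group rule: A, or (B and C)."""
    in_a = matches_any(text, GROUP_A)
    in_b = (
        matches_any(text, GROUP_B)
        or near(text, "big", "data", 1)
        or near(text, "huge", "data", 1)
    )
    in_c = matches_any(text, GROUP_C)
    return in_a or (in_b and in_c)

# Reported counts: Group A records, B-and-C records, velocity/volume additions.
total_records = 4730 + 2254 + 22
print(total_records)
```

Under this rule, a title that mentions only a data-size term (Group B) is not retrieved unless it also mentions a Group C technique or concern.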
III. EXPLORATORY ANALYSES OF BD RESEARCH
We often think of research profiling in terms of
answering “Who? What? Where? and When?” Let’s start
with versions of those, blended here to build knowledge of
BD research.
The 7006 records contain 14 document types, notably:
Proceedings papers - 3844
Articles - 2494
Reviews - 143
Not surprisingly, conferences are the most active mode of
professional interchange in this arena led by computer
science. Reviews certainly offer prospects of useful
perspectives on BD and various intersecting interests.
The “Top 10” sources from which we retrieved these
articles/papers are listed in Table 1. A longer list would
offer intelligence about leading forums, intersecting
emphases (and the emphases of our search!).
TABLE I. TOP 10 SOURCES
Source | Records
2013 IEEE International Conference on Big Data | 157
2014 IEEE International Congress on Big Data (Bigdata Congress) | 84
Future Generation Computer Systems-The International Journal of Grid Computing and Escience | 56
2013 IEEE International Congress On Big Data | 48
BMC Bioinformatics | 38
Ehealth2013: Health Informatics | 38
Concurrency and Computation-Practice & Experience | 34
Expert Systems with Applications | 34
IEEE Transactions on Parallel and Distributed Systems | 33
Plos One | 33
BD research is accelerating intensely. Because our 7006-record
search is limited to 2009 onward, Figure 1 plots results of a
simple, single-term “big data” search in WoS.
Note: Data for 2014 are incomplete; 2015 data are left out because they are so partial.
Figure 1. Big Data Research Trend
We run cleanup and term consolidation (“ClusterSuite”)
on the 137,932 combined keywords and NLP-derived terms
& phrases, and use an algorithm to consolidate related
phrases. We run VantagePoint’s “Factor Mapping” routine
(a special form of Principal Components Analysis - PCA) on
300 terms appearing in 35 or more records. Figure 2 is the
map of terms that prominently occur, reflecting their
tendency to co-occur in records. This suggests considerable
range in the topics and fields convergent in BD.
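The general idea of filtering terms by record frequency and then factoring their co-occurrence can be sketched with a generic PCA. This is not VantagePoint's Factor Mapping routine, whose internals are not described here; it is a plain NumPy sketch on toy records, with a low frequency threshold standing in for the "35 or more records" cutoff. The function name `term_factors` and the toy terms are hypothetical.

```python
import numpy as np

def term_factors(records, min_records, n_factors):
    """Filter terms by record frequency, then run PCA on term occurrence.

    records: list of sets of terms, one set per publication record.
    Returns (kept_terms, loadings), where loadings has shape
    (len(kept_terms), n_factors).
    """
    vocab = sorted({term for rec in records for term in rec})
    # Binary occurrence matrix: rows are records, columns are terms.
    X = np.array([[1.0 if term in rec else 0.0 for term in vocab] for rec in records])
    # Keep only terms that appear in at least `min_records` records.
    keep = X.sum(axis=0) >= min_records
    kept_terms = [term for term, flag in zip(vocab, keep) if flag]
    X = X[:, keep]
    # Center each term column, then get principal axes from the SVD.
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    n = X.shape[0]
    # Loadings: principal axes scaled by the component standard deviations.
    loadings = vt[:n_factors].T * (s[:n_factors] / np.sqrt(n - 1))
    return kept_terms, loadings

# Toy records: two loose topic groups plus one rare term ("privacy").
records = [
    {"hadoop", "mapreduce", "cloud"},
    {"hadoop", "mapreduce"},
    {"hadoop", "mapreduce", "cloud"},
    {"genomics", "sequencing"},
    {"genomics", "sequencing", "privacy"},
    {"genomics", "sequencing"},
]
kept_terms, loadings = term_factors(records, min_records=2, n_factors=2)
for term, row in zip(kept_terms, loadings):
    print(term, np.round(row, 2))
```

Terms that tend to appear in the same records load together on a factor, which is the property a factor map visualizes.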
Figure 2. Topical Factors and High-loading Terms
Figure 3. Big Data Research Across the Disciplines
At a broader level, we are curious about which
disciplines are researching BD. As an initial resource, we
can analyze the Web of Science Categories (“WCs”).
These are assigned to journals based on a combination of
cross-citation patterns and editorial judgment. They offer a
de facto standard in bibliometrics to treat disciplinary or field
participation [14]. We are particularly interested in cross-
field research knowledge transfer. For such purposes the
granularity of the WCs is effective - i.e., some 224 WCs
differentiate sub-fields. Figure 3 offers a science overlay
map [15]. For the present BD data, not surprisingly, research
is dominated by Computer Science. But the pattern of
widespread engagement is remarkable. BD research is not
bottled up in a silo!
Table 2 augments Figure 3. The variety of fields engaged
seems to offer many opportunities for research knowledge
transfer.
TABLE II. TOP 25 WEB OF SCIENCE CATEGORIES FOR BIGDATA
PUBLICATIONS
Web of Science Category | Records
Computer Science, Theory & Methods | 2180
Computer Science, Information Systems | 1883
Engineering, Electrical & Electronic | 1847
Computer Science, Artificial Intelligence | 929
Computer Science, Hardware & Architecture | 804
Computer Science, Software Engineering | 779
Telecommunications | 538
Computer Science, Interdisciplinary Applications | 429
Materials Science, Multidisciplinary | 197
Mathematical & Computational Biology | 156
Optics | 155
Automation & Control Systems | 140
Information Science & Library Science | 139
Multidisciplinary Sciences | 136
Engineering, Mechanical | 135
Biotechnology & Applied Microbiology | 133
Health Care Sciences & Services | 128
Operations Research & Management Science | 127
Medical Informatics | 119
Biochemical Research Methods | 113
Statistics & Probability | 101
Engineering, Multidisciplinary | 99
Management | 96
Mathematics, Applied | 81
Remote Sensing | 77
Table 3 shows the top 10 author organizations publishing
in WoS-indexed journals. Strikingly, the top 30 are all
American (18) or Chinese (12) - amazing domination of the
field.
TABLE III. TOP ORGANIZATIONS
Author Organization | Records
Chinese Acad Sci | 182
Tsinghua Univ | 71
MIT | 54
Beijing Univ Posts & Telecommun | 53
Univ Calif Berkeley | 53
Huazhong Univ Sci & Technol | 52
Stanford Univ | 51
Harvard Univ | 50
Natl Univ Def Technol | 50
Northeastern Univ | 50
Of the 7006 papers, 2180 have an American author or co-
author; 1708 have a Chinese one. Germany (347) and the UK
(344) trail. Dominance of BD fundamental scientific
research publication by these two countries is definitely
noteworthy for policy-makers and managers seeking to tap
into new knowledge.
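To put these counts in proportion, the following short calculation converts them to shares of the 7006 records. It uses only the figures reported above; the variable names are our own. Because a paper can have co-authors from several countries, the shares overlap and need not sum to 100%.

```python
# Records with at least one author from each country, as reported above.
total_records = 7006
records_by_country = {"USA": 2180, "China": 1708, "Germany": 347, "UK": 344}

# Share of all retrieved records, in percent, rounded to one decimal.
# Shares overlap because internationally co-authored papers count
# once for each country involved.
shares = {
    country: round(100 * count / total_records, 1)
    for country, count in records_by_country.items()
}
for country, share in shares.items():
    print(f"{country}: {share}%")
```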
IV. DISCUSSION
Our study uses a tech mining approach to gain insight
into the current status of ‘Big Data & Analytics’ from 7006
publications retrieved from the WoS. The surge in
publication activity in the most recent years shows the
growing interest in BD among researchers around the world.
We found that the USA and China are the leading
publishers of literature on BD, far ahead of other countries.
Further study of the distribution of publication behavior by
different countries over the course of time periods could give
valuable information about inter-country knowledge flows.
Linked to this research question is the dissection of the
diffusion process over time of BD among disciplines. An
extension will be to study BD-related patenting, comparing
that to the behavior for other innovations. Another
extension would involve creating a roadmap of Big Data
analytics to further elucidate the trajectory of this emerging
domain. Most of the current publications are unsurprisingly
in the field of Computer Science but our mapping also shows
a significant involvement of a wide array of other disciplines,
like Health Care and Biochemistry. The map can suggest
opportunities for BD researchers to address needs of various
target sectors.
As BD touches multiple sectors in society, we are
interested in further developing our methods and exposing
them to researchers and practitioners in the field of BD.
Our “Tech Mining” approach uses specific search terms
and we seek feedback on how well our search targets BD
research.
The individual iterations of our search term development
involved significant labor input that potentially could be
streamlined or automated. We welcome suggestions on
ways to improve the methodology and the precision of our
dataset in our ambition to create a universal framework for
innovation pathway prediction.
We will pursue these analyses, striving to enrich
understanding of the likely progression of the field, roles of
leading players, and prospects for the future – i.e., we aim
toward Forecasting Innovation Pathways (FIP) for ‘big data
& analytics’. Interpretation of empirical analyses of BD
research demands knowledge from multiple technical and
contextual perspectives. A challenge in doing FIP is to
engage multiple actors in such interpretation. In the past,
we have organized workshops that bring stakeholders
together to discuss and assess tech mining results. We
aspire to stretch this approach for this BD study to tap
various social media to obtain opinions and valuations on
alternative developmental options, impacts, and potential
policy measures to guide such development. We aim to share
such findings with GAO as they pursue their Technology
Assessment of 21st Century data.
ACKNOWLEDGMENT
We acknowledge support from the US National Science
Foundation (Award #1527370 -- “Forecasting Innovation
Pathways of Big Data & Analytics”). The findings and
observations contained in this paper are those of the authors
and do not necessarily reflect the views of the National
Science Foundation.
REFERENCES
[1] A. L. Porter and S. W. Cunningham, Tech mining: exploiting new
technologies for competitive advantage vol. 29: John Wiley & Sons,
2004.
[2] Y. Guo, X. Zhou, A. L. Porter, and D. K. Robinson, "Tech mining to
generate indicators of future national technological competitiveness:
Nano-Enhanced Drug Delivery (NEDD) in the US and China,"
Technol. Forecast. Soc. Change., in press.
[3] D. K. Robinson, L. Huang, Y. Guo, and A. L. Porter, "Forecasting
Innovation Pathways (FIP) for new and emerging science and
technologies," Technol. Forecast. Soc. Change., vol. 80, pp. 267-285,
2013.
[4] A. L. Porter, S. W. Cunningham, and A. Sanz, "Advancing the
Forecasting Innovation Pathways Approach: Hybrid & Electric
Vehicles Case," Int. J. Technol. Manage., in press.
[5] H. W. Park and L. Leydesdorff, "Decomposing social and semantic
networks in emerging “big data” research," J. Informetr., vol. 7, pp.
756-765, 2013.
[6] X. D. Wu, X. Q. Zhu, G. Q. Wu, and W. Ding, "Data Mining with
Big Data," IEEE Transactions on Knowledge and Data Engineering,
vol. 26, pp. 97-107, Jan 2014.
[7] K. Kambatla, G. Kollias, V. Kumar, and A. Grama, "Trends in big
data analytics," J. Parallel. Distr. Com., vol. 74, pp. 2561-2573, Jul
2014.
[8] M. L. Berger and V. Doban, "Big data, advanced analytics and the
future of comparative effectiveness research," J. Comp. Eff. Res., vol.
3, pp. 167-176, Mar 2014.
[9] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh,
et al., "Big data: The next frontier for innovation, competition, and
productivity," McKinsey Global Institute, 2011.
[10] A. L. Porter and I. Rafols, "Is science becoming more
interdisciplinary? Measuring and mapping six research fields over
time," Scientometrics, vol. 81, pp. 719-745, 2009.
[11] S. C. Lewis, R. Zamith, and A. Hermida, "Content Analysis in an Era
of Big Data: A Hybrid Approach to Computational and Manual
Methods," J Broadcast. Electron., vol. 57, pp. 34-52, Jan 2013.
[12] A. Labrinidis and H. Jagadish, "Challenges and opportunities with big
data," Proc. of the VLDB Endowment, vol. 5, pp. 2032-2033, 2012.
[13] A. McAfee and E. Brynjolfsson, "Big Data: The Management
Revolution," Harvard Bus. Rev., vol. 90, pp. 60-+, Oct 2012.
[14] A. L. Porter, J. D. Roessner, A. S. Cohen, and M. Perreault,
"Interdisciplinary research: meaning, metrics and nurture," Res.
Evaluat., vol. 15, pp. 187-195, 2006.
[15] I. Rafols, A. L. Porter, and L. Leydesdorff, "Science overlay maps: A
new tool for research policy and library management," J. Am. Soc.
Inf. Sci. Tec., vol. 61, pp. 1871-1887, 2010.