MetaData: BigData Research Evolving Across
Disciplines, Players, and Topics
Alan L. Porter
School of Public Policy,
Georgia Institute of Technology,
Atlanta, GA 30332, USA &
Search Technology, Inc.,
Atlanta, GA 30092, USA
Email: alan.porter@isye.gatech.edu
Jannik Schuehle
Department of Economics and Management,
Karlsruhe Institute of Technology
76131 Karlsruhe, Germany
Email: jannik.schuehle@student.kit.edu
Ying Huang
School of Management and Economics,
Beijing Institute of Technology,
Beijing, 100081, China
Email: huangying_work@126.com
Jan Youtie
Enterprise Innovation Institute,
Georgia Institute of Technology,
Atlanta, GA 30332, USA
Email: jy5@mail.gatech.edu
Abstract — We present a meta-analysis of BigData research
activity since 2009. Our purpose here is to present “tech
mining” (bibliometric and text analyses of research publication
abstract record sets) to provide a research landscape of who is
doing what, where, and when. Our larger purpose is to help
Forecast Innovation Pathways for big data & analytics over
the coming decade. We download 7006 research publication
abstracts from Web of Science resulting from a search
algorithm devised to recall a high percentage of core BigData
research and a moderate percentage of peripherally related
research (fair recall). We find interesting engagement of
different disciplines in BigData over time. On a national level,
the USA and China dominate these fundamental research
publications to a striking degree. Mapping topics presents
interesting evidence on what topics are emerging in this
dynamic field.
Keywords-Big Data; Tech Mining; Multidisciplinarity;
Bibliometrics
I. INTRODUCTION
Big Data (“BD”) needs no introduction to this audience.
However, reflecting on BD can be intriguingly recursive.
Our title word “MetaData” has overtones of “big,” but really
focuses on the sort of “data about the subject” which we
analyze using abstract records of scientific publications.
Analyses of metadata have blossomed as the field of
bibliometrics. An extension of that - “tech mining” -
combines counting of R&D activity categories with text
analytics to probe topical content and relationships [1, 2].
We are undertaking a tech mining study [see
Acknowledgements] whose focal area is “Big Data &
Analytics.” This paper reflects an essential step toward
understanding the “lay of the land” in BD. Our longer term
aims entail Forecasting Innovation Pathways (“FIP”) [3] for
BD. Our general approach is to do empirical analyses to
understand key players, research networks, R&D trajectories,
and target applications to help anticipate future
developments and attendant issues [4]. So, here we bring to
bear, shall we say, “large data” analytics to offer a broad
perspective on the development of the BD research arena.
BD research poses real challenges to those who aspire to
profile it [5]. Data are fundamental to many interests,
including science as it archives, draws upon prior work, and
advances frontiers. How to manage and utilize BD draws
wide attention. Data collection, storage, and networking
are rich domains. Big Data are increasingly vital across
many science and engineering domains, including biological,
biomedical, and physical sciences [6]. BD analytics pose a
wealth of exciting challenges as well [7]. Extracting value
from BD involves complex interplay of multiple tools - e.g.,
statistics, visualization, and decision support [8]. Those
tools perform diverse functions -- e.g., aggregating,
manipulating, and analyzing multiple data types.
Accordingly, they draw upon diverse fields, such as
computer science, applied math and statistics, and economics
[9]. Such cross-disciplinarity is of special interest to us in
tracking BD research knowledge diffusion patterns [10], but
it makes generation of a good search strategy difficult.
Tech mining draws upon content analysis. BD poses
essential challenges in the variety of content involved and its
intersection with analytics and human decision processes
[11]. Currently, we witness particular enthusiasm for the
potential of BD to enhance decision processes [12, 13].
Shortly, we will take note of the explosive nature of growth
in BD research.
Tech mining to generate research landscapes and
trajectories starts with a search to identify suitable data. We
rely firstly upon search within global R&D databases. Park
and Leydesdorff [5] are pioneers in doing such research on
BD. Our approach begins with conceptualizing and
operationalizing a search for BD research. Section II
describes our data search model and development of our
extraction strategy. Section III shares results of the search in
depicting BD research over time, by research area, by
discipline, and by organization. We conclude with discussion
of the implications of these “meta-findings” on BD research.
II. DATA AND METHODS
Drawing on multiple resources, we explored a variety of
search terms and approaches to obtain publication (and
patent) data pertaining to “Big Data & Analytics.” Our
roots lie in preliminary research in 2014 by one of us (Huang)
as the Lab for Knowledge Management and Data Analysis at
Beijing Institute of Technology initiated a research program
on BD (reflecting on a topic in which they also actively
participate). This has provided some familiarity with BD
practices and literature. In 2015 we were motivated to
align a new FIP study of ‘Big Data & Analytics’ with a
Technology Assessment being conducted by the US
Government Accountability Office (GAO) on “21st Century
Data.” We seek to advance our tech mining and FIP
methodology, sharing as suitable with GAO to boost our
tools’ utility to inform policy-making.
We initiated our BD data search with a conceptual
framework to guide identification and retrieval of BD
research data. A key notion is that we seek to capture a
very high percentage of core BD research (high recall),
enriched by moderate sampling of peripherally related
research (moderate precision). Toward these ends we have
progressed through a series of preliminary searches, reviews,
and refinements - i.e., a highly iterative approach, which
continues.
The gist of the search development:
(1) We began with basic searches on the single phrase,
“big data,” in multiple databases: Web of Science (WoS),
INSPEC, EI Compendex, Derwent Innovation Index (DII),
and US National Science Foundation (NSF) Awards. Using
VantagePoint desktop text analysis software
[www.theVantagePoint.com], we applied its Natural
Language Processing (NLP) routine to extract, then meld,
title and abstract NLP phrases with keywords.
(2) We reviewed those phrases along with candidates
from [10] and from literature review. We also gathered
and analyzed articles to elicit terminology from the set of
emerging journals on BD (see footnote 1). We thus generated
an expanded query.
Footnote 1: The journals include: Journal of Big Data, Big Data,
International Journal of Big Data Intelligence, Big Data & Society,
Big Data Research, Data Science and Engineering, Open Journal
of Big Data, and the American Journal of Big Data Research.
We also considered the Data Science Journal, Journal of Data
Science, and EPJ Data Science.
(3) We ran a 2nd round WoS search using 6 search terms:
big data, mega data, MapReduce, Hadoop, semi-structured
data, and unstructured data. We downloaded this search (some
2145 records) into HistCite software to examine those
publications' cited references. We identified their highly
cited references that were not captured by our search.
Again, these provide a set of terms to review for possible
enrichment of our search strategy to capture more such
research (our presumption is that papers heavily cited by a
set of “big data” papers are related). [We plan to take our
current WoS data back to HistCite to retrieve such references
that we have missed.]
(4) We determined to use a two-prong approach to enrich
the Boolean term-based search further by 1) expanding the
term set, and 2) adding a set of topics (technique-focused) to
be used to retrieve records only if they co-occur in those
records with a set of “contingent” terms (themselves relating
to big data interests).
(5) A 3rd round WoS search for 2005-2015 yielding
9323 records was investigated in some depth. That pointed
us to the main “call for papers” topics of the 2015 IEEE Big
Data Conference and Congress – a source of additional
candidate search terms.
(6) A 4th round of searching generated 19962 abstract
records from WoS. Two further rounds of search term
refinement and test retrievals led to a 6th round. Some
essential advances:
• BD’s explosive recent growth generates temporal
  sensitivities in search queries; in essence, terms may
  retrieve with high precision recently, but poor
  precision in earlier years -- hence our current data
  are limited to 2009-2015.
• We explored segmenting our contingency group into
  one keying on data size and one on techniques &
  processing associated with BD, but we found the
  latter ineffective and reassigned terms.
• We settled on a 3-group search algorithm:
  A. Group A: terms to apply as direct search terms
     (retrieving any records containing one or more)
  B. Group B: contingency terms
  C. Group C: terms that need to co-occur with a
     Group B term for us to retrieve that record
• We used an assessment protocol that sampled ~20 early and
  late year articles, tallying % relevant to BD based on
  titles, enriched by reviewing problematic record
  abstracts [along with guidance from a BD researcher].
• On many phrases, we also assessed alternative
  formulations:
  o Wild-cards -- e.g., to take singular and
    plural versions, “mine and mining,” etc.
  o Proximity -- comparing coverage and
    relevance of two terms locked in sequence
    vs. some distancing and flexibility in
    ordering (e.g., data near/2 quality)
• In the process, terms migrated among Groups A, B,
  and C, and the “outtakes” to be left out entirely.
• We also considered exclusion terms - i.e., if these
  terms also appear in search set records, we would
  remove those records as not relevant to BD. This
  remains an option for future refinement. In all, we
  examined on the order of 100 candidate terms &
  phrases, with many variants. Results reflect our
  judgment. We intend to ask colleagues from
  various industries involved with Big Data (e.g.,
  software, analysis, healthcare) to review our search
  strategy and its results (e.g., to review lists of key
  terms and key players). We anticipate that will lead
  to enrichment.
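The wild-card and proximity alternatives described above can be illustrated with a minimal Python sketch. It is not the Web of Science query engine: the function names `wildcard_to_regex` and `near` are hypothetical, and the reading of near/n as "at most n intervening words, in either order" is an assumption made for illustration.

```python
import re

def wildcard_to_regex(term):
    """Turn a truncation term such as 'analy*' into a whole-word regex."""
    stem = re.escape(term.rstrip("*"))
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(r"\b" + stem + suffix + r"\b", re.IGNORECASE)

def near(text, left, right, max_gap):
    """True if `left` and `right` occur, in either order, with at most
    `max_gap` words between them (assumed reading of near/n)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    left_pos = [i for i, w in enumerate(words) if w == left.lower()]
    right_pos = [i for i, w in enumerate(words) if w == right.lower()]
    return any(abs(i - j) - 1 <= max_gap
               for i in left_pos for j in right_pos if i != j)

# A truncated term covers several word forms.
print(bool(wildcard_to_regex("analy*").search("Big data analytics")))  # True

# Proximity accepts reordering that an exact phrase would miss.
title = "Assessing the quality of the data in sensor networks"
print("data quality" in title.lower())       # False
print(near(title, "data", "quality", 2))     # True
```

The contrast in the last two lines is the trade-off the bullet describes: a locked phrase is stricter, while a proximity match gains coverage at some risk to relevance.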
This paper reports on analyses of our 6th round BD research
publication records downloaded on April 1, 2015. We
searched and retrieved publications dating 2009-2015 from
Web of Knowledge, covering Science Citation Index
Expanded, Social Sciences Citation Index, Arts &
Humanities Citation Index, Conference Proceedings Citation
Index - Science, and Conference Proceedings Citation Index
- Social Science & Humanities, which together we label “WoS.” The
search algorithm included:
• Group A: (big data or bigdata or MapReduce or
  Hadoop or hbase or bigdata or Nosql or newsql) -
  yielding 4730 records [of which "big data" or
  bigdata account for 3051].
• Group B: (big near/1 data or huge near/1 data) or
  "massive data" or petabyte or Exabyte or Zettabyte
  or "data lake" or "massive information" or "huge
  information" or "big information" or "semi-structured
  data" or "semistructured data" or
  "unstructured data" or "streaming data". We
  excluded records from Group B that appear also in
  Group A. This leaves 2624 unique records of the
  5808 associated with the 14 Group B terms. Note
  that these are not to be used to retrieve BD records
  per se, but to be required to co-occur with a Group C
  term.
• Group C: (algorithm* or analy* or architectur* or
  automat* or "cloud computing" or "data mining" or
  "data mine" or (data near/2 mine) or (data near/2
  mining) or "data extract*" or "data visualiz*" or
  design* or detect* or "distributed comput*" or
  extract* or intelligence or "location based social
  network*" or "machine learning" or monitor* or
  "parallel comput*" or privacy or scalab* or semantic
  or "text mining"). These 24 terms generate
  5,160,147 records, reducing to 2254 that co-occur
  with Group B. We separately assessed and
  incorporated "velocity" and "volume" in Group C,
  adding 22 records.
The total retrieved is 7006 abstract records containing well-
defined data fields (e.g., authors, Web of Science Categories,
publication year, keywords).
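As a minimal sketch of the three-group rule, assuming each record has already been reduced to the set of search terms it matches, the retrieval logic can be written in a few lines of Python. The term sets below are abbreviated, the four records are hypothetical, and `is_retrieved` is an illustrative name rather than part of any tool used in the study.

```python
# Abbreviated term groups; the full lists are given in the text above.
GROUP_A = {"big data", "bigdata", "mapreduce", "hadoop", "hbase", "nosql", "newsql"}
GROUP_B = {"massive data", "petabyte", "data lake", "unstructured data", "streaming data"}
GROUP_C = {"machine learning", "privacy", "text mining", "cloud computing"}

def is_retrieved(matched_terms):
    """Apply the three-group rule to the set of terms matched in one record."""
    if matched_terms & GROUP_A:
        return True  # any Group A term retrieves the record directly
    # Otherwise a Group B term must co-occur with a Group C term.
    return bool(matched_terms & GROUP_B) and bool(matched_terms & GROUP_C)

# Hypothetical records, keyed by an invented identifier.
records = {
    "r1": {"hadoop"},
    "r2": {"streaming data", "privacy"},
    "r3": {"streaming data"},
    "r4": {"machine learning"},
}
retrieved = sorted(rid for rid, terms in records.items() if is_retrieved(terms))
print(retrieved)  # ['r1', 'r2']

# The three record counts reported in the text sum to the stated total.
group_a_records = 4730
b_and_c_records = 2254
velocity_volume_records = 22
print(group_a_records + b_and_c_records + velocity_volume_records)  # 7006
```

Records r3 and r4 are left out because a Group B or Group C term alone does not qualify a record.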
III. EXPLORATORY ANALYSES OF BD RESEARCH
We often think of research profiling in terms of
answering “Who? What? Where? and When?” Let’s start
with versions of those, blended here to build knowledge of
BD research.
The 7006 records contain 14 document types, notably:
Proceedings papers - 3844
Articles - 2494
Reviews - 143
Not surprisingly, conferences are the most active mode of
professional interchange in this arena led by computer
science. Reviews certainly offer prospects of useful
perspectives on BD and various intersecting interests.
The “Top 10” sources from which we retrieved these
articles/papers are listed in Table 1. A longer list would
offer intelligence about leading forums, intersecting
emphases (and the emphases of our search!).
TABLE I. TOP 10 SOURCES
Source | Records
2013 IEEE International Conference on Big Data | 157
2014 IEEE International Congress on Big Data (Bigdata Congress) | 84
Future Generation Computer Systems-The International Journal of Grid Computing and Escience | 56
2013 IEEE International Congress On Big Data | 48
BMC Bioinformatics | 38
Ehealth2013: Health Informatics | 38
Concurrency and Computation-Practice & Experience | 34
Expert Systems with Applications | 34
IEEE Transactions on Parallel and Distributed Systems | 33
Plos One | 33
BD research is accelerating intensely. Because our 7006-record
search is limited to 2009 onward, Figure 1 instead plots results of a
simple, single-term “big data” search in WoS.
Note: Data for 2014 are incomplete; 2015 data are left out because they are so partial.
Figure 1. Big Data Research Trend
We run cleanup and term consolidation (“ClusterSuite”)
on the 137,932 combined keywords and NLP-derived terms
& phrases, and use an algorithm to consolidate related
phrases. We run VantagePoint’s “Factor Mapping” routine
(a special form of Principal Components Analysis - PCA) on
300 terms appearing in 35 or more records. Figure 2 is the
map of terms that prominently occur, reflecting their
tendency to co-occur in records. This suggests considerable
range in the topics and fields convergent in BD.
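The preparatory steps behind such a map, namely keeping only terms that appear in enough records and tallying how often the retained terms co-occur, can be sketched in Python. This is an illustrative assumption about the general approach, not VantagePoint's Factor Mapping routine and not the PCA itself; the function names, the toy records, and the threshold of 2 (the study used 35) are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_terms(records, min_records):
    """Terms that appear in at least `min_records` distinct records."""
    doc_freq = Counter(term for terms in records for term in set(terms))
    return {term for term, n in doc_freq.items() if n >= min_records}

def cooccurrence(records, vocabulary):
    """Count records in which each pair of retained terms appears together."""
    pairs = Counter()
    for terms in records:
        kept = sorted(set(terms) & vocabulary)
        pairs.update(combinations(kept, 2))
    return pairs

# Toy records, each a list of cleaned terms (invented for illustration).
records = [
    ["hadoop", "mapreduce", "cloud computing"],
    ["hadoop", "mapreduce"],
    ["genome", "sequencing"],
    ["hadoop", "cloud computing"],
]
vocab = frequent_terms(records, min_records=2)
pairs = cooccurrence(records, vocab)
print(sorted(vocab))                      # ['cloud computing', 'hadoop', 'mapreduce']
print(pairs[("hadoop", "mapreduce")])     # 2
```

A co-occurrence table of this kind is the sort of input on which a factor or principal components analysis could then group terms that tend to appear together.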
Figure 2. Topical Factors and High-loading Terms
Figure 3. Big Data Research Across the Disciplines
At a broader level, we are curious about which
disciplines are researching BD. As an initial resource, we
can analyze the Web of Science Categories (“WCs”).
These are assigned to journals based on a combination of
cross-citation patterns and editorial judgment. They offer a
de facto standard in bibliometrics to treat disciplinary or field
participation [14]. We are particularly interested in cross-
field research knowledge transfer. For such purposes the
granularity of the WCs is effective - i.e., some 224 WCs
differentiate sub-fields. Figure 3 offers a science overlay
map [15]. For the present BD data, not surprisingly, research
is dominated by Computer Science. But the pattern of
widespread engagement is remarkable. BD research is not
bottled up in a silo!
Table 2 augments Figure 3. The variety of fields engaged
seems to offer many opportunities for research knowledge
transfer.
TABLE II. TOP 25 WEB OF SCIENCE CATEGORIES FOR BIGDATA PUBLICATIONS
Web of Science Category | Records
Computer Science, Theory & Methods | 2180
Computer Science, Information Systems | 1883
Engineering, Electrical & Electronic | 1847
Computer Science, Artificial Intelligence | 929
Computer Science, Hardware & Architecture | 804
Computer Science, Software Engineering | 779
Telecommunications | 538
Computer Science, Interdisciplinary Applications | 429
Materials Science, Multidisciplinary | 197
Mathematical & Computational Biology | 156
Optics | 155
Automation & Control Systems | 140
Information Science & Library Science | 139
Multidisciplinary Sciences | 136
Engineering, Mechanical | 135
Biotechnology & Applied Microbiology | 133
Health Care Sciences & Services | 128
Operations Research & Management Science | 127
Medical Informatics | 119
Biochemical Research Methods | 113
Statistics & Probability | 101
Engineering, Multidisciplinary | 99
Management | 96
Mathematics, Applied | 81
Remote Sensing | 77
Table 3 shows the top 10 author organizations publishing
in WoS-indexed journals. Strikingly, the top 30 are all
American (18) or Chinese (12) - amazing domination of the
field.
TABLE III. TOP ORGANIZATIONS
Author Organization
Chinese Acad Sci
Tsinghua Univ
MIT
Beijing Univ Posts & Telecommun
Univ Calif Berkeley
Huazhong Univ Sci & Technol
Stanford Univ
Harvard Univ
Natl Univ Def Technol
Northeastern Univ
Of the 7006 papers, 2180 have an American author or co-
author; 1708 have a Chinese one. Germany (347) and the UK
(344) trail. Dominance of BD fundamental scientific
research publication by these two countries is definitely
noteworthy for policy-makers and managers seeking to tap
into new knowledge.
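The country figures above count a paper once for every country among its authors, so a jointly authored US-China paper contributes to both totals. A minimal Python sketch of that counting rule follows, together with the shares implied by the reported numbers; `country_counts` and the three toy records are hypothetical, and the shares are simple arithmetic on the counts given in the text.

```python
from collections import Counter

def country_counts(records):
    """Count each record once per distinct author country (whole counting)."""
    counts = Counter()
    for countries in records:
        counts.update(set(countries))
    return counts

# Toy records listing author countries (invented for illustration).
toy = [["USA", "China"], ["USA"], ["Germany", "USA", "USA"]]
print(country_counts(toy)["USA"])  # 3

# Shares of the 7006 records implied by the counts reported in the text.
total = 7006
reported = {"USA": 2180, "China": 1708, "Germany": 347, "UK": 344}
shares = {c: round(100 * n / total, 1) for c, n in reported.items()}
print(shares)  # {'USA': 31.1, 'China': 24.4, 'Germany': 5.0, 'UK': 4.9}
```

Because co-authored papers are counted for each participating country, these shares need not sum to 100%.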
IV. DISCUSSION
Our study uses a tech mining approach to gain insight
into the current status of ‘Big Data & Analytics’ from 7006
publications retrieved from the WoS. The surge in
publication activity in the most recent years shows the
growing interest in BD among researchers around the world.
We found that the USA and China are the leading
publishers of literature on BD, far ahead of other countries.
Further study of how publication activity by different
countries is distributed over time could give
valuable information about inter-country knowledge flows.
Linked to this research question is the dissection of the
diffusion process over time of BD among disciplines. An
extension will be to study BD-related patenting, comparing
that to the behavior for other innovations. Another
extension would involve creating a roadmap of Big Data
analytics to further elucidate the trajectory of this emerging
domain. Most of the current publications are unsurprisingly
in the field of Computer Science but our mapping also shows
a significant involvement of a wide array of other disciplines,
like Health Care and Biochemistry. The map can suggest
opportunities for BD researchers to address needs of various
target sectors.
As BD touches multiple sectors in society, we are
interested in further developing our methods and exposing
them to researchers and practitioners in the field of BD.
Our Tech Mining approach uses specific search terms
and we seek feedback on how well our search targets BD
research.
The individual iterations of our search term development
involved significant labor input that potentially could be
streamlined or automated. We welcome suggestions on
ways to improve the methodology and the precision of our
dataset in our ambition to create a universal framework for
innovation pathway prediction.
We will pursue these analyses, striving to enrich
understanding of the likely progression of the field, roles of
leading players, and prospects for the future – i.e., we aim
toward Forecasting Innovation Pathways (FIP) for ‘big data
& analytics’. Interpretation of empirical analyses of BD
research demands knowledge from multiple technical and
contextual perspectives. A challenge in doing FIP is to
engage multiple actors in such interpretation. In the past,
we have organized workshops that bring stakeholders
together to discuss and assess tech mining results. We
aspire to stretch this approach for this BD study to tap
various social media to obtain opinions and valuations on
alternative developmental options, impacts, and potential
policy measures to guide such development. We aim to share
such findings with GAO as they pursue their Technology
Assessment of “21st Century Data.”
ACKNOWLEDGMENT
We acknowledge support from the US National Science
Foundation (Award #1527370 -- “Forecasting Innovation
Pathways of Big Data & Analytics”). The findings and
observations contained in this paper are those of the authors
and do not necessarily reflect the views of the National
Science Foundation.
REFERENCES
[1] A. L. Porter and S. W. Cunningham, Tech mining: exploiting new
technologies for competitive advantage vol. 29: John Wiley & Sons,
2004.
[2] Y. Guo, X. Zhou, A. L. Porter, and D. K. Robinson, "Tech mining to
generate indicators of future national technological competitiveness:
Nano-Enhanced Drug Delivery (NEDD) in the US and China,"
Technol. Forecast. Soc. Change., in press.
[3] D. K. Robinson, L. Huang, Y. Guo, and A. L. Porter, "Forecasting
Innovation Pathways (FIP) for new and emerging science and
technologies," Technol. Forecast. Soc. Change., vol. 80, pp. 267-285,
2013.
[4] A. L. Porter, S. W. Cunningham, and A. Sanz, "Advancing the
Forecasting Innovation Pathways Approach: Hybrid & Electric
Vehicles Case," Int. J. Technol. Manage., in press.
[5] H. W. Park and L. Leydesdorff, "Decomposing social and semantic
networks in emerging “big data” research," J. Informetr., vol. 7, pp.
756-765, 2013.
[6] X. D. Wu, X. Q. Zhu, G. Q. Wu, and W. Ding, "Data Mining with
Big Data," IEEE Transactions on Knowledge and Data Engineering,
vol. 26, pp. 97-107, Jan 2014.
[7] K. Kambatla, G. Kollias, V. Kumar, and A. Grama, "Trends in big
data analytics," J. Parallel. Distr. Com., vol. 74, pp. 2561-2573, Jul
2014.
[8] M. L. Berger and V. Doban, "Big data, advanced analytics and the
future of comparative effectiveness research," J. Comp. Eff. Res., vol.
3, pp. 167-176, Mar 2014.
[9] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh,
et al., "Big data: The next frontier for innovation, competition, and
productivity," McKinsey Global Institute, 2011.
[10] A. L. Porter and I. Rafols, "Is science becoming more
interdisciplinary? Measuring and mapping six research fields over
time," Scientometrics, vol. 81, pp. 719-745, 2009.
[11] S. C. Lewis, R. Zamith, and A. Hermida, "Content Analysis in an Era
of Big Data: A Hybrid Approach to Computational and Manual
Methods," J Broadcast. Electron., vol. 57, pp. 34-52, Jan 2013.
[12] A. Labrinidis and H. Jagadish, "Challenges and opportunities with big
data," Proc. of the VLDB Endowment, vol. 5, pp. 2032-2033, 2012.
[13] A. McAfee and E. Brynjolfsson, "Big Data: The Management
Revolution," Harvard Bus. Rev., vol. 90, pp. 60-+, Oct 2012.
[14] A. L. Porter, J. D. Roessner, A. S. Cohen, and M. Perreault,
"Interdisciplinary research: meaning, metrics and nurture," Res.
Evaluat., vol. 15, pp. 187-195, 2006.
[15] I. Rafols, A. L. Porter, and L. Leydesdorff, "Science overlay maps: A
new tool for research policy and library management," J. Am. Soc.
Inf. Sci. Tec., vol. 61, pp. 1871-1887, 2010.
... Concerning our second point of discussion, the value of implementing the established knowledge source to disambiguate keywords, we contrasted our results with those reported by Park and Leydesdorff (2013) and Porter et al. (2015). In Table 4 we present the results. ...
... We identified four points of difference between our keywords and those reported by Porter et al. (2015). ...
... • Keywords filtered-out by NKB (X*): The word 'genome' for instance was not present in our study. We explain this by arguing that the time covered by Porter et al. (2015) was up until April of 2015. For the 2014 period the 'Genome' reported high frequency, making it highly visible in the previous study. ...
Article
Systems to track the early stages of industrial convergence are used to understand technological and scientific developments. Keywords are considered an important indicator to detect knowledge convergence and so far, few reported methods use them. We define two objectives, first to propose a framework to detect knowledge convergence using keywords and second to test this framework by detecting analysing topics converging into 'big data'. We propose a method which uses scientific papers' author keywords as the data source and includes techniques such as word co-occurrence network analysis and established knowledge sources to disambiguate and classify keywords. We analysed scientific publications related to 'big data' for the years 2008-2016 and identified 221 keywords as a proxy of knowledge convergence and grouped them into 11 topics. Among these 11 topics, four were identified as significant adopters of big data knowledge: artificial intelligence, pattern recognition, natural language processing and data science.
... Concerning our second point of discussion, the value of implementing the established knowledge source to disambiguate keywords, we contrasted our results with those reported by Park and Leydesdorff (2013) and Porter et al. (2015). In Table 4 we present the results. ...
... We identified four points of difference between our keywords and those reported by Porter et al. (2015). ...
... • Keywords filtered-out by NKB (X*): The word 'genome' for instance was not present in our study. We explain this by arguing that the time covered by Porter et al. (2015) was up until April of 2015. For the 2014 period the 'Genome' reported high frequency, making it highly visible in the previous study. ...
... The previous literature candidate search-term list was obtained from the works of Halevi (2012), Huang et al. (2015), Park and Leydesdorff (2013), Porter et al. (2015) and Rousseau (2012). ...
... The works of Huang et al. (2015) and Porter et al. (2015) explicitly differentiated between the search-terms of the A, B and C sets. Therefore, these were assigned to the respective sets. ...
... From the application of the systematic previous literature search-term selection process, we perceived in the previous studies creating lexical queries to study big-data were unconnected, except for Porter et al. (2015) and Huang et al. (2015). The individual nature of these initiatives was reflected in that, the majority of the reviewed previous literature search-terms were repeated, and there was little advancement in the exploration of new relevant search-terms, even though the field of study was an emerging technology which is continuously changing. ...
Article
Obtaining document sets to study emerging technologies is challenging. Researchers studying emerging technologies use lexical queries, e.g., core, expanded and evolutionary, to face this challenge. Creating lexical queries requires the selection of search-terms. Manual, automatic and semi-automatic techniques can be implemented to select search-terms. The current reported processes to select search-terms can be complemented by attending two issues. One is the lack of a systematic process for the selection of search-terms from previous literature, and the second is the evaluation of candidate search-terms’ document retrieval interdependence. We propose two steps to complement the process of selecting search-terms to create lexical queries to study emerging technologies. The first step consists of a process to systematically select search-terms from previous literature. The second is an evaluation of search-terms’ document retrieval interdependence, and for its evaluation, we propose the Significance of Interception Ratio (SIR). We tested our proposed steps setting as a reference the big-data lexical query proposed by Huang et al. (Scientometrics 105:2005–2022, 2015). The tests results show that the proposed steps can complement the current automatic methods to select search-terms. The first step increased around a 24% the recall of the reference lexical query. The increase in the recall was possible because of the addition of 37 additional search-terms and the elimination of three search-terms from the reference lexical query. In the second step (application of the SIR), five search-terms from the reference lexical query were optimized, showing a slight complementary ability when selecting search-terms.
... Stage 1-To understand the technology (Big Data) we reviewed 249 reviews identified in our Web of Science search and forecasts (6,7). We used our collegial networks to help us identify 18 Big Data innovation target applications (see Table 1) and, in Stage 2, to review our empirical analyses. ...
... Stage 4-We have prepared a variety of articles and other reports (6,8,9,10) as well as enriched our matrix of 10 policy-oriented factors X 18 (or so) applications. Two issues-privacy and security-are pervasive in analyzing Big Data applications. ...
... We devised an intricate search algorithm(8) and applied variants to retrieve abstracts from each of these databases. Papers on Big Data indexed by Web of Science (WoS) made an astounding increase from 29 in 2011 to 1,544 published in 2014(6).The illustration on the next page overlays the Big Data papers indexed in WoS on a base map locating some 224 WoS categories as nodes based on one year of WoS publications. More related categories (based on cross-citations in the respective papers) appear close together. ...
Article
Full-text available
Tech Mining of R&D literature, patent and business intelligence (1) can pay off in anticipating future pathways for tech innovation in order to make better business decisions, wrote Alan L. Porter and Nils C. Newman in the Spring 201l CIMS Technology Management Report (2). After relating several business decision success stories, they announced initiation of research at Georgia Tech to develop the data analytics for exploiting the potential of Big Data to forecast innovation pathways. The article below highlights progress to date in this " Forecast Innovation Pathways " (FIP) project.
... Since 2012, a large number of Big Data-related projects have been supported by the Ministry of Science and Technology, the National Development and Reform Commission, the Ministry of Industry and Information Technology, and other central governmental departments of China. Porter and his colleagues [38] figured out that Big Data scientific publications grew dramatically in 2013 and 2014, by more than four times the number of the papers published in 2012. They further reported that the leading countries based on author location were the US and China; these two countries accounted for more than half of all Big Data publications and nearly all of the top 30 author organizations. ...
... For such purposes the granularity of the WCs is effective-i.e., some 224 WCs differentiate sub-fields, so we have applied science overlay mapping to visualize these differences [56] in Fig 6. and Fig 7. Again, not surprisingly, Big Data papers from NSF or NSFC funding are dominated by Computer Science, followed by Math Methods. But the pattern of widespread engagement is remarkable, suggesting that Big Data research is not bottled up in a silo [38]. There are plenty of Big Data papers from sponsored research related to Biomedicine Science in the case of NSF and Environment Science & Technology in the case of NSFC. ...
Article
Full-text available
How do funding agencies ramp-up their capabilities to support research in a rapidly emerging area? This paper addresses this question through a comparison of research proposals awarded by the US National Science Foundation (NSF) and the National Natural Science Foundation of China (NSFC) in the field of Big Data. Big data is characterized by its size and difficulties in capturing, curating, managing and processing it in reasonable periods of time. Although Big Data has its legacy in longstanding information technology research, the field grew very rapidly over a short period. We find that the extent of interdisciplinarity is a key aspect in how these funding agencies address the rise of Big Data. Our results show that both agencies have been able to marshal funding to support Big Data research in multiple areas, but the NSF relies to a greater extent on multi-program funding from different fields. We discuss how these interdisciplinary approaches reflect the research hot-spots and innovation pathways in these two countries.
... Two lexical queries are used on the scientific databases to obtain the document datasets for big-data and public broadcasters, respectively. Previous work has developed lexical queries and search strategies to acquire datasets that explain the knowledge behind big-data [14], [51], [52]; we borrowed the core lexical query proposed by [14] and, inspired by this work, we extended the core and expanded lexical queries and obtained an updated lexical query for big-data which covers the topic over the years 2008-2016. The lexical query is run on Web of Science to obtain the dataset for big-data, and the accepted document types are conference papers and journal articles. ...
Conference Paper
The competitive environment of the broadcasting sector is changing; under this change, public broadcasters have to adapt to remain relevant to their users. Big-data technologies play an essential part in the technological side of these changes. Our objective is to identify the public broadcasters' big-data technology trajectories in this changing environment. We propose two research questions to narrow down the objective: What are the big-data technology trajectories of public broadcasters? And what are the directions of big-data technologies proposed by the public broadcasters? As our method, we analyze the keywords of scientific papers and combine this with network analysis. We compare two datasets: big-data and public broadcasters. The big-data dataset is borrowed from previous work by the authors, which detected big-data keywords that serve as a proxy for knowledge convergence. The public broadcasters' dataset is created from the scientific publications reported by BBC and NHK. We match the big-data converging keywords to the keywords of the BBC and NHK publications and visualize their behavior over time (2008–2016). We analyze the documents linked to the shared keywords in both datasets to identify the big-data technology trajectories and propose future directions. We identified the following big-data technology trajectories: for the BBC, linked open data, recommender systems, the semantic web, and image processing; for NHK, speech recognition, metadata generation for indexing NHK's programs, and augmented reality (AR). Concerning their future, the detected trajectories are expected to be useful for broadcasters and organizations related to their value chain.
... Our search model (Fig. 1) explicitly works from initial results to enrich the search via additional terms and heavily cited papers. Our present search provides the basis of a relatively straightforward depiction of Big Data research emphases, with one interesting finding being that two countries dominate the global research (China and the US) (Porter et al. 2015). As noted, our search includes conference papers from WoS, so it provides a different profile than searches limited to WoS journals (Table 5). ...
Article
Bibliometric and “tech mining” studies depend on a crucial foundation: the search strategy used to retrieve relevant research publication records. Database searches for emerging technologies can be problematic in many respects, for example the rapid evolution of terminology, the use of common phraseology, or the extent of “legacy technology” terminology. Searching on such legacy terms may or may not pick up R&D pertaining to the emerging technology of interest. A challenge is to assess the relevance of legacy terminology in building an effective search model. Common-usage phraseology additionally confounds certain domains in which broader managerial, public interest, or other considerations are prominent. In contrast, searching for highly technical topics is relatively straightforward. In setting forth to analyze “Big Data,” we confront all three challenges: emerging terminology, common usage phrasing, and intersecting legacy technologies. In response, we have devised a systematic methodology to help identify research relating to Big Data. This methodology uses complementary search approaches, starting with a Boolean search model and subsequently employing contingency term sets to further refine the selection. The four search approaches considered are: (1) core lexical query, (2) expanded lexical query, (3) specialized journal search, and (4) cited reference analysis. Of special note here is the use of a “Hit-Ratio” that helps distinguish Big Data elements from less relevant legacy technology terms. We believe that such a systematic search development positions us to do meaningful analyses of Big Data research patterns, connections, and trajectories. Moreover, we suggest that such a systematic search approach can help formulate more replicable searches with high recall and satisfactory precision for other emerging technology studies.
Article
Acquiring product and business intelligence on the global technical market is both imperative and arduous. In this paper, a deep learning methodology is proposed to automatically extract and discover vital technical information from a large-scale news dataset. More specifically, six kinds of technical elements are first defined to provide concrete syntax information. Next, the CRF-BiLSTM approach is used to automatically extract technical entities, in which a conditional random field (CRF) layer is added on top of a bidirectional long short-term memory (BiLSTM) layer. Then, three indicators (timeliness, influence, and innovativeness) are designed to evaluate the value of intelligence comprehensively. Finally, as a case study, technical news on three military-related websites is used to illustrate the efficiency and effectiveness of the foregoing methodology, with a result of 80.82 (F-score) in comparison to four other models. In more detail, data on unmanned systems are extracted to summarize the state of the art and to track up-to-the-minute innovations and developments in this field.
Article
Recent emerging technology policies seek to diminish negative impacts while equitably and responsibly accruing and distributing benefits. Social scientists play a role in these policies, but relatively little quantitative research has been undertaken to study how social scientists inform the assessment of emerging technologies. This paper addresses this gap by examining social science research on ‘Big Data’, an emerging technology of wide interest. This paper analyzes a dataset of fields extracted from 488 social science and humanities papers written about Big Data. Our focus is on understanding the multi-dimensional nature of societal assessment by examining the references upon which these papers draw. We find that eight sub-literatures are important in framing social science research about Big Data. These results indicate that the field is evolving from general sociological considerations toward applications issues and privacy concerns. Implications for science policy and technology assessment of societal implications are discussed.
Article
We present a novel approach to visually locate bodies of research within the sciences, both at each moment of time and dynamically. This article describes how this approach fits with other efforts to locally and globally map scientific outputs. We then show how these science overlay maps help benchmark, explore collaborations, and track temporal changes, using examples of universities, corporations, funding agencies, and research topics. We address conditions of application, with their advantages, downsides and limitations. Overlay maps especially help investigate the increasing number of scientific developments and organisations that do not fit within traditional disciplinary categories. We make these tools accessible to help researchers explore the ongoing socio-cognitive transformation of science and technology systems.
Article
The forecasting innovation pathways (FIP) approach combines empirical tech mining with expert opinion. To date, FIP has been devised for relatively immature emerging technologies. This study extends the FIP methodology to work for a more advanced and complicated technology. It does so through a case analysis of hybrid and electric vehicles (HEVs). We retain the ten-step FIP process, augmenting several steps to deal with this more complex technology and technology delivery system (TDS). In particular, it is vital to address TDS sub-systems and attendant technical and market infrastructures. The key method to explore future prospects for the technology in question is an interactive workshop. Splitting into multiple workshop sub-groups proved constructive in addressing target markets and regional variations in innovation systems and policy options. The paper derives methodological suggestions to enrich FIP to address more complex technologies regarding scoping, sub-systems analyses, and ways to systematise key operations.
Article
This paper examines the structural patterns of networks of internationally co-authored SCI papers in the domain of research driven by big data and provides an empirical analysis of semantic patterns of paper titles. The results based on data collected from the DVD version of the 2011 SCI database identify the U.S. as the most central country, followed by the U.K., Germany, France, Italy, Australia, the Netherlands, Canada, and Spain, in that order. However, some countries (e.g., Portugal) with low degree centrality occupied relatively central positions in terms of betweenness centrality. The results of the semantic network analysis suggest that internationally co-authored papers tend to focus on primary technologies, particularly in terms of programming and related database issues. The results show that a combination of words and locations can provide a richer representation of an emerging field of science than the sum of the two separate representations.
Article
Massive datasets of communication are challenging traditional, human-driven approaches to content analysis. Computational methods present enticing solutions to these problems but in many cases are insufficient on their own. We argue that an approach blending computational and manual methods throughout the content analysis process may yield more fruitful results, and draw on a case study of news sourcing on Twitter to illustrate this hybrid approach in action. Careful combinations of computational and manual techniques can preserve the strengths of traditional content analysis, with its systematic rigor and contextual sensitivity, while also maximizing the large-scale capacity of Big Data and the algorithmic accuracy of computational methods.
Article
The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of "Big Data," including the recent announcement from the White House about new funding initiatives across different agencies that target research for Big Data. While the promise of Big Data is real -- for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009 -- there is no clear consensus on what Big Data is. In fact, there have been many controversial statements about Big Data, such as "Size is the only thing that matters." In this panel we will try to explore the controversies and debunk the myths surrounding Big Data.
Chapter
Tech Mining supports management of technology (MOT) decision processes. To this end, we set out 13 MOT issues that lead into 39 MOT questions. We then array some 200 candidate empirical measures and more elaborate “innovation indicators” to address those MOT issues and questions. These indicators are grounded in understanding of technological innovation processes so as to track technology life cycle, innovation context, and market prospects. The chapter presents the expert opinion approaches that complement empirical tech mining. Representation of indicators raises challenges in matching user style preferences and appropriate visualizations and delivery modes. We offer “one-pagers” as technology information products that compile information to answer a particular MOT question. This chapter also considers how scripting can expedite analyses and how results can be integrated into business decision systems.
Article
Recognizing prior research and reflection, we offer a definition of interdisciplinary research (IDR) that focuses on integration of concepts, techniques and/or data. We note that this need not entail teaming. Building upon this definition, we discuss its implications for accurate measurement. We then synthesize contextual and process factors expected to foster knowledge integration. These suggest a rich set of research questions concerning the implications for successful IDR of actions by universities, funding organizations, professional associations, and the science media, including journal editors. We seek to engage social scientists who study research practices, organizations, and policy in consideration of interdisciplinary research processes and their evaluation.
Article
“Global technological competitiveness” is widely acknowledged, but the challenge is to go beyond this recognition to develop empirical indicators of important transitions. These may concern particular technologies, the competitive position of particular organizations, or national/regional shifts. For decades, the US has been the world leader in biomedical technologies, with attendant implications for organizational priorities in terms of R&D location and market targeting. Recent years have seen a tremendous acceleration in Asian research in most domains, including biomedical, particularly visible in China. This paper investigates comparative patterns between the US and China in a promising emerging area of biotechnology — Nano-Enhanced Drug Delivery. It then explores indicators of, and implications for, future transitions at the national level — an approach we label “Forecasting Innovation Pathways.”