IDENTIFYING EMERGING TOPICS BY COMBINING DIRECT CITATION AND CO-CITATION
Henry Small¹, Kevin W. Boyack² and Richard Klavans³
¹ hsmall@mapofscience.com
SciTech Strategies, Inc., Bala Cynwyd, PA 19004 USA
² kboyack@mapofscience.com
SciTech Strategies, Inc., Albuquerque, NM 87122 USA
³ rklavans@mapofscience.com
SciTech Strategies, Inc., Berwyn, PA 19312 USA
Abstract
We present a novel approach to identifying emerging topics in science and technology.
An existing co-citation cluster model is combined with a new method for clustering based
on direct citation links. Both methods are run across multiple years of Scopus data, and
emergent co-citation threads in a specific year are matched against the direct citation
clusters to obtain the emergent topics ranked by a difference function. The topics are
classified and characterized in various ways in order to understand the motive forces
behind their emergence, whether scientific discovery, technological innovation, or
exogenous events. Cross-sectional analysis of citation links and paper age is used to
study the process of emergence for discovery-based science topics.
Conference Topic
Research Fronts and Emerging Issues (Topic 4); Modeling the Science System, Science
Dynamics and Complex System Science (Topic 11)
Introduction
Researchers in information science have long pondered how and why scientific
topics emerge. Derek Price famously analyzed the emergence of the topic of N-
rays using a citation network represented as a matrix (1965). Eugene Garfield
studied the development of genetics by constructing a node and link citation
network that he called a historiograph (Garfield, Sher, & Torpie, 1964). Later,
co-citation clusters were used to detect emergence (Small, 1977), and more
recently co-authorship networks (Bettencourt, Kaiser, Kaur, Castillo-Chavez, &
Wojick, 2008) and direct citations (Shibata, Kajikawa, Takeda, & Matsushima,
2008) have been used for the same purpose.
Methods differ in the degree of foreknowledge used. Most rely on a case study
approach where a literature search is conducted for a specific topic expected to be
emergent, and then methods are used to verify that, in fact, emergence has
occurred. These might be termed local methods because only a literature local to
the targeted topic is used. More a priori or global approaches, in contrast, make
no assumptions about what new areas might have emerged. Global approaches are
based on a comprehensive analysis of an entire literature database by methods
such as cluster analysis using co-citation, bibliographic coupling (Boyack &
Klavans, 2010), or other methods such as topic modelling (Blei & Lafferty,
2007). An important new methodology based on simple citation links was recently
developed by Waltman and Van Eck (2012). It uses a variant of modularity
clustering and takes normalized direct citation links as input. The method
assigns papers to clusters by maximizing a function that rewards pairs of papers
in the same cluster when they are linked and penalizes pairs in the same cluster
when they are not linked. An optimization algorithm is used to maximize the
function. Interestingly, this new method turns the original local methods of
Price and Garfield into global methods with the ability to automatically break
up huge multiyear citation link databases into what are, in effect, separate
historiographs. In this paper we combine two global methodologies, direct
citation clustering and co-citation clustering, for the purpose of identifying
emerging topics in science and technology.
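The reward and penalty logic of such a quality function can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions and not the published algorithm: the function name, the `resolution` penalty parameter, and the toy link weights are all hypothetical, the exact normalization used by Waltman and Van Eck (2012) is not reproduced, and the optimization step is omitted.

```python
from itertools import combinations

def clustering_quality(links, assignment, resolution=0.1):
    """Score a cluster assignment (hypothetical, simplified quality function).

    links: dict mapping frozenset({paper_i, paper_j}) -> link weight
           (assumed to be already normalized)
    assignment: dict mapping paper -> cluster label
    resolution: assumed penalty subtracted for every same-cluster pair
    """
    quality = 0.0
    for i, j in combinations(sorted(assignment), 2):
        if assignment[i] == assignment[j]:
            # Linked pairs add their weight; unlinked pairs only pay the penalty
            weight = links.get(frozenset((i, j)), 0.0)
            quality += weight - resolution
    return quality

# Toy network: two groups of papers with no links between the groups
links = {
    frozenset(("p1", "p2")): 1.0,
    frozenset(("p2", "p3")): 1.0,
    frozenset(("p4", "p5")): 1.0,
}
split = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}
lumped = {paper: "A" for paper in split}

q_split = clustering_quality(links, split)
q_lumped = clustering_quality(links, lumped)
print(q_split > q_lumped)
```

In this toy case the assignment that separates the two unlinked groups scores higher than the one that lumps all papers together, because the lumped solution pays the penalty for many unlinked pairs.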
Methods
The co-citation method forms clusters of cited papers based on their joint citation
in an annual slice of a citation database, and assigns current papers from that
annual slice to one or more of the clusters based on their referencing patterns.
The resulting clusters tend to be small and narrowly focused at the scientific
problem level. The annual solutions are then merged to form threads which
connect clusters in adjacent year slices based on shared cited papers (Klavans &
Boyack, 2011). This merges the yearly cluster slices into a longitudinal picture.
The resulting threads can be classified by their duration. For example, possibly
emergent threads for a given year are considered to be those that begin in the
previous or current year, that is, are only one or two years old. It is then possible
to identify all papers from a given year that belong to potentially emergent
threads.
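The threading and thread-age logic described above can be sketched as follows. This is a simplified illustration: the linking rule (at least `min_shared` shared cited papers between clusters in adjacent years) and all names and data are assumptions made for the example, and the actual threading criteria are those of Klavans and Boyack (2011).

```python
def build_threads(clusters_by_year, min_shared=1):
    """Chain co-citation clusters in adjacent years that share cited papers.

    clusters_by_year: dict year -> dict cluster_id -> set of cited paper ids
    min_shared: hypothetical threshold on the number of shared cited papers
    Returns a dict mapping (year, cluster_id) -> start year of its thread.
    """
    thread_start = {}
    for year in sorted(clusters_by_year):
        previous = clusters_by_year.get(year - 1, {})
        for cid, cited in clusters_by_year[year].items():
            # Start years of all prior-year clusters linked to this cluster
            starts = [
                thread_start[(year - 1, pid)]
                for pid, prev_cited in previous.items()
                if len(cited & prev_cited) >= min_shared
            ]
            # No link to the prior year means a new thread begins here
            thread_start[(year, cid)] = min(starts) if starts else year
    return thread_start

def emergent_clusters(thread_start, year):
    """Clusters in `year` whose thread began in the current or previous year."""
    return {
        cid
        for (y, cid), start in thread_start.items()
        if y == year and year - start <= 1
    }

clusters_by_year = {
    2005: {"a": {"r1", "r2"}},
    2006: {"b": {"r2", "r3"}, "c": {"r7"}},
    2007: {"d": {"r3", "r4"}, "e": {"r7", "r8"}, "f": {"r9"}},
}
starts = build_threads(clusters_by_year)
emergent_2007 = emergent_clusters(starts, 2007)
print(sorted(emergent_2007))
```

Here cluster "d" belongs to a thread that began in 2005 and is therefore not emergent in 2007, while "e" (thread begun in 2006) and "f" (begun in 2007) are one or two years old and qualify.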
Unlike co-citation, which relies on the joint citation of earlier papers, direct
citation clustering is based simply on the citation of individual papers by each
other and finds local concentrations of citation links by maximizing a modularity
criterion. The process generates clusters that are much larger and more broadly
focused than those of the co-citation model. The resulting direct citation
networks, like the co-citation threads, are of varying duration and involve
different numbers of papers per year.
Once the co-citation threads and direct citation clusters are in hand, the task is to
select those direct citation clusters that are the most emergent in specific years.
The approach used is to count the papers in the direct citation clusters that belong
to emergent threads (one or two years old) in the co-citation model. This is done
on a year by year basis, so the direct citation clusters having the highest emergent
counts in a given year can be identified. In addition, the number of papers in a
matching direct citation cluster in a set of prior years (greater than two years prior
to the emergent year) is subtracted from the emergent year counts to avoid
selecting areas with high publication activity in prior years. This ensures that the
emergent topics are increasing in size in addition to containing many papers
belonging to emergent threads. There are of course numerous variations of
selection criteria that could be attempted, but by combining evidence from both
forms of analysis we can take advantage of the high precision of the co-citation
model and the stronger growth characteristics of the direct citation model. The
difference between the emergent year counts and the prior year counts provides a
metric on which to rank the emergent topics in a given year. We call this the
emergence differential.
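A minimal sketch of the emergence differential for a single direct citation cluster is given below. It reflects one reading of the description above: "greater than two years prior" is taken to mean publication years earlier than the emergent year minus two. The function name, argument layout, and paper counts are hypothetical.

```python
def emergence_differential(cluster_papers, emergent_paper_ids, year):
    """Emergent-thread papers in `year` minus cluster papers from earlier years.

    cluster_papers: dict paper_id -> publication year (one direct citation cluster)
    emergent_paper_ids: ids of papers assigned to emergent co-citation threads
    year: the candidate year of emergence
    """
    # Cluster papers of the emergent year that also sit in emergent threads
    matching = sum(
        1
        for pid, pub_year in cluster_papers.items()
        if pub_year == year and pid in emergent_paper_ids
    )
    # Cluster papers published more than two years before the emergent year
    prior = sum(1 for pub_year in cluster_papers.values() if pub_year < year - 2)
    return matching - prior

# Hypothetical cluster: number of papers per publication year
counts = {2003: 1, 2004: 2, 2005: 3, 2006: 4, 2007: 6}
cluster_papers = {
    f"p{year}_{k}": year for year, n in counts.items() for k in range(n)
}
# Five of the six 2007 papers belong to emergent threads
emergent_ids = {f"p2007_{k}" for k in range(5)}

differential = emergence_differential(cluster_papers, emergent_ids, 2007)
print(differential)
```

With five matching papers in 2007 and three cluster papers published before 2005, the differential is 5 - 3 = 2. A cluster with a long publication history would be pushed down the ranking even if it contained many emergent-thread papers.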
Figure 1 is an example of how a direct citation cluster is matched with emerging
co-citation threads. The topic is computed tomography angiography and the year
of emergence is 2007. The graph shows the growth in number of citing papers by
year in the direct citation cluster, superimposed on which are matching co-citation
threads which start in 2007 and hence are considered emergent. The numbers of
papers in emergent threads that match the direct citation cluster are given in the
thread boxes. Only some of the matching threads are shown. The sum of the
matching papers minus papers prior to 2005 in the direct citation cluster gives the
emergence differential.
Figure 1. Matching a direct citation cluster and emerging co-citation threads on the
topic of computed tomography angiography. The matching papers in 2007 are given
in the thread boxes. The number of papers in the direct citation cluster is above each
bar.
The data set used is a 15-year Scopus database (1996–2010) obtained under a special
arrangement with Elsevier. Direct citation clustering was carried out on this
compilation using CWTS open access software (Waltman & Van Eck, 2012).
Existing co-citation clusters and threads were also used covering the same time
period. The years 2007-2010 were selected for identification of the top 25
emerging topics. The emerging threads (one or two years old) were identified for
each year and their papers matched against the direct citation cluster papers for
the same year. The number of matching papers minus the papers in the direct
citation cluster greater than two years prior to the emergent year gave the
emergence differential which was used to rank the topics in each year. A total of
71 distinct topics were selected across the four years, 50 of which appeared in
only one year, and the remaining 21 in two or more years. Six topics were in the
top 25 for three years, but none appeared in all four years. We will focus here on
the topics for 2010 which are listed in Table 1.
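The yearly selection step can be sketched as a ranking on the emergence differential followed by a count of how often each cluster reaches the top of the list. The cluster ids and differential values below are invented for illustration, and `n` is set to 2 only to keep the example small.

```python
from collections import Counter

def top_topics(differentials, n=25):
    """Return the n cluster ids with the largest emergence differential."""
    ranked = sorted(differentials.items(), key=lambda item: item[1], reverse=True)
    return [cluster_id for cluster_id, _ in ranked[:n]]

def years_in_top(differentials_by_year, n=25):
    """Count in how many years each cluster reaches the top n."""
    counts = Counter()
    for differentials in differentials_by_year.values():
        counts.update(top_topics(differentials, n))
    return counts

# Hypothetical differentials for three clusters over two years
differentials_by_year = {
    2009: {"c1": 40, "c2": 15, "c3": -5},
    2010: {"c1": 55, "c2": 3, "c3": 20},
}
appearances = years_in_top(differentials_by_year, n=2)
print(appearances)
```

Counting appearances in this way is what separates topics that enter the list once from those that persist over several years.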
Results
The first column of Table 1 gives the rank number of the direct citation cluster
as determined by sorting on the emergence differential. A topic name is given in the
second column which is based on a manual analysis of the titles and abstracts of
2010 papers in the intersection of the direct citation and emerging co-citation
clusters. The third column labelled “type” is a categorization of the type of event
mainly responsible for the emergence. We consider three types of events:
discovery, innovation and exogenous. The categorization was made by
examination of the 2010 papers in the topic and the papers they cited.
“Discovery” refers to scientific areas where an unexpected finding is made or
fundamental knowledge is gained. An example is the first topic on the list, iron-
based high temperature superconductivity, which was a discovery of
superconductivity in a new class of materials not previously thought to be a good
candidate for superconductivity.
The “innovation” category refers to areas of technology where existing science or
technology is used to create new devices or capabilities that serve specific
purposes. An example is cognitive radio which takes a new approach to assigning
radio spectrum. The third category “exogenous” refers to factors external to
science and technology, such as natural disasters, health threats, or societal events
with major impacts such as the launch of a new web product or a government
standard. An example is the second topic on the list, the swine flu pandemic of
2009, in which the global spread of a virus mobilized the health care community
to understand and combat the disease. If an innovation or discovery topic also
involves an exogenous event, a combined code is used. For example, the flu
pandemic is considered both a discovery and exogenous because a new virus was
discovered and it was a worldwide health event. Another example is topic 18 on
crystallographic evaluation where a new software service was introduced to
validate crystal structures. It should also be clear that discovery topics can also
involve elements of technological innovation and vice versa. What is sought here
is the main catalyst of emergence.
Table 1. 2010 top 25 emerging topics. Abbreviations: r = rank; dis = discovery; inn =
innovation; exo = exogenous; year Ev = year of event; year HC = year of most cited
paper; year Em = year of first emergence; Ev to HC = time lag from event to most
cited paper; Ev to Em = time lag from event to first emergence; H = H index.
r   label                                     type     year Ev  year HC  year Em  Ev to HC  Ev to Em
1   iron-based superconductors                dis      2008     2008     2008      0         0
2   swine flu (H1N1) pandemic                 dis/exo  2009     2009     2009      0         0
3   spectrum sensing in cognitive radio       inn      2005     2005     2007      0         2
4   graphene nanosheets and nanocomposites    dis      2006     2004     2010     -2         4
5   Horava-Lifshitz quantum gravity           dis      2009     2009     2010      0         1
6   graphene oxide nanosheets                 dis      2008     2004     2010     -4         2
7   induced pluripotent stem-cells            dis      2006     2006     2008      0         2
8   MapReduce framework                       inn/exo  2007     2008     2010      1         3
9   signal recovery from compressed sensing   inn      2006     2006     2009      0         3
10  graphene transistors and optical devices  dis      2005     2004     2010     -1         5
11  zigzag graphene nanoribbons               dis      2006     2004     2009     -2         3
12  cardiovascular events in type 2 diabetes  dis/exo  2008     2008     2008      0         0
13  transformative optics                     dis      2006     2006     2009      0         3
14  spectrum allocation in cognitive radio    inn      2005     2005     2010      0         5
15  IDH1 and IDH2 mutations in cancer         dis      2009     2009     2010      0         1
16  epitaxial graphene                        dis      2006     2004     2010     -2         4
17  H1N1 pandemic and seasonal flu            dis/exo  2009     2009     2010      0         1
18  crystallographic validation               inn/exo  2009     2009     2010      0         1
19  social tagging                            inn/exo  2004     2006     2007      2         3
20  mechanical properties of graphene         dis      2008     2008     2010      0         2
21  online social networking                  inn/exo  2006     2007     2010      1         4
22  gold nanocrystals                         dis      2007     2007     2009      0         2
23  cloud computing                           inn/exo  2006     2009     2010      3         4
24  cognitive radio networks                  inn/exo  2003     2006     2010      3         7
25  metal-organic frameworks                  dis/exo  2009     2009     2009      0         0
“Discovery” was the most common category with 12 topics. The combination of
“discovery/exogenous” had four topics, and these were mostly medical such as
the flu virus or a drug trial (topic 12). “Innovation” had only three topics, for
example, a new mathematical approach to signal compression (topic 9). The
combination “innovation/exogenous” had, however, six instances, suggesting that
technology areas often have an exogenous component. Many of these
combinations were computer science oriented involving, for example, a new
programming system (topic 8) or launch of a new web service (topic 21) that
stimulates research. Overall “discovery” applied to about two-thirds of topics,
“innovation” to one-third, and about 40 percent of topics had “exogenous”
influences.
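The category counts and overall shares quoted above can be reproduced directly from the type codes in Table 1. The short sketch below transcribes the codes by rank; the variable and function names are ours.

```python
from collections import Counter

# Type codes for ranks 1 to 25, transcribed from Table 1
types = [
    "dis", "dis/exo", "inn", "dis", "dis",
    "dis", "dis", "inn/exo", "inn", "dis",
    "dis", "dis/exo", "dis", "inn", "dis",
    "dis", "dis/exo", "inn/exo", "inn/exo", "dis",
    "inn/exo", "dis", "inn/exo", "inn/exo", "dis/exo",
]

# Number of topics carrying each exact code
by_code = Counter(types)

def share(tag):
    """Fraction of topics whose code includes `tag`, alone or in combination."""
    return sum(tag in code.split("/") for code in types) / len(types)

print(by_code)
print(share("dis"), share("inn"), share("exo"))
```

The tally gives 12 "dis", 4 "dis/exo", 3 "inn" and 6 "inn/exo" topics, so discovery applies to 16 of 25 topics (64 percent), innovation to 9 of 25 (36 percent), and an exogenous component to 10 of 25 (40 percent).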
A more detailed analysis of the causative factors for emergence suggests that in
most cases the publication of a new idea is what sets the stage for the emergence.
Fifteen of the 25 topics follow this pattern. In other cases the causative event was
the launch of a technology such as cloud computing services (topic 23) or a new
data management framework from Google (topic 8). Also government actions
such as DARPA’s architecture for cognitive radio (topic 24), or the failure of a
clinical trial (topic 12) can spark new research.
The fourth column labelled “year Ev” gives the year of the event. In cases where
a specific paper is driving emergence, this is the publication year of the paper.
This year may or may not correspond to the year of the most cited paper given in
the fifth column labelled “year HC”. Citation counts are determined by collecting
all references from the 2010 papers that are in the intersection of the direct
citation cluster and the emerging co-citation threads. Hence, this count is local to
a specific set of 2010 papers and differs from the global citation count found in
Scopus. Local citation counts are used because we want to assess the importance
of the paper to the specific topic. Examples of where the most cited paper differs
from the paper that appears to have directly stimulated the topic are some of the
graphene related areas. The most cited paper for these topics is usually the
original graphene discovery paper by Novoselov and Geim (2004), while the
paper most germane to the specific graphene topic often corresponds to a less
cited paper, but usually within the top three or four.
The sixth column labelled “year Em” is the year in which the topic was observed
to emerge in the top 25 going back to 2007. Because we have generated top 25
lists for each year from 2007 to 2010, it is possible that a given topic will be in the
top 25 for multiple prior years. This is illustrated in Figure 2 which plots the rank
of topics which have appeared in the top 25 in three consecutive years from 2007
to 2010. For example, the iron-based superconductor topic was ranked first for
three consecutive years from 2008-2010, while induced pluripotent stem-cells
rose from rank 19 in 2008 to rank 7 in 2010, and social tagging fell from rank 1 in
2007 to rank 19 in 2010. Fourteen of the 25 topics in 2010 appeared in the
ranking for the first time in 2010, and it is likely that several of these topics will
fall out of the top 25 ranking in 2011.
The seventh and eighth columns labelled “Ev to HC” and “Ev to Em” give two
time lags of interest: the time lag from the emergence event to publication of the
most cited paper, and the lag from the event to the year of first emergence. In the
former, lags will be positive if the most cited paper is published after the
emergence event and negative if the most cited paper precedes the key event. The
negative time lags are due to the graphene discovery paper being published prior
to the highly cited paper closest to the topic in content. Positive time lags tend to
be associated with exogenous stimuli, such as a software system, web products, or
government standards that stimulate research and result in highly cited papers at
later dates. Across all topics, the average lag from event to most cited paper is
near zero. The second type of lag shown in the column labelled “Ev to Em” is
more a measure of our system’s ability to detect emergence at an early stage.
Large positive lags indicate a delay in detection, and there are no negative lags.
The average delay in detection across the 25 topics is 2.5 years, and the largest
lags include both discovery and innovation cases where delays may be due to
technical or conceptual problems, as was possibly the case with some of the
graphene topics which were technically difficult.
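Both lags are simple differences of the year columns in Table 1, and the averages quoted above can be checked with a few lines of Python. The tuples below transcribe (year Ev, year HC, year Em) by rank; the variable names are ours.

```python
# (year Ev, year HC, year Em) for ranks 1 to 25, transcribed from Table 1
rows = [
    (2008, 2008, 2008), (2009, 2009, 2009), (2005, 2005, 2007),
    (2006, 2004, 2010), (2009, 2009, 2010), (2008, 2004, 2010),
    (2006, 2006, 2008), (2007, 2008, 2010), (2006, 2006, 2009),
    (2005, 2004, 2010), (2006, 2004, 2009), (2008, 2008, 2008),
    (2006, 2006, 2009), (2005, 2005, 2010), (2009, 2009, 2010),
    (2006, 2004, 2010), (2009, 2009, 2010), (2009, 2009, 2010),
    (2004, 2006, 2007), (2008, 2008, 2010), (2006, 2007, 2010),
    (2007, 2007, 2009), (2006, 2009, 2010), (2003, 2006, 2010),
    (2009, 2009, 2009),
]

# Lag from the event to the most cited paper, and from the event to first emergence
ev_to_hc = [hc - ev for ev, hc, _ in rows]
ev_to_em = [em - ev for ev, _, em in rows]

mean_ev_to_hc = sum(ev_to_hc) / len(rows)
mean_ev_to_em = sum(ev_to_em) / len(rows)
print(mean_ev_to_hc, mean_ev_to_em)
```

The event-to-most-cited-paper lags sum to -1 over the 25 topics, giving a mean of -0.04 years, and the event-to-emergence lags sum to 62, giving a mean of 2.48 years, which rounds to the 2.5 years stated above.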
Figure 2. Change in rank of topics in top 25 that appear in three or more years 2007-2010.
The last column labelled “H” gives the H index, the largest number N such that N
papers are each cited at least N times. This indicates the number and citedness of highly cited papers
in the topic. The data suggest that low H values are associated with topics which
are driven by exogenous events, such as swine flu, cloud computing, and social
tagging. As one would expect, the H indexes are higher for topics associated with
specific discovery or innovation papers. The highest H index is for iron-based
superconductivity (topic 1), clearly a discovery based topic, while the lowest is
online social networking (topic 21) which is focused on analyses of data from
social network services such as Twitter and Facebook.
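For reference, the H index of a set of citation counts can be computed as in the sketch below. The function name is ours, and the example input is the 2007 "#cites" column of Table 2 for the iron-based superconductivity cluster.

```python
def h_index(citation_counts):
    """Largest N such that N papers each have at least N citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position
        else:
            # Counts are sorted, so no later paper can satisfy the condition
            break
    return h

# Citation counts of the ten most cited papers in 2007 (Table 2)
h_2007 = h_index([4, 3, 3, 2, 2, 2, 2, 2, 2, 2])
print(h_2007)
```

Three papers have at least three citations but only one has at least four, so the result is 3, in agreement with the H value reported for 2007 in Table 2.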
The topics were also coded for indications of any practical applications that
researchers hoped to achieve. Interestingly all of the topics, with the exception of
quantum gravity (topic 5), foresaw some type of practical application. About half
the topics envisioned specific devices or physical products, while the other half
anticipated improvements in services, for example, health care or software.
Validation
In the absence of a definitive list of emerging topics against which to evaluate this
list, we fall back on other types of evidence to corroborate that the topics are of
current importance, such as awards to authors of most cited papers or recognition
in the science press. The awards should be relevant to the topics and post-date the
highly cited work in question. Two Nobel Prizes were related to the topics, one
for graphene awarded to Novoselov and Geim in 2010 (topics 4, 6, 10, 11, 16,
20), and another to Shinya Yamanaka in 2012 for induced pluripotent stem-cells
(topic 7). Graphene was also named a runner-up to “Breakthrough of the Year”
by Science in 2009. Both graphene and induced pluripotent stem-cells have been
the object of recent bibliometric studies (Chen, Hu, Liu, & Tseng, 2012; Shapira,
Youtie, & Arora, 2012; Shibata, Kajikawa, Takeda, Sakata, & Matsushima,
2010).
Other highly cited authors also received recognition. In 2009 Hideo Hosono
received the Bernd T. Matthias Prize for his discovery of iron-based high
temperature superconductivity (topic 1), and in 2008 the topic was named a
runner-up to “Breakthrough of the Year” by Science. Sir John Pendry was
awarded the UNESCO-Niels Bohr gold medal in 2009 and the 2010 Willis E.
Lamb Award for Laser Science and Quantum Optics for his work on
transformative optics and meta-materials (topic 13). In 2008 David Donoho
received the IEEE Information Theory Society Paper Award for his work on
compressed sensing (topic 9), an award he shared with the author of the second
most cited paper in the topic, Emmanuel Candes. In 2010 Anthony Spek received
the Kenneth Trueblood award for his work in chemical crystallography and
crystallographic computing (topic 18). In addition, the swine flu virus (topics 2
and 17) was named “virus of the year” by Science in 2009, and in 2008 IDH1 and
IDH2 mutations in cancer (topic 15) was named a runner-up to “Breakthrough of
the Year” by Science.
While this search for awards is necessarily incomplete, it provides evidence that
at least some of the topics and their highly cited authors have received recent
recognition for work that has topical relevance.
Citations during emergence
To gain a better understanding of the process of emergence, the pattern of
citations was examined during the period of emergence for the first ranked topic –
iron-based superconductivity. The analysis is based on all citation links extracted
from the direct citation cluster for this topic. In this case a specific discovery
paper had appeared in 2008 which was critical to the topic. The procedure was to
make annual time slices into the citation network and compute the most cited
papers in each year.
Table 2 gives the ten most cited papers for each of three years, 2007-2009, which
span the year of emergence, 2008. We use letter codes to identify the papers and
also show the age of the cited papers with respect to the citing year. The
discovery paper is indicated by an asterisk, and the letter code for the paper is
underlined if the paper continues from the prior year.
First we observe a dramatic increase in the H index across the time slices
coinciding with the appearance of the discovery paper at the top of the ranking in
2008 when H goes from 3 to 30. Of course, this goes hand in hand with a rapid
increase in the number of papers and citations in the direct citation cluster.
Second we see a decrease in the age of the cited papers. In the year of emergence
the top seven papers have an age of 0, that is, were published in the citing year.
Third we see a low continuity of cited papers prior to emergence and a high
continuity of cited papers following emergence. Of course, high post-emergence
continuity leads to an aging of the highly cited work, which will continue unless
new papers become highly cited.
Table 2. Iron-based superconductivity top 10 papers by year during emergence
showing paper age, citations and continuity.
2007                      2008                      2009
Cited paper  age  #cites  Cited paper  age  #cites  Cited paper  age  #cites
A              1       4  K*             0     277  _K_*           1     517
B             12       3  L              0     140  T              1     275
C              1       3  M              0     132  _L_            1     258
D              4       2  N              0     106  _M_            1     235
E             12       2  O              0     104  U             14     202
F             12       2  P              0      96  _O_            1     193
G              6       2  Q              0      93  _N_            1     169
H              6       2  R             13      84  _Q_            1     166
I              5       2  S             13      79  _P_            1     143
J              5       2  _C_            2      79  V             14     131
H=3                       H=30                      H=51
_X_ (underline) = continuing from previous year
* = discovery paper
This suggests that the discovery event was sufficiently persuasive to immediately
dominate the community, stimulate a new crop of compelling findings and carry
this interest forward in time. We do not know yet whether this pattern holds for
other topics in the list, particularly those that are not so clearly associated with
specific discovery papers. Nevertheless the results suggest a general pattern which
might hold for discovery-based science where the combined factors of citedness,
age, and continuity are important indicators.
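The age and continuity observations can be quantified from the letter codes and ages in Table 2. The sketch below uses two simple summary measures chosen for illustration, the median age of the top ten cited papers and the fraction of them carried over from the previous year's list. These measures are assumptions of the example and not a proposed emergence index.

```python
from statistics import median

# Top ten cited papers per citing year from Table 2: (letter code, age)
top10 = {
    2007: [("A", 1), ("B", 12), ("C", 1), ("D", 4), ("E", 12),
           ("F", 12), ("G", 6), ("H", 6), ("I", 5), ("J", 5)],
    2008: [("K", 0), ("L", 0), ("M", 0), ("N", 0), ("O", 0),
           ("P", 0), ("Q", 0), ("R", 13), ("S", 13), ("C", 2)],
    2009: [("K", 1), ("T", 1), ("L", 1), ("M", 1), ("U", 14),
           ("O", 1), ("N", 1), ("Q", 1), ("P", 1), ("V", 14)],
}

def continuity(previous, current):
    """Fraction of the current top ten that was also in the previous top ten."""
    previous_codes = {code for code, _ in previous}
    return sum(code in previous_codes for code, _ in current) / len(current)

median_age = {
    year: median(age for _, age in papers) for year, papers in top10.items()
}
carry_2008 = continuity(top10[2007], top10[2008])
carry_2009 = continuity(top10[2008], top10[2009])
print(median_age, carry_2008, carry_2009)
```

On these data the median age of the top cited papers drops from 5.5 years in 2007 to 0 in 2008 and is 1 in 2009, while continuity rises from one paper in ten carried into 2008 to seven in ten carried into 2009.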
Discussion
Despite the fact that citation data are often regarded as biased toward science, we
are struck by how strongly technology-based topics are represented. These topics
were generally categorized as innovation. Eight of the topics are clearly
technology-based, and a number of other more science-based areas such as
epitaxial graphene, metal-organic frameworks and transformative optics have
important technological components. Five of the technology topics are oriented
toward computer science, and their appearance possibly reflects the strong
representation of this subject in the Scopus database.
Since one factor in our detection methodology is growth in the direct citation
network, we could ask whether the topics identified are prone to bandwagon
effects. Such a tendency could be the result of an availability of a large pool of
researchers with adequate support to be able to rapidly exploit a new finding.
Such might be the case, for example, with the high temperature superconductivity
community within materials science and applied physics. Another way to pose
this question is to ask why we do not see more topics in basic physics, chemistry,
and biology, and whether such topics may have less dramatic growth
characteristics. Perhaps varying the selection parameters for matching direct
citation clusters and co-citation threads would give a stronger representation of
these disciplines.
Another feature of the list that requires further research is the repetition of topics
within the top 25, such as the appearance of six graphene related topics and three
on cognitive radio. It is perhaps not surprising that a material of such practical
and theoretical interest as graphene should have such a strong representation. It is
usually possible to draw subtle distinctions between the various subtopics dealing
with graphene, and these distinctions are usually apparent in the citing papers as
well as a different mix of highly cited papers. The most likely explanation for
this repetition is an overly granular setting of the underlying direct citation
clustering parameters, or perhaps also the proneness of citation data to
fragmentation.
A more fundamental question regarding the methodology we have used to
identify emerging topics is whether alternative methodologies would perform
equally well, or whether known cases of emergence during the 2007-2010 period
were missed. For example, could either the direct citation clusters or co-citation
threads be used on their own to detect emergence? Direct citation clusters have
measurable growth properties so a slope analysis looking for inflection points
might be possible. Alternatively, emergent co-citation threads could be grouped
using some alternative bibliometric measure independent of the direct citation
clustering and used as an emergence indicator. These possibilities remain to be
explored, but what we can say now is that the two methods, based on different
citation metrics and algorithms, can be used in a complementary manner that
takes advantage of the longitudinal and cross-sectional strengths of the respective
methods. Lacking any definitive list of emerging topics for the period, we cannot
say whether areas have been missed, but a good source of intelligence on this
question can be obtained from the Breakthrough of the Year listings in Science,
where we have seen some confirmation of our selections, but not a one-to-one
match.
Conclusions
It seems clear that specific highly cited papers have played a key role in
emergence in 17 of 25 topics, including technological areas such as cognitive
radio and compressed sensing. It is likely that most of these discoveries and
innovations could not have been anticipated, even though with hindsight we
might be able to identify precursor papers in the direct citation network that might
foretell possible forthcoming breakthroughs. One task for future research will be
to use this list of topics and similar lists from other years to see if common
preconditions to discovery and innovation can be found. It is also of interest to
study the fate of these emerging topics in later years. Did work continue, decline
or disappear? We would not be surprised if some were proved to be errors, dead
ends, or continued under their own inertia until well past their prime. Having a
reasonably certain inventory of emergent topics as a quasi-gold standard opens up
many new research possibilities, for example, studies of changes in sentiment words
during emergence, or of correlated social network or institutional factors.
The role of exogenous events, which was a factor in 40 percent of topics, also
deserves further attention. Previous bibliometric case studies have been carried
out on topics such as the 9/11 and anthrax terrorist attacks (Chen, 2006; Morris,
Yen, Wu, & Asnake, 2003), but perhaps more common exogenous events are
disease or natural disaster-related. We do not know how pervasive such
influences are or in general the role that extra-scientific factors have in
emergence. As we delve more deeply into other topics, we may find further
evidence of exogenous stimuli. For example, in metal-organic frameworks (topic
25) it was not immediately obvious that the DOE had issued new targets for
hydrogen storage.
Regarding our methodology, we do not know whether we can reduce the average
time lag of 2.5 years from the so-called emergence event to our detection of
emergence. This may depend on our ability to identify emergent co-citation
threads earlier, perhaps by adjusting our threading threshold, since we know that
the slope of the direct citation cluster growth curve will not be steep at earlier
stages. Perhaps an indicator of network structure can also be devised.
In modelling the emergence process at the paper level we need to further
investigate the factors of citedness, paper age, and continuity of the highly cited
papers. These variables might eventually be part of an emergence index, in
conjunction with the topic growth rate. Obviously the precision of topic paper
identification is critical in such an analysis, and the combination of direct citation
and co-citation methods used here has probably contributed to this accuracy.
Clearly at this stage we are engaged in detection and not prediction of emergence.
Perhaps the most important implication of the present work is that detection by
citation-based methods is broadly feasible using a global approach to data
analysis rather than the local or case study approach that has predominated up
to now. Whether detection can be enhanced by a deeper analysis
of full texts, or application, for example, of word-based methods remains to be
seen.
Acknowledgments
Scopus data from 1996 to 2010 were generously provided by Elsevier under an
agreement with SciTech Strategies, Inc. We would like to thank Ludo Waltman
and Nees Jan van Eck and CWTS for use of the direct citation clustering software.
This research is supported by the Intelligence Advanced Research Projects
Activity (IARPA) via Department of Interior National Business Center
(DoI/NBC) contract number D11PC20152. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon. Disclaimer: The views and conclusions contained
herein are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or implied, of
IARPA, DoI/NBC, or the U.S. Government.
References
Bettencourt, L. M. A., Kaiser, D. I., Kaur, J., Castillo-Chavez, C., & Wojick, D.
(2008). Population modeling of the emergence and development of
scientific fields. Scientometrics, 75(3), 495-518.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. Annals
of Applied Statistics, 1(1), 17-35.
Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic
coupling, and direct citation: Which citation approach represents the
research front most accurately? Journal of the American Society for
Information Science and Technology, 61(12), 2389-2404.
Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and
transient patterns in scientific literature. Journal of the American Society
for Information Science and Technology, 57(3), 359-377.
Chen, C., Hu, Z., Liu, S., & Tseng, H. (2012). Emerging topics in regenerative
medicine: A scientometric analysis in CiteSpace. Expert Opinion on
Biological Therapy, 12(5), 593-608.
Garfield, E., Sher, I. H., & Torpie, R. J. (1964). The use of citation data in writing
the history of science. Philadelphia: Institute for Scientific Information.
Klavans, R., & Boyack, K. W. (2011). Using global mapping to create more
accurate document-level maps of research fields. Journal of the American
Society for Information Science and Technology, 62(1), 1-18.
Morris, S. A., Yen, G., Wu, Z., & Asnake, B. (2003). Time line visualization of
research fronts. Journal of the American Society for Information Science
and Technology, 54(5), 413-422.
Novoselov, K. S., Geim, A. K., Morozov, S. V., Jiang, D., Zhang, Y., Dubonos,
S. V., et al. (2004). Electric field effect in atomically thin carbon films.
Science, 306(5696), 666-669.
Price, D. J. D. (1965). Networks of scientific papers. Science, 149, 510-515.
Shapira, P., Youtie, J., & Arora, S. (2012). Early patterns of commercial activity
in graphene. Journal of Nanoparticle Research, 14(4), art. num. 811.
Shibata, N., Kajikawa, Y., Takeda, Y., & Matsushima, K. (2008). Detecting
emerging research fronts based on topological measures in citation
networks of scientific publications. Technovation, 28, 758-775.
Shibata, N., Kajikawa, Y., Takeda, Y., Sakata, I., & Matsushima, K. (2010).
Detecting emerging research fronts in regenerative medicine by the
citation network analysis of scientific publications. Technological
Forecasting & Social Change, 78(2), 274-282.
Small, H. (1977). A co-citation model of a scientific specialty: A longitudinal
study of collagen research. Social Studies of Science, 7(2), 139-166.
Waltman, L., & Van Eck, N. J. (2012). A new methodology for constructing a
publication-level classification system of science. Journal of the
American Society for Information Science and Technology, 63(12), 2378-
2392.