Comparing GDELT and ICEWS event data

Michael D. Ward & Andreas Beger & Josh Cutler & Matthew
Dickenson & Cassy Dorff & Ben Radford
October 15, 2013

This research was undertaken at mdwardlab.com at Duke University. We thank ICEWS colleagues Liz Boschee and Mark Hoffman, who gave helpful feedback and guidance. This was partially supported by ONR contract N00014-12-C-0066 to Lockheed Martin’s Advanced Technology Laboratories and by NSF Grants SES-1259190 and SES-1259266.

In August, there were 150,000 views of a map of protest activity around the world, based on the GDELT database. These are event data, a type of data invented in the mid-1960s by Charles McClelland, who aimed at creating a way to study diplomatic history in a systematic way.¹ From WEIS, through COPDAB, CREON, and many others, event data collections have long served the policy and academic community as a working sensor, revealing details about political interactions among and within countries.²

1 See “The Acute International Crisis,” World Politics, Volume 14, Special Issue 01, October 1961, pages 182-204. See also: Justin Grimmer and Brandon M. Stewart. "Text as data: The promise and pitfalls of automatic content analysis methods for political texts." Political Analysis 21.3 (2013): 267-297; Robert C. North, Ole R. Holsti, George Zaninovich, and Dina A. Zinnes. Content analysis: A handbook with applications for the study of international crisis. Vol. 184. Evanston, IL: Northwestern University Press, 1963; and, Deborah J. Gerner, et al. “The analysis of political events using machine coded data.” International Studies Quarterly 38.1 (1994): 91-119.

2 COPDAB is introduced in Edward E. Azar, "The conflict and peace data bank (COPDAB) project." Journal of Conflict Resolution 24.1 (1980): 143-152. For CREON see Margaret G. Hermann, Barbara G. Salmore, and Stephen A. Salmore. CREON, a foreign events data set. Beverly Hills: Sage Publications, 1973.

In addition to global coverage, some data sets used random samples. Others focused on specific domains of behavior, such as resource nationalism. What is different now is that rather than having armies of students collect these data, it can be done automatically, using simple but powerful computer programs that scan text and determine the action and actors involved. Prior efforts had relied on human coding of compiled chronologies. Philip A. Schrodt was responsible for the first program (called KEDS) that automated content analysis of textual information to create event data.³

3 Philip A. Schrodt, Shannon G. Davis, and Judith L. Weddle. "Political science: KEDS–a program for the machine coding of event data." Social Science Computer Review 12.4 (1994): 561-587. A more complete and updated summary is available at http://eventdata.psu.edu/utilities.dir/KEDS.History.0611.pdf.

CAMEO, a coding scheme descended from KEDS, serves as the coding basis for ICEWS and more recently for GDELT, a “global database of events, language, and tone.” GDELT has been introduced in the past year and has generated a large amount of excitement in the policy and academic community. GDELT is well described elsewhere, and has the great benefit of being both open source and continuously updated, permitting its widespread use in academic as well as policy studies. The repository site (http://gdelt.utdallas.edu/) contains links to many articles covering GDELT, the complete GDELT documentation, computer programs that have been used to analyze the data, as well as the actual data. According to recent reports, GDELT now includes approximately 250 million events, dated from 1979 to the present.⁴

4 See Phil Schrodt, “GDELT: Global Data on Events, Location, and Tone,” a presentation for the Conflict Research Society, Essex University, 17 September 2013 for current details and planned enhancements.

ICEWS is an early warning system designed to help US policy analysts predict a variety of international crises to which the US might have to respond. These include international and domestic crises, ethnic and religious violence, as well as rebellion and insurgency. This project was created at the Defense Advanced Research Projects Agency, but has since been funded (through 2013) by the Office of Naval Research.⁵ ICEWS began as a 4-year DARPA program in 2007 to demonstrate the potential of using social science models and theory to forecast and understand nation-state instability across a range of countries. The program proved successful and spawned three component tools: iTRACE (news analytics), iCAST (instability forecasting), and iSENT (sentiment analysis and opinion propagation in social media). While it started with a test bed of twenty-five countries in the US Pacific Command, ICEWS currently gathers data on about 250 countries and territories, excluding the US. However, the forecasting effort only concerns 167 countries. ICEWS researchers decided early on not to model instability in smaller countries and territories, such as the Vatican and Pitcairn Island, even though events may be collected for them.

5 Sean P. O’Brien, “Crisis early warning and decision support: Contemporary approaches and thoughts on future research.” International Studies Review 12.1 (2010): 87-104. But especially see: Sean P. O’Brien, “A multi-method approach for near real time conflict and crisis early warning.” Handbook of Computational Approaches to Counterterrorism. Springer New York, 2013. 401-418.

Four aspects of the ICEWS project are noteworthy: (1) it produces and consumes a very rich corpus of text which is analyzed with powerful techniques of automated event-data production.⁶ Indeed, Schrodt was involved in the first phases of the project where the extraction techniques for ICEWS event data were developed; (2) it uses a variety of systematic (mostly statistical) models to generate predictions for five dependent variables that are created outside of the event data process: international and domestic crises, insurgency, rebellion, and ethnic and religious violence. Models, largely based on event data, make predictions for these five variables for each of the 167 countries for six months in advance. These predictions are evaluated for accuracy⁷; (3) the various predictions are averaged using ensemble methods to create an average prediction that is more accurate, with fewer false positives and false negatives, than any of the individual models⁸; and importantly (4) a version of this decision aid has been in use for several years, and has a large number of government users. The Duke team has been a participant in this research and has several recent papers related to our efforts at the models and the statistics behind them.⁹

6 Boschee, Elizabeth, Premkumar Natarajan, and Ralph Weischedel. “Automatic extraction of events from open source text for predictive forecasting.” Handbook of Computational Approaches to Counterterrorism. Springer New York, 2013. 51-67.

7 Michael D. Ward, Nils W. Metternich, Christopher Carrington, Cassy Dorff, Max Gallop, Florian M. Hollenbach, Anna Schultz, & Simon Weschle. “Geographical Models of Crises: Evidence from ICEWS,” Advances in Design for Cross-Cultural Activities, Part I, CRC Press, edited by Dylan D. Schmorrow and Denise M. Nicholson, 2012, pp. 429-438.

8 Jacob M. Montgomery, Florian Hollenbach, Michael D. Ward. “Improving Predictions Using Ensemble Bayesian Model Averaging,” Political Analysis 20.3 (2012): 271-291.

9 See Michael D. Ward, Nils W. Metternich, Cassy Dorff, Max Gallop, Florian M. Hollenbach, and Simon Weschle. “Learning from the past and stepping into the future: toward a new generation of conflict prediction.” International Studies Review 15.4 (2013): in press.

GDELT and ICEWS are arguably the largest event data collections in social science at the moment. During their brief existence they have also been among the most influential data sets in terms of their impact on academic research and policy advice. Yet we know little to date about how these two repositories of event data compare to each other. Given how new both are, a direct comparison is worthwhile.

A focused comparison of GDELT and ICEWS data
How to compare different databases? An important dimension is availability. GDELT has been open and freely available since the summer of 2013. That is a big win for the policy and academic community. Anyone, including researchers from Walmart, JPMorgan Chase, Goldman Sachs, Barclays, Expedia, the Central Intelligence Agency, the Human Rights Data Analysis Group, and even mdwardlab.com can freely use the data. This is a tremendous achievement and merits both acknowledgment and recognition. ICEWS data are not widely available. The full story of why the data are not publicly available can’t be told here, but suffice it to say that the success of ICEWS within the operational community of the US government led to a reversal of policy and the contravention of extensive plans, operational as recently as 2010, to make all the ICEWS data freely available to all users. Thus, at present only a limited number of researchers have access to ICEWS event data. Currently, ICEWS event data are available only for government use. There are thousands of users with access to these data through ISPAN and/or the W-ICEWS servers. The real data limitation for research is that these data are being treated as for official use only (FOUO) at this point and are therefore not available to everyone. While constraining in one sense, that limitation allows the W-ICEWS data to be linked back to the originating full story (English, Spanish, Portuguese, etc.) so that the event can be examined within a textual context. This is less important for modeling, but for the many users that use iTRACE to maintain situation awareness, having access to the full story is important. The ICEWS license from FACTIVA (and the Open Source Center) allows this for government consumption, but not for redistribution.

A second approach is to look at the goals of each database. The ICEWS event data collection has a traditional approach, but modern mechanisms. The collection tries to accurately reflect the activities among and within nations and their main political actors. Thus, a fair amount of effort goes into filtering the raw stream of reported stories into a unique stream of events. Stories about the history of violence between, for example, Japan and Korea during the 1930s are eliminated from the stream of events that apply to the current era, even if they appear in the contemporary press.¹⁰ Also winnowed out are stories about the “war” being waged by the Bank of Japan on the Indian currency, as are the many business and sports stories that use the language of politics to describe contests that fall largely outside the realm of politics.¹¹ In addition, a large effort went into refining the actor dictionaries, so that stories could be parsed into precise events among specific actors. While not perfect, this is an important aspect of correctly coding events.

10 See the spike in conflict found in GDELT between Russia and Afghanistan in 2011. The US was considering undertaking military action in Afghanistan and many stories about Russian involvement in the previous century surfaced.

11 Stories involving Israel are often written with conflictual language that results in conflict events being created, even when the subject does not involve any conflict.

The ICEWS data team improved the CAMEO ontology, largely by resolving overlaps and clarifying guidelines for each extant type of event.¹² In order to gauge the effectiveness of these changes, as well as to provide an assessment of accuracy, an experiment was conducted for ICEWS by Liz Boschee, of BBN. As a comparison of the most recent ICEWS data gathered using advanced, graph-theoretic, natural language processing (NLP) techniques and the amplified ontology (“Serif”) with the earlier vintage (“Jabari”) coding system, events for four CAMEO codes (14, 17, 18, 19) were randomly selected for each system (3000 total events). These were randomly shuffled and then presented anonymously to trained coders who graded each coding as correct or incorrect. The results show a substantial jump in accuracy, illustrated in Table 1. The original coding algorithms were accurate in fewer than one-half of the randomly selected events, according to the trained coders. However, more than two-thirds of all events were correctly classified using the amplified and elaborated framework for coding to the CAMEO codes. The improved accuracy was accomplished without any loss in the number of correct events produced. Initially, only four CAMEO codes focused on conflictual events were studied, but currently ten codes are being used and all the codes in the entire CAMEO ontology are scheduled for October 2013 completion.¹³

12 Explained by Liz Boschee (personal communication): We expanded the codebook with additional guidelines and examples designed to clarify potential ambiguities and to resolve overlap between event codes and subcodes.

13 One minor point that is nonetheless important: the Jabari program was itself an improvement over the TABARI program, and it is currently being used (instead of TABARI) for those codes not coded with the SERIF approach.

Table 1: Coding Accuracy in Random Sample of 3000 Events, coded differently

Category    CAMEO Code    Jabari    Serif
Protest     14            42%       86%
Coerce      17            43%       83%
Violence    18 & 19       45%       74%
Mean                      45%       81%

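The Mean row of Table 1 can be compared with a simple average of the three category rows. The sketch below assumes an unweighted mean; the table does not report how many of the 3000 events fall in each category, so a weighted mean (which may be what the table reports) cannot be reproduced here. The dictionary and function names are illustrative.

```python
# Accuracy percentages copied from Table 1 (category: percent correct).
JABARI = {"Protest (14)": 42, "Coerce (17)": 43, "Violence (18 & 19)": 45}
SERIF = {"Protest (14)": 86, "Coerce (17)": 83, "Violence (18 & 19)": 74}


def unweighted_mean(accuracy_by_category):
    """Simple average across categories, ignoring category sizes (an assumption)."""
    return sum(accuracy_by_category.values()) / len(accuracy_by_category)


print(f"Jabari: {unweighted_mean(JABARI):.1f}%")  # Jabari: 43.3%
print(f"Serif:  {unweighted_mean(SERIF):.1f}%")   # Serif:  81.0%
```

The unweighted Serif figure agrees with the reported 81%, while the unweighted Jabari figure sits slightly below the reported 45%, which would be consistent with weighting by category size or with rounding in the source.
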
At present, the ICEWS event data go back to 2001 and contain about 30 million “stories” that are parsed and coded using NLP techniques based on word graphs using a specially developed ontology based on CAMEO. These are gleaned from about 6000 sources, but many of these are aggregators of hundreds of other sources. So the number of sources is not really informative. What is useful to know is that these media span international, regional, national, and local sources. Importantly, these are all filtered and subjected to the developed ontology using the NLP techniques developed by BBN. The stability in the rate of collected stories, events, and stories with events is quite remarkable. A modest increase in events and stories is seen in the period from 2000 to about 2003, but thereafter the number of events is fairly constant, as shown in Figure 1. This stability does not characterize the GDELT data, as shown in Figure 2. ICEWS has contracted for data back to 1990, and these data are scheduled to be available and coded with the new ontologies by the end of 2013.

Figure 1: Stories (in grey) in ICEWS corpus, 1 January 2001 until 30 April 2013. Events harvested from these stories are shown in black. Stories increase a bit over the period, but for the most part, the number of events is relatively stable. About 26 million stories comprise the current ICEWS corpus; there are approximately 16 million events. This averages to about 700 events per country per month.
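
The per-country average in the Figure 1 caption can be sanity-checked with a few lines of arithmetic. This sketch assumes the 167 forecast countries as the denominator and counts calendar months from January 2001 through April 2013; the caption does not state which country count was used, so the result is only expected to be of the same order as the quoted figure.

```python
# Rough check of "events per country per month" for the ICEWS corpus.
ICEWS_EVENTS = 16_000_000   # approximate event count from the Figure 1 caption
COUNTRIES = 167             # assumption: the countries covered by the forecasts

# Calendar months from January 2001 through April 2013, inclusive.
months = (2013 - 2001) * 12 + 4

country_months = months * COUNTRIES
events_per_country_month = ICEWS_EVENTS / country_months

print(months)                           # 148
print(country_months)                   # 24716
print(round(events_per_country_month))  # 647
```

Under these assumptions the average comes out near 650 events per country-month, the same order as the "about 700" in the caption.
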

Unfortunately, there is no ground truth to use to gauge the accuracy of these data. Each data point needs to be assessed by drilling down to the story, reading it, and figuring out if the coding is correct. To do so obviates the goal of automated event coding, but can be useful in identifying errors in that coding. While individual mileage may vary, our experience has been reasonably reassuring to us that generally ICEWS is getting at something real. Users of GDELT doubtless are also convinced that it is getting at something real. Of course, it is impossible to know what stories were not written or even suppressed, and like the well known bias in SIGACTS, events only happen when they get reported.

Figure 2: GDELT data density over time in gigabytes per year. Taken from Phil Schrodt’s slide presentation to the Workshop at the Conflict Research Society, Essex University, 17 September 2013 (slide 18).

The GDELT data collection starts from an entirely different philosophy. Rather than trying to get to the “truth” it tries to capture an extensive picture of what is reported, both in its details (who, what, where, when) and its extensiveness (how many reports are there). Therefore GDELT has many more events per country per unit time, since it does not winnow stories extensively. GDELT has about 68,000 country-months (34 years by 167 countries) compared to about 24,000 in ICEWS. Yet, GDELT has an order of magnitude more events. Importantly, the volume of data being harvested by GDELT is growing exponentially, as is the base level of events therein: the density of data is about 100 gigabytes in 1997 and has grown to over 600 GB in 2011. GDELT has, at present and by design, a collection mechanism that tries to maximize reports, but no extensive mechanism for pruning those events to eliminate the false positive reports. It does have a reduced version that we did not use, which limits the data to one record of each event type between actors per day. ICEWS data, on the other hand, are extensively winnowed and exhibit no corresponding exponential increase, though there is a much smaller time frame involved at present. Indeed, the number of events has been relatively stable from 2001 to the present, as shown in Figure 1.
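
The reduced version mentioned above keeps at most one record of each event type between a pair of actors per day. A minimal sketch of that kind of reduction follows; the `Event` fields and the sample records are hypothetical and do not reproduce GDELT's actual schema or processing.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical, simplified event record (not the GDELT schema)."""
    date: str    # day of the event, e.g. "2011-11-19"
    source: str  # actor initiating the action
    target: str  # actor receiving the action
    code: str    # event type, e.g. CAMEO root code "14" for protest


def one_per_day(events: Iterable[Event]) -> List[Event]:
    """Keep the first record for each (date, source, target, code) combination."""
    seen = set()
    kept = []
    for event in events:
        key = (event.date, event.source, event.target, event.code)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


# Made-up reports: three describe the same protest, one is a different event type.
reports = [
    Event("2011-11-19", "EGY protesters", "EGY military", "14"),
    Event("2011-11-19", "EGY protesters", "EGY military", "14"),
    Event("2011-11-19", "EGY protesters", "EGY military", "14"),
    Event("2011-11-19", "EGY military", "EGY protesters", "18"),
]

print(len(reports), "->", len(one_per_day(reports)))  # 4 -> 2
```

A reduction of this kind lowers the count of repeated reports but does nothing about stories that were coded incorrectly in the first place, which is a separate filtering problem.
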

We also could, for example, compare the overall correlations for all countries in all time points. If these correlations were really high, it would give some confidence that both collections were measuring the same thing. But, since the two technologies have different goals, this kind of comparison is uninformative. Scholars at Penn State have shown that in total, and for most countries in the Pacific Rim, there are more GDELT events than ICEWS events. These comparisons use an early version of the ICEWS data that is not representative of the techniques currently employed in the generation of event data by the ICEWS team.

See Bryan Arva, John Beieler, Benjamin Fisher, Gustavo Lara, Philip A. Schrodt, Wonjun Song, Marsha Sowell, and Sam Stehle. “Improving Forecasts of International Events of Interest.” In EPSA 2013 Annual General Conference Paper, vol. 78. 2013.

While we have no desire to redo the massive comparisons undertaken by the PSU scholars, we found it insightful to perform a more modest comparison of results based upon GDELT and current ICEWS data for an analysis of three countries that have been the sites of contemporary crises.

Protest and demonstrations in Egypt and Turkey, and fighting in Syria provide a specific, small set of interesting cases on which to compare the widely available GDELT data with the latest event data used by the ICEWS project.

Table 2: Daily Events in Egypt during November 2011 and November 2012

Protest Events
       November 2011      November 2012
Day    ICEWS   GDELT      ICEWS   GDELT
  1        4      23          1       7
  2        2      19          1      19
  3        2      34          0       7
  4        0      20          0      15
  5        0       8          0      21
  6        0      15          4      10
  7        0       7          0      12
  8        1       8          1      16
  9        0       5          2      10
 10        0      13          0      10
 11        1      17          0       4
 12        0      21          0      14
 13        4      20          3      11
 14        2      31          2      15
 15        2      17          2      28
 16        0      25          1      85
 17        1      34          2      38
 18        5      93          4      14
 19       33     130         32      43
 20       77     200         23      29
 21      104     162         14      30
 22       72     204         13      43
 23       40     199         29     180
 24       31     161         22     128
 25       30     145         20     108
 26       20     130         19      85
 27        3      88         40     153
 28       17      88         28     159
 29       10      40          8      67
 30        8      42          8      72

We begin with an analysis of Egyptian protest in November of 2011. There were many protests in Cairo, and across the country, aimed at speeding up the reforms on the one hand, and an end to military rule on the other, ideally followed by a quick election and a new constitution. Statements by the military led to massive clashes on the 19th of November, in which many hundreds were casualties of clashes with the military, including several deaths, especially in Tahrir Square. Clashes continued through November and into December.

Moving ahead one year to 2012, November continues to be a violent month in recent Egyptian history. Around the 18th of November secular, anti-Morsi groups abandoned the constitutional assembly in anticipation of the passage of additional anti-secular laws. Once again Tahrir Square filled with protesters on both sides. Some of these protests were to commemorate the clashes between pro- and anti-Morsi forces exactly a year earlier. By the 22nd Morsi began purging judicial officials perceived to be anti-government, and by the 23rd protests and demonstrations were seen not only in Cairo, but throughout Egypt. The rest of 2012 and the first half of 2013 continued to be contentious, and in early July 2013 Morsi was removed from office by a military coup d'état.

Looking at both event streams, GDELT and ICEWS, the signal of increasing protests is evident during the unfolding of the Egyptian Revolution and Aftermath in November of both 2011 and 2012. It is clear that GDELT has more reports of events, but this doesn’t mean that there are more events, even if we know that not all protests are reported in the press. ICEWS also shows the evolution of protest behavior, but instead of focusing on reports, it focuses on what are purported to be events. The correlation between the two, in this case, indicates that about 2/3 of the variance in these two series is shared (actually 71%). Neither stream is perfect, nor pretends to be.

What is clear is that in 2011 both GDELT and ICEWS pick up the main protests in Egypt, with ICEWS peaking on the 21st and GDELT peaking on the 22nd, but having 200 events reported on the 20th as well. In 2012, GDELT peaks on Friday, the 23rd, and ICEWS the following week on the 27th (a Tuesday). It should be remembered that the GDELT data are growing exponentially, yet GDELT events do not appear to be more frequent in Egypt for November 2012 than a year earlier. If we look at the series for order of magnitude changes, the picture is a little different, as both GDELT and ICEWS show November 19th, 2011 as a breakpoint. In 2012, ICEWS also has the 19th as a tipping point, while GDELT has double digit daily counts over much of the month, but shows a breakpoint on the 23rd.
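
The shared-variance and peak-day statements can be recomputed from the daily counts in Table 2. The sketch below assumes that "shared variance" means the squared Pearson correlation and computes it separately for each November; the text does not say which series or pooling produced the 71% figure, so the printed values need not match it exactly. Function names are illustrative.

```python
from math import sqrt

# Daily protest counts for Egypt, days 1-30 of November, copied from Table 2.
ICEWS_2011 = [4, 2, 2, 0, 0, 0, 0, 1, 0, 0, 1, 0, 4, 2, 2,
              0, 1, 5, 33, 77, 104, 72, 40, 31, 30, 20, 3, 17, 10, 8]
GDELT_2011 = [23, 19, 34, 20, 8, 15, 7, 8, 5, 13, 17, 21, 20, 31, 17,
              25, 34, 93, 130, 200, 162, 204, 199, 161, 145, 130, 88, 88, 40, 42]
ICEWS_2012 = [1, 1, 0, 0, 0, 4, 0, 1, 2, 0, 0, 0, 3, 2, 2,
              1, 2, 4, 32, 23, 14, 13, 29, 22, 20, 19, 40, 28, 8, 8]
GDELT_2012 = [7, 19, 7, 15, 21, 10, 12, 16, 10, 10, 4, 14, 11, 15, 28,
              85, 38, 14, 43, 29, 30, 43, 180, 128, 108, 85, 153, 159, 67, 72]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def peak_day(counts):
    """Day of the month (1-based) with the highest count; the first day wins ties."""
    return counts.index(max(counts)) + 1


for year, icews, gdelt in [(2011, ICEWS_2011, GDELT_2011),
                           (2012, ICEWS_2012, GDELT_2012)]:
    r = pearson_r(icews, gdelt)
    print(f"November {year}: r^2 = {r * r:.2f}, "
          f"ICEWS peak day {peak_day(icews)}, GDELT peak day {peak_day(gdelt)}")
```

The peak days returned by `peak_day` are the 21st (ICEWS) and 22nd (GDELT) in 2011 and the 27th (ICEWS) and 23rd (GDELT) in 2012, matching the dates given in the text.
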

Table 3: Geographical Variance for ICEWS and GDELT.

Country    Source    Lat σ̂    Lon σ̂
Egypt      ICEWS      0.22     0.38
Egypt      GDELT      0.74     2.25
Syria      ICEWS      0.43     1.06
Syria      ICEWS      0.82     1.21
Turkey     ICEWS     11.37     1.01
Turkey     ICEWS     22.19     1.81

The accompanying Web page (http://mdwardlab.com/gdelt-and-icews) provides a better illustration of these data. Therein you can dynamically examine protests in Egypt and Turkey over the past few years, both in terms of their timeline and geographical distribution. In addition, we have included material conflict for Syria. These displays allow one to compare the ICEWS and the GDELT data visually in these specific cases. As shown numerically in Table 3, GDELT data appear to have a wider range of geolocations than the ICEWS data. Many ICEWS events are geographically disambiguated to central locations, a characteristic that is not shared by the GDELT events. But this pattern is not uniform among all countries, nor among all categories of events. GDELT shows more geographical variance in each country, but the differences are modest, except in Turkey where GDELT shows protests happening in virtually every locale, whereas the ICEWS protest data for Turkey are more concentrated in population centers.
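
Table 3 summarizes geographic dispersion with σ̂ for latitude and longitude. One plausible reading, assumed here because the text does not define the estimator, is the sample standard deviation of the event coordinates in degrees. The coordinates below are invented for illustration and are not taken from either database.

```python
from statistics import stdev


def geo_spread(points):
    """Return (latitude sd, longitude sd) in degrees for a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return stdev(lats), stdev(lons)


# Invented coordinates: one source pins events to a central location,
# the other spreads them across many locales.
concentrated = [(30.04, 31.24), (30.05, 31.23), (30.06, 31.25), (30.03, 31.24)]
dispersed = [(30.04, 31.24), (31.20, 29.92), (27.18, 31.19), (24.09, 32.90)]

for name, points in [("concentrated", concentrated), ("dispersed", dispersed)]:
    lat_sd, lon_sd = geo_spread(points)
    print(f"{name}: lat sd = {lat_sd:.2f}, lon sd = {lon_sd:.2f}")
```

A source that resolves most events to a capital or provincial center will show small values on both axes, which is the pattern the text describes for ICEWS.
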

Figure 3: Interactive comparison of ICEWS and GDELT over time and space for three countries (available at present at http://mdwardlab.com/gdelt-and-icews/index.html).

In Turkey, the picture is similarly complicated, as shown in Figures 3 and 4. Recent protests were widespread, and this will have been widely reported in the Turkish press, but maybe not elsewhere. Recent government estimates have suggested that only four provinces out of 81 remained completely calm in the post-June era. But most of these protests took place in cities or other population centers, and few events took place in smaller counties. Moreover, both GDELT and ICEWS capture the Kurdish protests (mainly in southeast Turkey), but these protests are not part of the post-June anti-government protests. For example, ICEWS shows the high level of protests in Diyarbakir. These are Kurdish protests, nearly all of which took place before the post-June movement, and which voiced the demands of this ethnic minority. These protests are unrelated to the post-June movement. GDELT has few protests in Anatolia, and there were in fact some small protests there in June 2013 and afterward. It appears that ICEWS understates the geographical spread of the recent protests in Turkey, but GDELT may overstate it. Both pick up the Kurdish protests as well as the anti-government protests. The general impression provided to a small group of Turkey experts we asked to compare these two sets of data is that GDELT overstates by a lot the amount of protest, representing protests in areas that are unlikely to have been involved in the Gezi protests. That said, the ICEWS data probably understate the geographical spread of these protests. Table 4 reports these data for four weeks around the Gezi protest. Figure 4 illustrates that both series pick up the main onset of protests in Turkey, but then ICEWS comes back to a much lower level of protest counts (an order of magnitude lower) by June 15th.

Figure 4: ICEWS (blue) and GDELT (green) plots of protests during May and June 2013.

Table 4: Pre- and Post-Gezi Protests, as reported by ICEWS and GDELT databases.

Date       ICEWS    GDELT
May 29         1        8
May 30         0        6
May 31        15       83
June 1        56      189
June 2        48      142
June 3        94      207
June 4        50      136
June 5        37      135
June 6        26       99
June 7        13       65
June 8        19       64

What is the take-away from these comparisons?
First, most of the shortcomings of the GDELT data are well known and well established, even if they are ignored by many users and pushers alike. They are well known by the community that creates and uses these data, but largely overlooked by the community that uses creations based on these data.¹⁴ The community that has created these data and stewards their growing use is well aware of their shortcomings as well as their strengths. Many different client communities will be able to write filters, perhaps in the form of user-friendly widgets, that focus only on some feature of these data. In this way, the GDELT approach of collating and encoding all the printed news may also serve as a data source for event data encodings that have specific substantive foci, such as human rights abuses or disappearances of political actors. These filters will get good, in short order, at eliminating some of the false positives as well, such as the historical references that often confuse NLP text encoding. Schrodt noted these data are in BETA, but many treat them as fully finished. However, it is one thing to have a great data set that is newly available but has a high rate of error. It is quite another to have to explain to General Dempsey why you woke him up, and find out upon further inspection that it was because of a false positive generated by the data collection algorithm. Thus, it is important to have some sense of the error bands on whatever uses the data are employed to accomplish. Our sense is that the uncertainty on events in ICEWS is less than in GDELT, a judgement presaged by the goals of each collection, but validated in research as well. These data serve the modeling goals of the ICEWS research project, at present. That said, the availability of GDELT data is terrific, and we have little doubt that these data can be utilized for similar purposes.

14 As an example of the wisdom of the community, see Philip Schrodt’s analysis: http://asecondmouse.wordpress.com/2013/05/03/seven-remarks-on-gdelt/.

Second, even automated approaches to text processing need an ontology from which to construct meaning. The CAMEO framework is a very good one, one that has been improved on considerably over time, and according to Phil Schrodt, the originator of CAMEO, will shortly be supplanted by a new one, PETRARCH. The ICEWS elaboration of the CAMEO ontology undertaken for ICEWS by Elizabeth Boschee is superb, and along with the introduction of advanced NLP techniques produced a substantial improvement in the quality of the data over the prior CAMEO framework we used. Insofar as we know, no other automated coding framework has been examined against the “ground truth” in this way. Without that improvement the accuracy of the coding system as gauged by trained human coders was less than 50% in correctly identifying the type of event. A fifty percent improvement in accuracy is substantial and affects not only false positives, but also false negatives. This evidence undergirds much of our confidence in the ICEWS data.

See https://github.com/eventdata/PETRARCH.

Figure 5: Map of GDELT Protests in Turkey

Figure 6: Map of ICEWS Protests in Turkey

Third, country-level analyses cannot tell the whole story of political instability. When ICEWS began in 2007 there was hope that models could be disaggregated to give localized predictions. But geo-location was not then possible. However, it is now possible to get a much more disaggregated map of where there is instability using automated techniques. This is important not only for the data, but ultimately for models and clients that use these data. GDELT has a method for the resolution of geographic location of events that provides more specific locations, at least in the countries we examined.

The wrong question to ask is whether ICEWS or GDELT is superior. More sensible is the question of which data can be usefully applied to what kinds of questions. Are the data complementary? Is one database better at addressing under-reported parts of the world, such as three of the largest countries in the world: China, India, and Indonesia? And most importantly, what can each database be used to accomplish in an academic as well as policy setting? It is clear to us that both databases pick up major events remarkably well. The volume of GDELT data is very much larger than the corresponding ICEWS data, but they both pick up the same basic protests in Egypt and Turkey, and the same fighting in Syria. GDELT may have 553 protests in Egypt on January 27, 2011 and ICEWS reports only 95, but both give a similar message. Which is correct? Users would like to know whether erring on the side of false positives (GDELT) is better than the ICEWS strategy of avoiding false positives. Which gets more events correct? Unfortunately, we don’t know the answer to this question, but it should be possible to answer.¹⁵ It seems clear, however, that GDELT overstates the number of events by a substantial margin, but ICEWS misses some events as well.

15 We have designed such a study, for which we hope to have results soon.

Characteristic of many decision-making problems, the choice is
between willingness to be wrong and desire to be right.
... GAP provides researchers with a state-of-the-art framework for KG-to-text models. Though we experiment with supervised baselines which include a handcrafted dataset, WebNLG, and an automatically generated dataset, EventNarrative, repositories of structured data exist in the clinical (Johnson et al., 2016), medical (Bodenreider, 2004, and news crises (Leetaru and Schrodt, 2013;Ward et al., 2013) domains. By transforming clinical data into natural language narratives, patients with low health-literacy can benefit by more easily understanding their electronic medical records (EMRs), and doctors can more easily transcribe patient data for future use cases, i.e. connecting such data to the medical literature. ...
Preprint
Full-text available
Recent improvements in KG-to-text generation are due to additional auxiliary pre-trained tasks designed to give the fine-tune task a boost in performance. These tasks require extensive computational resources while only suggesting marginal improvements. Here, we demonstrate that by fusing graph-aware elements into existing pre-trained language models, we are able to outperform state-of-the-art models and close the gap imposed by additional pre-train tasks. We do so by proposing a mask structure to capture neighborhood information and a novel type encoder that adds a bias to the graph-attention weights depending on the connection type. Experiments on two KG-to-text benchmark datasets show these models to be superior in quality while involving fewer parameters and no additional pre-trained tasks. By formulating the problem as a framework, we can interchange the various proposed components and begin interpreting KG-to-text generative models based on the topological and type information found in a graph.
... The results yielded by approaches of both communities to date are either not of sufficient quality, require tremendous effort to be replicated with both in-and out-of-distribution data, are immeasurable in terms of quality as there is not any gold standard list of events, or is not comparable to each other (Wang et al., 2016;Ward et al., 2013;Ettinger et al., 2017;Plank, 2016;Demarest and Langer, 2018). ...
Preprint
Full-text available
We propose a dataset for event coreference resolution, which is based on random samples drawn from multiple sources, languages, and countries. Early scholarship on event information collection has not quantified the contribution of event coreference resolution. We prepared and analyzed a representative multilingual corpus and measured the performance and contribution of the state-of-the-art event coreference resolution approaches. We found that almost half of the event mentions in documents co-occur with other event mentions and this makes it inevitable to obtain erroneous or partial event information. We showed that event coreference resolution could help improving this situation. Our contribution sheds light on a challenge that has been overlooked or hard to study to date. Future event information collection studies can be designed based on the results we present in this report. The repository for this study is on https://github.com/emerging-welfare/ECR4-Contentious-Politics.
... Using big data allows us to identify patterns of human interactions and behaviors that ultimately form a network of relationships and linkages which can be useful for organizational decision making (McAfee and Brynjolfsson, 2012). For instance, GDELT has been used to analyze the sociological evolution of specific important eventssuch as the Spanish government's energy policies (Bodas-Sagi and Labeaga, 2016), the 2011 Egyptian revolution (Ward et al., 2013), the political conflicts in Afghanistan and Syria (Yonamine, 2013), and the Arab Spring (Levin et al., 2018). Here, we follow those previous studies in using GDELT to understand how different actors are interconnected and form networks among them. ...
Article
Full-text available
This article disentangles the global interorganizational network by analyzing the ties of international actors—comprising multinational companies, intergovernmental organizations, and international nongovernmental organizations. The onset of COVID-19 is a rare opportunity to explore how this network has evolved in an exogenous event. Using a unique GDELT big dataset of events reported by the world’s media, we extract and analyze the interorganizational interactions of international actors before and after the World Health Organization (WHO) announced that the COVID-19 outbreak was a public health emergency of international concern. Adopting an exploratory and descriptive approach at multiple levels of analysis, we draw on the theory of networks to uncover the fragmented, polycentric, and complex characteristics of the global interorganizational network. Our study highlights the use of media-reported events and the Goldstein scale as means to unpack the difficult-to-capture relational dynamics of international actors, which can help in theory development of the global interorganizational network that is crucial for collective action to address societal grand challenges.
... There are eight conflict event categories recorded in the ACLED database including organized group "battles" among others; our main variable of interest is the "protests and riots" category which we refer to as protest here. Since ACLED does not contain data from the military period from 1988 to 1996, we use another dataset, the Global Data on Events Location and Tone (GDELT) project, which similarly records geocoded data on conflict and mediation events similarly extracted from media and news agency sources over the 1979 to 1999 military period in our sample (Ward et al., 2013). GDELT includes twenty main event categories, including classifications like organized group "fights" or battles and, our main variable of interest, "protest", including reported protest and riot events. ...
Article
This paper examines how protests spread across countries in the 2011 Arab Spring. Based on the diffusion literature, we form hypotheses about the factors that influence the transmission of protests across borders. To test the hypotheses, we use an events data set measuring media reports of protests, government reforms, and acts of repression on a daily basis by country. We show that the strength of the protest movement in one country is significantly affected by protest activities in other countries over the previous 1 or 2 weeks and that protests were more likely to spread between countries that had high levels of bilateral trade. When we examine periods longer than 2 weeks, we find that protests spread across borders only when they were successful in pressuring Arab governments into enacting reforms and when the protests did not lead to government reprisals. In all our models, government repression in one country significantly stifled protests in other countries. Each country was thus significantly affected by the choices that governments in other Arab League nations made, and this interdependence meant that governments had incentives to cooperate with each other in their responses to the Arab Spring protests. Este artículo analiza cómo se extendieron las protestas a través de distintos países durante la Primavera Árabe de 2011. Basándonos en la literatura sobre difusión, formulamos hipótesis sobre los factores que influyen en la transmisión de las protestas a través de fronteras. Para comprobar nuestras hipótesis, utilizamos un conjunto de datos de eventos que miden la información de los medios de comunicación sobre las protestas, las reformas gubernamentales y los actos de represión a diario por país. 
Demostramos que la fuerza del movimiento de protesta en un país se ve significativamente afectada por las actividades de protesta en otros países durante la semana o las dos semanas anteriores y que las protestas eran más propensas a extenderse entre los países que tenían altos niveles de comercio bilateral. Cuando examinamos periodos superiores a dos semanas, comprobamos que las protestas se extienden a través de las fronteras solo cuando estas consiguen presionar a los gobiernos árabes para que promulguen reformas y cuando las protestas no provocan represalias por parte del gobierno. En todos nuestros modelos, la represión gubernamental en un país frenó significativamente las protestas en otros países. Por lo tanto, cada país se vio significativamente afectado por las decisiones que tomaron los gobiernos de otras naciones de la Liga Árabe, y esta interdependencia significó que los gobiernos tenían incentivos para cooperar entre sí en sus respuestas a las protestas de la Primavera Árabe. Le présent article analyse la manière dont les protestations se sont propagées à travers les pays lors du Printemps arabe de 2011. Sur la base de la documentation diffusée, nous émettons des hypothèses sur les facteurs ayant influencé la transmission des protestations à travers les frontières. Afin de vérifier les hypothèses, nous utilisons un ensemble de données d’événements évaluant les comptes-rendus que les médias ont fait des protestations, des réformes du gouvernement et des actes de répression, jour après jour et par pays. Nous montrons que la force du mouvement de protestation dans un pays est affectée de manière significative par les actes de protestation dans d’autres pays au cours de la ou des deux semaines précédentes, et que les protestations avaient beaucoup plus de chance de s’étendre entre des pays ayant des niveaux élevés de commerce bilatéral. 
Lorsqu’on examine des périodes supérieures à deux semaines, on observe que les protestations se propagent à travers les frontières uniquement lorsqu’elles ont réussi à faire pression sur les gouvernements arabes pour qu’ils adoptent des réformes et lorsque ces protestations n’ont pas entraîné de représailles de la part du gouvernement. Dans tous nos modèles, la répression du gouvernement dans un pays a sensiblement étouffé les protestations dans les autres pays. Chaque pays était donc affecté de manière significative par les choix effectués par les gouvernements dans les autres nations de la Ligue arabe, et cette interdépendance signifiait que les gouvernements avaient intérêt à coopérer les uns avec les autres quant à leur réponse face aux protestations du Printemps arabe.
Article
How international is political text-analysis research? In computational text analysis, corpus selection skews heavily toward English-language sources and reflects a Western bias that influences the scope, interpretation, and generalizability of research on international politics. For example, corpus selection bias can affect our understanding of alliances and alignments, internal dynamics of authoritarian regimes, durability of treaties, the onset of genocide, and the formation and dissolution of non-state actor groups. Yet, there are issues along the entire “value chain” of corpus production that affect research outcomes and the conclusions we draw about things in the world. I identify three issues in the data-generating process pertaining to discourse analysis of political phenomena: information deficiencies that lead to corpus selection and analysis bias; problems regarding document preparation, such as the availability and quality of corpora from non-English sources; and gaps in the linguist analysis pipeline. Short-term interventions for incentivizing this agenda include special journal issues, conference workshops, and mentoring and training students in international relations in this methodology. Longer term solutions to these issues include promoting multidisciplinary collaboration, training students in computational discourse methods, promoting foreign language proficiency, and co-authorship across global regions that may help scholars to learn more about global problems through primary documents.
Article
Knowledge Graph (KG) provides high-quality structured knowledge for various downstream knowledge-aware tasks (such as recommendation and intelligent question-answering) with its unique advantages of representing and managing massive knowledge. The quality and completeness of KGs largely determine the effectiveness of the downstream tasks. But in view of the incomplete characteristics of KGs, there is still a large amount of valuable knowledge is missing from the KGs. Therefore, it is necessary to improve the existing KGs to supplement the missed knowledge. Knowledge Graph Completion (KGC) is one of the popular technologies for knowledge supplement. Accordingly, there has a growing concern over the KGC technologies. Recently, there have been lots of studies focusing on the KGC field. To investigate and serve as a helpful resource for researchers to grasp the main ideas and results of KGC studies, and further highlight ongoing research in KGC, in this paper, we provide a all-round up-to-date overview of the current state-of-the-art in KGC. According to the information sources used in KGC methods, we divide the existing KGC methods into two main categories: the KGC methods relying on structural information and the KGC methods using other additional information. Further, each category is subdivided into different granularity for summarizing and comparing them. Besides, the other KGC methods for KGs of special fields (including temporal KGC, commonsense KGC, and hyper-relational KGC) are also introduced. In particular, we discuss comparisons and analyses for each category in our overview. Finally, some discussions and directions for future research are provided.
Preprint
Full-text available
In recent years, Knowledge Graph (KG) development has attracted significant researches considering the applications in web search, relation prediction, natural language processing, information retrieval, question answering to name a few. However, often KGs are incomplete due to which Knowledge Graph Completion (KGC) has emerged as a sub-domain of research to automatically track down the missing connections in a KG. Numerous strategies have been suggested to work out the KGC dependent on different representation procedures intended to embed triples into a low-dimensional vector space. Given the difficulties related to KGC, researchers around the world are attempting to comprehend the attributes of the problem statement. This study intends to provide an overview of knowledge bases combined with different challenges and their impacts. We discuss existing KGC approaches, including the state-of-the-art Knowledge Graph Embeddings (KGE), not only on static graphs but also for the latest trends such as multimodal, temporal, and uncertain knowledge graphs. In addition, reinforcement learning techniques are reviewed to model complex queries as a link prediction problem. Subsequently, we explored popular software packages for model training and examine open research challenges that can guide future research.
ResearchGate has not been able to resolve any references for this publication.