new/s/leak – Information Extraction and Visualization for Investigative Data Journalists

Seid Muhie Yimam† and Heiner Ulrich‡ and Tatiana von Landesberger and Marcel Rosenbach‡ and Michaela Regneri‡ and Alexander Panchenko† and Franziska Lehmann and Uli Fahrer† and Chris Biemann† and Kathrin Ballweg

†FG Language Technology, Computer Science Department, Technische Universität Darmstadt
Graphic Interactive Systems Group, Computer Science Department, Technische Universität Darmstadt
‡SPIEGEL-Verlag, Hamburg, Germany
Abstract

We present new/s/leak, a novel tool developed for and with the help of journalists, which enables the automatic analysis and discovery of newsworthy stories from large textual datasets. We rely on different NLP preprocessing steps such as named entity tagging and the extraction of time expressions, entity networks, relations and metadata. The system features an intuitive web-based user interface that combines network visualization with data exploration methods and various search and faceting mechanisms. We report the current state of the software and exemplify it with the WikiLeaks PlusD (Cablegate) data.
1 Introduction

This paper presents new/s/leak[1], the network of searchable leaks, a journalistic software tool for investigating and visualizing large textual datasets (see the live demo[2]). Investigating unstructured document collections is a laborious task: the sheer amount of content can be vast; for instance, the WikiLeaks PlusD[3] dataset contains around 250 thousand cables. Typically, these collections largely consist of unstructured text with additional metadata such as date, location, or sender and receiver of messages. The largest part of these documents is irrelevant for journalistic investigations, concealing the crucial storylines. For instance, war crime stories in WikiLeaks were hidden and scattered among hundreds of thousands of routine conversations between officials. Therefore, if journalists do not know in advance what to look for in a document collection, they can only vaguely target all people and organizations (named entities) of public interest.

[1] http://newsleak.io
[2] http://bev.lt.informatik.tu-darmstadt.de/newsleak/
[3] https://wikileaks.org/plusd/about
Currently, the discovery of novel and relevant stories in large data leaks requires many people and a large time budget, both of which are typically not available to journalists. If the documents are confidential, like datasets from an informer, only a few selected journalists will have access to the classified data, and those few have to carry the whole workload. If, on the other hand, the documents are publicly available (e.g. a leak posted on the web), the texts have to be analyzed under enormous time pressure, because the journalistic value of each story decreases rapidly once other media publish it.
There is a plethora of tools (see Section 2) for data journalists that automatically reveal and visualize interesting correlations hidden within large number-centric datasets (Janicke et al., 2015; Kucher and Kerren, 2014). However, these tools provide very limited automatic support with visual guidance through plain text collections. Some tools include shallow natural language processing, but mostly restricted to English. There is no tool that works for multiple languages, handles large text collections, analyzes and visualizes named entities along with the relations between them, and allows editing what is considered an entity. Moreover, available software is usually not open source but rather expensive, and it often requires substantial training, which is unsuitable for a journalist under time pressure and without prior experience with such software.
The goal of new/s/leak is to provide journalists with novel visual interactive data analysis support that combines the latest advances in natural language processing and information visualization. It enables journalists to swiftly process large collections of text documents in order to find interesting pieces of information.
This paper presents the core concepts and architecture behind new/s/leak. We also show an in-depth analysis of user requirements and how we implement and address these. Finally, we discuss our prototype, which will be made available as open source[4] under a lenient license.

[4] http://github.com/tudarmstadt-lt/newsleak
2 Related work

Kirkpatrick (2015) states that investigative journalists should look at the facts, identify what is wrong with the situation, uncover the truth, and write a story that places the facts in context. This work describes the components of a traditional story discovery engine, which, on the technical side, consist of a knowledge base, an inference engine and an easy-to-use user interface for visualization.

Discussions with our partner journalists confirm this view: newsworthy stories are not evident from leaked documents, but following up on information from leaks might reveal relevant pointers for journalistic research.

Cohen et al. (2011) outline a vision for a "cloud for the crowd" system to support collaborative investigative journalism, in which computational resources support human expertise for efficient and effective investigations.
The DocumentCloud[5] and Overview[6] projects are the most popular tools comparable to new/s/leak; both are designed for journalists dealing with large sets of documents. DocumentCloud is a tool for building a document archive for the material related to an investigation. Similarly to our system, people, places and organizations are recognized in documents. We put additional focus on the UI by adding a graph-based visualization and better support for document browsing. Overview is designed to help journalists find stories in large numbers of documents by topical clustering. This approach is complementary to ours: rather than keyword-based topics, our tool centers around entities and text-extracted relations. Further, we added advanced editing capabilities for users to add entities and edit the whole network.

[5] https://www.documentcloud.org/home
[6] https://www.overviewdocs.com
Aleph.grano.cc visualizes connections of people, places and organizations extracted from text documents, but leaves the relationship types underspecified. Detective.io, a platform for collaborative network analysis, offers some of the visualizations and annotations we are working on. However, there is a crucial difference: Detective.io assumes a rigid data structure (e.g. "corporate networks") and the user has to fill the underlying database entirely manually. Our tool targets unstructured sources and extracts networks from text, enabling navigation to the sources. Jigsaw[7] extracts named entities from text collections and computes various plots from them, but has received little attention from journalists due to its unintuitive user interface.

With new/s/leak, we want to take existing tools a step further by combining the automation of entity and relationship extraction with an intuitive and appealing visual interface.

[7] http://www.cc.gatech.edu/gvu/ii/jigsaw
New/s/leak is based on two prior systems: Network of the Day[8] (Benikova et al., 2014) and Networks of Names[9] (Kochtchi et al., 2014). Both tools automatically analyze named entities and their relationships and present an interactive network visualization that allows users to retrieve the original sources for displayed relations. In the current project, we add support for faceted data browsing, a timeline, better access to source documents and the possibility to tag and edit visualizations.

[8] http://tagesnetzwerk.de
[9] http://maggie.lt.informatik.tu-darmstadt.de/thesis/master/NetworksOfNames
3 Objectives and User Requirements

The objective of new/s/leak is to support investigative data journalists in finding important facts and stories in large unstructured text collections. The two key elements are an easy-to-use interactive visualization and linguistic preprocessing.

To gain a more precise focus for our development, we conducted structured interviews with potential users. They seek an answer to the question "Who does what to whom?", possibly amended with "When and where?". At the same time, they always need access to the source documents to verify the machine-given answers. The outcomes of these interviews are summarized in the following requirements.
1) Identify key persons, places and organizations.
2) Browse the collection, identify interesting documents and read them in detail.
3) Analyze the connections between entities.
4) Assess temporal evolution of the documents.
5) Explore geographic distribution of events.
6) Annotate documents with findings as well as
edit the data to enhance data quality.
7) Save and share selected documents.
While some of these requirements (like document browsing) are standard search features, others (like annotation and sharing) are usually not yet integrated in journalism tools. The final version of our system will include all of these features.
4 Implementation details

4.1 System

The new/s/leak tool consists of two major parts, shown in Figure 1: the backend provides various NLP tools for document pre-processing and analysis (Sec. 5), while the frontend provides the interactive visualization for investigative journalism (Sec. 6). The backend and frontend components of new/s/leak are integrated in a modular way that allows, for example, adding a different relation extraction mechanism or support for further languages.
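To make this modularity concrete, the following minimal Python sketch shows what such a pluggable relation-extraction interface could look like. The class and method names are our own illustration, not the actual new/s/leak API:

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

class RelationExtractor(ABC):
    """Hypothetical plug-in point for alternative relation extractors."""

    @abstractmethod
    def extract(self, text: str, entities: List[str]) -> List[Tuple[str, str, str]]:
        """Return (entity1, label, entity2) triples found in the text."""

class CooccurrenceExtractor(RelationExtractor):
    """Sketch of the default strategy (cf. Sec. 5.2): relate co-occurring entities."""

    def extract(self, text, entities):
        present = sorted({e for e in entities if e in text})
        # every pair of entities appearing in the same document is related
        return [(a, "", b) for i, a in enumerate(present) for b in present[i + 1:]]
```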
Figure 1: Schema of new/s/leak for visual support of investigative analysis in text collections.
4.2 Demo datasets

We demonstrate the capabilities of our system on two well-known cases:

1) The well-investigated WikiLeaks PlusD "Cablegate" collection, a set of diplomatic cables spanning over 45 years and originating from US embassies all over the world.

2) The Enron email dataset (Klimt and Yang, 2004), a publicly available collection of over 600,000 email messages between 158 employees of the Enron Corporation. This example shows the tool's general applicability to email leaks.
5 Backend: information extraction

The new/s/leak backend uses different NLP tools for preprocessing and analysis, integrated with a relational database, a NoSQL document store and specific retrieval methods. First, documents are pre-processed and converted to an intermediate new/s/leak document representation. This format is generic enough to represent collections from any possible source, such as emails, relational databases or XML documents.

In a second step, our NLP preprocessing extracts important entities, metadata (like time and location), term co-occurrences, keywords, and relationships among entities. We store relevant documents and extracted entities in a PostgreSQL[10] database and an ElasticSearch[11] index. We explain the following steps using the WikiLeaks Cablegate dataset as an example, but the tool is not limited to this single dataset.
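As an illustration, the intermediate representation can be thought of as a plain container of text plus metadata triples (cf. Sec. 5.1). The following Python sketch, with invented field names and an invented example document, shows one possible encoding:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class NewsleakDocument:
    """Source-agnostic container: fits emails, database rows or XML records."""
    doc_id: str
    content: str                                    # plain document text
    # metadata as (name, type, value) triples, cf. Sec. 5.1
    metadata: List[Tuple[str, str, str]] = field(default_factory=list)

doc = NewsleakDocument(
    doc_id="doc-0001",
    content="SUBJECT: WEEKLY ECONOMIC REVIEW ...",
    metadata=[("created", "date", "1975-03-12"),
              ("sender", "text", "US Embassy Example")],
)
```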
5.1 Preprocessing

The first step of the new/s/leak workflow is preprocessing the input documents and representing them in new/s/leak's intermediate document representation. Preprocessing of the Cablegate dataset requires truecasing the original documents, which are mostly in capitals. Lita et al. (2003) indicate that truecasing improves the F-score of named entity recognition by 26%. We use a frequency-based approach for case restoration, based on a very large true-cased background corpus. Once case is restored, we extract metadata from the document, including the document creation date, subject, message sender, document creator, and other metadata already marked in the dataset. Metadata is stored in the database as triples (n, t, v), where n is the name, t is the data type (e.g. text or date) and v is the value of the metadata item. The tool further supports manual identification of metadata during the analysis and production stages.
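A minimal sketch of such a frequency-based case restorer is shown below: it memorizes, for every lowercased token, the most frequent surface form observed in a true-cased background corpus. The actual implementation may differ, e.g. in how it handles sentence-initial tokens or unseen words:

```python
from collections import Counter

def build_case_model(background_tokens):
    """Count surface forms per lowercased token in a true-cased corpus."""
    model = {}
    for tok in background_tokens:
        model.setdefault(tok.lower(), Counter())[tok] += 1
    return model

def truecase(tokens, model):
    """Restore case by choosing each token's most frequent surface form."""
    return [model[t.lower()].most_common(1)[0][0] if t.lower() in model else t
            for t in tokens]

model = build_case_model("the cable from Berlin mentions the Berlin embassy".split())
print(truecase("THE CABLE FROM BERLIN".split(), model))
# -> ['the', 'cable', 'from', 'Berlin']
```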
[10] http://www.postgresql.org
[11] https://www.elastic.co
Dataset      #Documents   #Entities    #Relations
WikiLeaks    251,287      1,363,500    163,138,000
Enron        255,636      613,225      81,309,499

Table 1: Statistics on the WikiLeaks PlusD and Enron datasets.
5.2 Entity, relation, co-occurrence, keyword, and event time extraction

Recognition of named entities and related terms is a key step in the investigative data-driven journalistic process. For this purpose, we first automatically identify four classes of entities, namely person (PER), organization (ORG), location (LOC) and miscellaneous (MISC), using the named entity recognition tool from Epic[12]. We assume a relationship between two entities whenever they co-occur in a document. In order to extract relevant keywords regarding two entities, we follow the approach by Biemann et al. (2007). Furthermore, we extract entity relation labels by computing document keywords using JTopia[13], which extracts relevant key terms based on part-of-speech information. To label the relationship between two entities, the most frequent keywords from the documents in which the two entities appear together are used. To extract temporal expressions, we use the HeidelTime tool (Strötgen, 2015), which disambiguates temporal expressions based on document creation times. For the WikiLeaks dataset, we extract and disambiguate more than 3.9 million temporal expressions.

[12] https://github.com/dlwh/epic
[13] http://github.com/srijiths/jtopia
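The following Python sketch illustrates this co-occurrence and labeling scheme under simplifying assumptions: per-document entity and keyword lists are taken as given, whereas the real pipeline derives them with NER and JTopia. The example entities and keywords are invented:

```python
from collections import Counter
from itertools import combinations

def extract_relations(docs):
    """docs: iterable of (entities, keywords) per document.
    Returns {(e1, e2): label}, labeling each co-occurring entity pair
    with the most frequent keyword of the documents they share."""
    pair_kws = {}
    for entities, keywords in docs:
        for pair in combinations(sorted(set(entities)), 2):
            pair_kws.setdefault(pair, Counter()).update(keywords)
    return {pair: kws.most_common(1)[0][0] for pair, kws in pair_kws.items() if kws}

docs = [(["Angela Merkel", "NATO"], ["summit", "defense"]),
        (["Angela Merkel", "NATO"], ["summit", "budget"])]
print(extract_relations(docs))  # {('Angela Merkel', 'NATO'): 'summit'}
```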
All extraction and processing workflows of the new/s/leak components are implemented using the Apache Spark cluster computing framework for parallel computation. Table 1 shows statistics for the WikiLeaks PlusD and Enron datasets.
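As a rough illustration, not the actual pipeline code, an entity-frequency count over a document collection could be parallelized in Spark as follows; load_documents and tag_entities are stubs standing in for the real ingestion and NER components:

```python
from pyspark import SparkContext

def load_documents():
    """Stub: stands in for the real document ingestion."""
    return ["Angela Merkel met NATO officials in Berlin.",
            "NATO convened a summit in Brussels."]

def tag_entities(text):
    """Stub: stands in for the real NER component (Epic, cf. Sec. 5.2)."""
    return [e for e in ("Angela Merkel", "NATO", "Berlin", "Brussels") if e in text]

sc = SparkContext(appName="newsleak-extraction")

# Distribute documents across the cluster; partitions are processed in parallel.
docs = sc.parallelize(load_documents())
entities = docs.flatMap(tag_entities)
counts = entities.map(lambda e: (e, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('NATO', 2), ('Angela Merkel', 1), ...]
```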
6 Frontend: interactive visualization

Journalists can browse through the document collection using an interactive visual interface (see Figure 2). It enables faceted document exploration within several views:

1) Graph view: shows named entities and their relations.
2) Map view: shows the document distribution in geographic space.
3) Document timeline view: shows document frequency over time.
4) Document view: composed of a) a document list and b) the document text for reading.
The views are interactive, so the user can browse and explore the document collection on demand. The user starts by exploring entities and their connections in the graph view or by searching for entities and keywords. All interactions in the views define a filter that constrains the current document set, which in turn changes the information displayed in the views. The user can assess document frequency on the map or in the timeline, drill down and select documents from the result list, and read them closely. User-selected entities are highlighted in the documents.
Graph view: entities and their co-occurrences. The graph view shows a set of entities as nodes and their connections as links. Node size denotes the frequency of an entity in the document collection; node color denotes the entity type. The co-occurrence of entities within the documents is shown by edge thickness and edge label. The user can explore the entities and their connections by expanding the graph along the neighbors of a selected entity (plus button). The expanded entities are slowly faded in, so that the user can easily spot which entities have appeared. Moreover, the user can drill down into the data by displaying the so-called ego network of a selected entity. Clicking on nodes and edges retrieves the respective documents.
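To make the graph semantics concrete, here is a small sketch of the underlying data model using networkx, with invented entities and made-up frequency numbers, including the ego-network drill-down:

```python
import networkx as nx

G = nx.Graph()
# node attribute freq -> node size, type -> node color
G.add_node("Angela Merkel", type="PER", freq=1250)
G.add_node("NATO", type="ORG", freq=890)
G.add_node("Berlin", type="LOC", freq=2100)
# edge weight -> thickness, label -> edge label (cf. Sec. 5.2)
G.add_edge("Angela Merkel", "NATO", weight=310, label="summit")
G.add_edge("Angela Merkel", "Berlin", weight=540, label="embassy")

# drill-down: ego network of a selected entity (its direct neighborhood)
ego = nx.ego_graph(G, "Angela Merkel", radius=1)
print(sorted(ego.nodes()))  # ['Angela Merkel', 'Berlin', 'NATO']
```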
Map view. The map view shows the document frequency distribution over geographic space. Users can hover over a country to see the number of documents mentioning this geographic area, effectively using the map for faceted search.
Document timeline. The document timeline shows the number of documents over time. We use a bar chart with a logarithmic scale, as it better accommodates the exponential characteristics of the document distribution. Users can drill down in time to see the document distribution over years, months or days. It is also possible to select a time interval for which the corresponding documents are shown in the document view (see below).
Document view. The document view shows a list of documents with their title or subject, as selected by the currently active filters. For large document collections, the documents are loaded on demand. The user can browse the list and identify documents for close reading (bold, open-folder icon). The document text view shows the full text of the document, in which the entities displayed in the graph are underlined; the underline color corresponds to the entity type. If the user has selected entities in the graph, these are additionally highlighted with the background color of the entity. This "close reading" mode enables users to verify hypotheses they perceive in the "distant reading" (Moretti, 2007) visualizations.

Figure 2: Interactive visualization interface, composed of the graph view on entities, the document distribution over time, the list of selected documents and a map of the document geo-distribution.
6.1 Faceted search

The different levels of visualization enable users to explore the dataset via navigation through the document collection (Miller and Remington, 2004). Another, highly complementary paradigm for information retrieval used in our system, and one highly expected by users, is searching. During navigation, users explore the document collection interactively, browse with the help of the visualization interfaces, zoom in and out, and gradually discover topics, entities and facts mentioned in the corpus.
In contrast, searching in our system assumes that the user issues a conventional faceted search query (Tunkelang, 2009), i.e. a free combination of a full-text query with filters on document metadata. The result of such a faceted search is the set of documents that satisfy the specified facets, i.e. search conditions formulated at query time that act as a filter. Facets can be any metadata initially associated with the document (e.g. date, sender or destination) or extracted automatically (e.g. named entities or topics). Search results are presented to the user as a list of documents or in the form of a graph that corresponds to the document view of our system (see Figure 2).
Search and navigation complement each other during a journalistic investigation. For instance, a user can get an idea of the information available in the collection via the interactive visualization interface (see above). This navigation session may trigger a concrete journalistic hypothesis. In that case, the user may next issue a specific search query to find documents that support or falsify the hypothesis. To implement this search functionality, we rely on ElasticSearch.
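For illustration, a faceted query of this kind could be expressed against ElasticSearch roughly as follows; the index and field names are invented for the example:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "query": {
        "bool": {
            "must": [{"match": {"content": "war crimes"}}],  # full-text part
            "filter": [                                      # metadata facets
                {"term": {"entities": "NATO"}},
                {"range": {"created": {"gte": "2003-01-01", "lte": "2004-12-31"}}},
            ],
        }
    }
}
hits = es.search(index="newsleak", body=query)
print(hits["hits"]["total"])
```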
7 User Study

We conducted a user study to analyze the usability of the functions provided in the interactive visual interface. We asked 10 volunteer students from various major subjects to perform investigative tasks using our tool. The tasks covered all views; they included finding documents of a particular date, opening a document, selecting countries on the map, showing and expanding an entity network, assessing entity types, etc. We assessed subjective user experience.
The results showed that the implemented visualizations were intuitive and the interactive functions were easy to use for the requested tasks. The volunteers especially appreciated the drill-down in the timeline and the fading in of newly appearing nodes and edges in the graph. The legend and tooltips were frequently used as guidance in the interface. Based on user feedback, we extended the zooming and panning functions in the graph and added highlighting of open documents in the document list. We also improved the graph layout and appearance to reduce edge and node overplotting.
8 Conclusion and future work

In this paper, we presented new/s/leak, an investigative tool that supports data journalists in extracting important storylines from large text collections. The backend of new/s/leak comprises different NLP tools for preprocessing, analyzing and extracting objects such as named entities, relationships, term co-occurrences, keywords and event time expressions. In the frontend, new/s/leak provides a network visualization with different views that support navigating, annotating and editing the extracted information. We also developed a demo system presenting the current state of the new/s/leak tool and conducted a user study to evaluate the effectiveness of the system.
We are currently extending the tool to meet the remaining journalists' requirements. In particular, we are adding features for annotating entities and their relations with explanations, and for saving a particular view in order to share it with colleagues or use it later. Moreover, we will provide journalists with the possibility to further edit this shareable view. Additional features for manual data curation will enhance data quality for analysis while ensuring protection of sources and compliance with legal requirements. We will also integrate an adaptive machine learning annotation approach (Yimam et al., 2016) into new/s/leak to automatically identify interesting objects based on the journalists' interactions and feedback. Further, we will investigate pulling in additional information from linked open data and the web.
Acknowledgements

The authors are grateful to the data journalists at SPIEGEL-Verlag for their helpful insights into journalistic work and for the identification of tool requirements. The authors wish to thank Lukas Raymann, Patrick Mell, Bettina Johanna Ballin, Nils Christopher Boeschen, Patrick Wilhelmi-Dworski and Florian Zouhar for their help with the system implementation and with conducting the user study. This work is funded by the Volkswagen Foundation under Grant No. 90 847.
References

D. Benikova, U. Fahrer, A. Gabriel, M. Kaufmann, S. M. Yimam, T. von Landesberger, and C. Biemann. 2014. Network of the day: Aggregating and visualizing entity networks from online sources. In Proc. NLP4CMC Workshop at KONVENS, Hildesheim, Germany.

C. Biemann, G. Heyer, U. Quasthoff, and M. Richter. 2007. The Leipzig Corpora Collection – monolingual corpora of standard size. In Proc. Corpus Linguistics, Birmingham, UK.

S. Cohen, C. Li, J. Yang, and C. Yu. 2011. Computational journalism: A call to arms to database researchers. In Proc. CIDR 2011, pages 148–151, Asilomar, CA, USA.

S. Janicke, G. Franzini, M. F. Cheema, and G. Scheuermann. 2015. On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In Proc. EuroVis, Cagliari, Italy.

K. Kirkpatrick. 2015. Putting the data science into journalism. Commun. ACM, 58(5):15–17.

B. Klimt and Y. Yang. 2004. The Enron Corpus: A New Dataset for Email Classification Research. In Proc. ECML 2004, pages 217–226, Pisa, Italy.

A. Kochtchi, T. von Landesberger, and C. Biemann. 2014. Networks of Names: Visual exploration and semi-automatic tagging of social networks from newspaper articles. Computer Graphics Forum, 33(3):211–220.

K. Kucher and A. Kerren. 2014. Text visualization browser: A visual survey of text visualization techniques. Online: textvis.lnu.se.

L. V. Lita, A. Ittycheriah, S. Roukos, and N. Kambhatla. 2003. tRuEcasIng. In Proc. ACL '03, Sapporo, Japan.

C. S. Miller and R. W. Remington. 2004. Modeling information navigation: Implications for information architecture. Human-Computer Interaction, 19(3):225–271.

F. Moretti. 2007. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso, London, UK.

J. Strötgen. 2015. Domain-sensitive Temporal Tagging for Event-centric Information Retrieval. Ph.D. thesis, University of Heidelberg.

D. Tunkelang. 2009. Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1–80.

S. M. Yimam, C. Biemann, L. Majnaric, Š. Šabanović, and A. Holzinger. 2016. An adaptive annotation approach for biomedical entity and relation recognition. Brain Informatics, pages 1–12.