Network of the Day: Interactive Visualization of Time-Dependent Entity Relation Networks from Online Sources

Network of the Day:
Aggregating and Visualizing Entity Networks from Online Sources
Darina Benikova, Uli Fahrer, Alexander Gabriel, Manuel Kaufmann,
Seid Muhie Yimam, Tatiana von Landesberger, Chris Biemann
Computer Science Department, TU Darmstadt, Germany
www.tagesnetzwerk.de
Abstract
This software demonstration paper presents a project on the interactive
visualization of social media data. The data presentation fuses German
Twitter data and a social relation network extracted from German online
news. This fusion allows for comparative analysis of the two types of
media. Our system additionally enables users to explore relationships
between named entities and to investigate events as they develop over
time. Users are actively involved through cooperative tagging of
relationships. The system is available online for a broad audience.
1 Introduction
The constantly growing interest in social media raises a need for new
tools that enable a wide audience to analyze and explore the available
data. Our work addresses this need with the interactive online visual
system Network of the Day (Netzwerk des Tages). It combines information
extracted from the social media platform Twitter and from online
newspaper articles. Network of the Day offers politically interested
non-experts a transparent exploration of current media.
The visualization shows the most important current entities discussed in
online media in a compact and interactive form. The presented data is
kept up to date on a daily basis. We present the media data in several
interlinked views. First, we extract and show the relationships between
entities (i.e., persons and organizations) in a network. Interaction
with this network enables users to tag the relations between entities,
which adds semantics to the data. Second, a line chart shows how often
the most popular entities of the respective day occurred over the past
months. This makes it possible to spot the development of important
topics over time. Third, the combination of both sources enables the
user to compare commonalities and differences between the two media.
Finally, the user can search for entities of interest in order to gain
information on media developments that are relevant to her.
This work is licensed under a Creative Commons Attribution 4.0
International License (CC BY 4.0). Page numbers and proceedings footer
are added by the organizers. License details:
http://creativecommons.org/licenses/by/4.0/
2 Related work
Summarizing and extracting information from media databases has been a
task of great interest in natural language processing, as the amount of
information is too large to be processed by humans without automatic
aids.
In recent years, the possibilities for expressing opinions and
communicating via social media have increased, resulting in a surge of
sentiment analysis tools (Pang and Lee, 2008). In particular, there is a
need for filtering and exploring events and opinions in high-volume
social media data.
The visualization of social network data, Twitter data and news has
gained importance, and several approaches have been developed. TextViz¹
provides an overview of text visualization techniques from various
areas. Most relevant to our work are the visualization of word
co-occurrence in Twitter messages and visualizations of relations
between named entities. For example, Phrase Nets (Van Ham et al., 2009)
show the co-occurrence of words as a network; however, they do not
allow exploring time-dependent changes. In contrast, Topic Competition
(Xu et al., 2013) shows the development of word and topic frequencies
over time, but the relationships between topics and entities are not
visible. A further relevant work by Biemann et al. (2004) shows paths
through networks extracted from news. While this software is
interactive, relations between entities cannot be labeled interactively
and developments over time are not shown.
¹ http://textvis.lnu.se
In this work, social media communication is represented by the Twitter²
platform. Meckel and Stanoevska-Slabeva (2009) investigated how politics
is reflected on Twitter. Twitterbarometer³ is a tool developed by the
Buzzrank company that measures the political mood in real time by
capturing tweets related to parties, as indicated by hashtags, and
classifying them as positive or negative.
3 Description of main components
This section presents the main components of the project. We first
describe the data sources, their deployment and their processing. We
then present the two main components of the project: the Twitter
contrast analysis and Network of Names. These components form the basis
of the new system presented in Section 4.
3.1 Data Sources
The data sources used in our system are online news from “Wörter des
Tages” and online messages from Twitter.
3.1.1 Online News
The project “Wörter des Tages”⁴ (Quasthoff et al., 2002) serves as our
source of daily news articles. Frequently appearing words are extracted
daily by a text mining suite from daily newspapers and news services.
² http://www.twitter.com
³ http://twitterbarometer.de
⁴ http://www.wortschatz.uni-leipzig.de/wort-des-tages/
The project “Wörter des Tages” extracts its data mostly from German
online sites, resulting in a daily volume of approximately 20,000 to
50,000 sentences. The texts are segmented and indexed, term frequencies
are counted and statistically significant co-occurrences are computed.
The main parameters for the term selection are the frequency of the
term in the current daily corpus, its frequency in the reference corpus
“Deutscher Wortschatz” and the ratio of its relative frequencies in the
two corpora (Quasthoff et al., 2002).
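The three selection parameters named above can be illustrated with a small sketch. It is not the “Wörter des Tages” implementation: the function name `select_terms`, the thresholds `min_daily_count` and `min_ratio`, and the add-one smoothing are assumptions made only for this illustration.

```python
from collections import Counter


def select_terms(daily_tokens, reference_counts, reference_size,
                 min_daily_count=2, min_ratio=5.0):
    """Rank terms that are much more frequent today than in a reference corpus.

    daily_tokens:     list of tokens from the current daily corpus
    reference_counts: dict term -> absolute frequency in the reference corpus
    reference_size:   total number of tokens in the reference corpus
    The thresholds are hypothetical defaults, not values from the paper.
    """
    daily_counts = Counter(daily_tokens)
    daily_size = sum(daily_counts.values())
    selected = []
    for term, count in daily_counts.items():
        if count < min_daily_count:
            continue
        daily_rel = count / daily_size
        # Add-one smoothing (an assumption) so that terms missing from the
        # reference corpus do not cause a division by zero.
        ref_rel = (reference_counts.get(term, 0) + 1) / (reference_size + 1)
        ratio = daily_rel / ref_rel
        if ratio >= min_ratio:
            selected.append((term, count, ratio))
    # Highest ratio of relative frequencies first.
    selected.sort(key=lambda item: item[2], reverse=True)
    return selected
```

Function words that are frequent in both corpora receive a ratio close to one and are filtered out, while terms that are unusually frequent on the given day rise to the top.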
3.1.2 Twitter
We download Twitter data using its public Streaming API⁵, which gives
developers access to Twitter’s global stream of Tweet data. This stream
is filtered by the previously selected most important keywords, i.e.,
those extracted as described by Quasthoff et al. (2002).
3.2 Basis Software Components
Two recent works form the basis of this project: Fahrer’s
implementation (2014) of a Twitter contrast analysis, which shows words
frequently co-occurring with search terms, and the work of Kochtchi et
al. (2014), which visualizes the relationships between people and
organizations using online newspaper articles as a source. Both
projects provide full provenance information, i.e., users are not only
able to see and manipulate the display of automatically extracted
relationships, but also to access the text sources from which the
relationships are extracted.
3.2.1 Twitter contrast-analysis
The component by Fahrer (2014) provides a contrastive co-occurrence
analysis that contrasts two separate keywords regarding their strongly
associated words in Twitter messages. For example, Figure 1 shows a
contrastive analysis for the keywords Brüderle and Trittin, who are
prominent German politicians from two different parties. The left side
of the graph shows words co-occurring only with the keyword Brüderle
and the right side shows words co-occurring only with Trittin. The
overlap in the middle indicates words that co-occur with both terms.
Results show that the overlap in the contrast analysis gives a sensible
reflection of main political events. Furthermore, most of the relevant
newspaper topics regarding the contrastive analysis are reflected in
Twitter.
⁵ https://dev.twitter.com/docs/api/
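The three-way split of the contrast analysis (left, overlap, right) can be sketched as a set operation over the strongest co-occurring words of each keyword. This is only an illustration under assumptions: the function name `contrast`, the input format (a dict from word to association score) and the cut-off `top_n` are hypothetical and not taken from Fahrer (2014).

```python
def contrast(cooc_a, cooc_b, top_n=20):
    """Split the strongest co-occurring words of two keywords into three groups.

    cooc_a, cooc_b: dict word -> association score for keyword A and keyword B
    top_n:          number of strongest words kept per keyword (an assumption)
    """
    top_a = set(sorted(cooc_a, key=cooc_a.get, reverse=True)[:top_n])
    top_b = set(sorted(cooc_b, key=cooc_b.get, reverse=True)[:top_n])
    return {
        "only_a": sorted(top_a - top_b),  # left side of the display
        "both": sorted(top_a & top_b),    # overlap in the middle
        "only_b": sorted(top_b - top_a),  # right side of the display
    }
```

The association scores would come from a co-occurrence measure such as the log-likelihood ratio; any score in which larger means more strongly associated works with this sketch.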
The data for a study on the German parliamentary election was collected
from Twitter between August 2, 2013 and October 9, 2013. Overall, a
corpus of 10,524,367 Twitter messages was collected. For the
tokenization, the Twitter tokenizer from Gimpel et al. (2011) was
employed. To determine the words strongly co-occurring with a given
word, the log-likelihood measure (Dunning, 1993) was applied and the
vocabulary was ranked by descending values (Fahrer, 2014).
Figure 1: Sample contrastive analysis with the search terms “Brüderle”
(light bars) and “Trittin” (dark bars) with 40 result terms, cf.
(Fahrer, 2014)
3.2.2 Network of Names
The second basic component is the exploration of relationships between
named entities presented by Kochtchi et al. (2014). This interactive
system derives a social network graph from information extracted from
online newspaper articles.
The visualization enables users to explore and investigate the
relationships between people and organizations of public interest,
reflecting the interaction between public protagonists and the
influence of their surroundings, sociality and public policy. Kochtchi
et al. (2014) used the Leipzig Corpora Collection (Richter et al.,
2006), containing about 70 million sentences extracted from German
online newspapers between 1995 and 2010, as the text source of their
project.
In the course of preprocessing, Kochtchi et al. (2014) extracted named
entities using the Stanford Named Entity Recognizer (Faruqui and Padó,
2010; Finkel et al., 2005) and calculated normalized PMI scores (Bouma,
2009) of co-occurrence. The Network of Names component offers the
possibility of collaborative social tagging. By clicking on the edges
between entities, users can enter a label for the relationship. The
users base these labels on the sentences containing the two entities,
which are shown in an extra frame next to the relationship. While the
Network of Names was a static visualization of a large corpus, we use
parts of this technology to create daily networks and components that
display changes over time.
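For reference, the normalized PMI score of Bouma (2009) mentioned above divides pointwise mutual information by the negative log of the joint probability, which bounds the value to the range from -1 to 1. The sketch below assumes that co-occurrence is counted per sentence; the function name `npmi` and the handling of the degenerate cases are choices made for this illustration, not details from Kochtchi et al. (2014).

```python
import math


def npmi(count_xy, count_x, count_y, total):
    """Normalized pointwise mutual information of two entities.

    count_xy: number of sentences containing both entities
    count_x:  number of sentences containing the first entity
    count_y:  number of sentences containing the second entity
    total:    number of sentences in the corpus
    """
    if count_xy == 0:
        return -1.0  # limit value: the entities never occur together
    p_xy = count_xy / total
    if p_xy == 1.0:
        # Both entities occur in every sentence; the score is undefined
        # (0/0). Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

A value of 1 means the two entities only ever occur together, 0 means they co-occur as often as expected under independence, and -1 means they never co-occur.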
4 Combination of social-media and
computer-mediated communication
The main goal of “Network of the Day” is to present current main topics
and their relationships on the basis of combining online news and
social media. The combination contrasts the presentation of events by
German online media with the reaction of a part of the German Twitter
community.
Figure 2 illustrates the visualization for networks extracted from
daily news. Our visualization comprises four main parts, which are
interactively linked: daily network, social tagging, time line and
Twitter contrast analysis.
Figure 2: Visualization of a Network of the Day for September 8, 2014
after a search for “Fernando Alonso”. Two clusters about motor sports
are unfolded, the sources for the link between “Nico Rosberg” and
“Mercedes” are shown and their relation is labeled as “fährt für”
(drives for).
Networks are constructed on a daily basis, represent important events
of the day, and can be visually compared to networks from the past.
Each network shows the relationships between the most important persons
and organisations of the day. Entities are nodes and their
co-occurrence is denoted by edges. The user can select entities from
the graph and view their most important co-occurring terms over time.
The network is clustered with the Markov Cluster Algorithm (van Dongen,
2000), and clusters can be unfolded and collapsed by clicking on them.
Each cluster is labeled with its three most central nodes, as
determined by the PageRank algorithm (Page et al., 1999). We use a
flexible force-directed layout for the graph rendering, implemented
with the D3.js⁶ JavaScript visualization library.
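The cluster labeling step can be sketched as follows. The sketch runs a plain power-iteration PageRank on the weighted co-occurrence edges inside one cluster and returns the three highest-ranked nodes. Whether the system ranks nodes within each cluster or on the whole network, and which damping factor it uses, is not stated above; restricting the computation to the cluster, the damping value 0.85 and the names `pagerank` and `cluster_label` are assumptions.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph.

    adjacency: dict node -> dict neighbour -> edge weight; an undirected
    edge must be stored in both directions.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            out_weight = sum(adjacency[node].values())
            if out_weight == 0:
                # Isolated node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
                continue
            for neighbour, weight in adjacency[node].items():
                new_rank[neighbour] += damping * rank[node] * weight / out_weight
        rank = new_rank
    return rank


def cluster_label(cluster_nodes, edges, top_k=3):
    """Return the top_k most central nodes of one cluster.

    cluster_nodes: iterable of node names belonging to the cluster
    edges:         iterable of (node_a, node_b, weight) co-occurrence edges
    """
    members = set(cluster_nodes)
    adjacency = {node: {} for node in members}
    for a, b, weight in edges:
        # Only edges inside the cluster are considered (an assumption).
        if a in members and b in members:
            adjacency[a][b] = adjacency[a].get(b, 0) + weight
            adjacency[b][a] = adjacency[b].get(a, 0) + weight
    rank = pagerank(adjacency)
    # Sort by descending rank; ties are broken alphabetically.
    return sorted(members, key=lambda node: (-rank[node], node))[:top_k]
```

In a star-shaped cluster the hub receives the highest rank, followed by the nodes attached to it with the strongest edges, so the label names the entity that holds the cluster together.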
Clicks on links result in the display of source sentences, which are
linked to the original online articles. Users can tag relationships
between entities using the interactive social tagging component, see
the right side of Figure 2. Further, selecting an edge also invokes a
contrast analysis of the two connected entities based on Twitter data,
cf. Section 3.2.1 (not shown due to space constraints). The search mask
allows the user to search for entities of her choice in arbitrary time
spans and to obtain a detailed analysis. This allows for user-specific
exploration of current and past social media.
The development of word frequencies over time is exemplified in Fig. 3
and displayed below the network. Initially, the diagram shows terms
that were popular on the respective day, but arbitrary terms from the
network can be selected and compared in the frequency diagram.
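The data behind such a frequency diagram is one count series per selected term. A minimal sketch, assuming that mentions are available as (day, term) pairs with days given as ISO date strings; the function name `daily_frequencies` and this input format are hypothetical.

```python
from collections import Counter, defaultdict


def daily_frequencies(mentions, terms):
    """Build one frequency series per term for a line chart.

    mentions: iterable of (day, term) pairs; days are ISO date strings
              such as "2014-09-08", which sort chronologically
    terms:    the terms selected for display
    Returns the sorted list of days and a dict term -> list of counts,
    aligned with that list. Days without any mention are not filled in.
    """
    counts = defaultdict(Counter)
    for day, term in mentions:
        counts[day][term] += 1
    days = sorted(counts)
    series = {term: [counts[day][term] for day in days] for term in terms}
    return days, series
```

Selecting a different set of terms only changes the `terms` argument, which mirrors how arbitrary terms from the network can be added to the diagram for comparison.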
5 Outlook and Further work
Network of the Day offers a transparent aggregation of current media to
laypeople interested in politics and other daily affairs. Moreover, it
offers them the possibility to collaboratively tag interesting
relationships. Importantly, the visualization provides full provenance,
as the original sources are linked.
⁶ http://d3js.org/
Figure 3: Frequency diagram of trending terms on
September 8, 2014, reflecting the bi-weekly schedule
of Formula 1 races.
By extracting current information on relations, people, organizations
and events from Twitter, the result of this project may be used in
political education or serve voters as an overview. In this study, only
a comparison of data containing the search terms, as described above,
is provided. In a further study, a direct comparison of entities such
as persons, organizations and events appearing in both Twitter and
online newspaper articles could be conducted.
The software is available as an online website⁷ and is expected to be
finalized in October 2014.
⁷ Available at http://maggie.lt.informatik.tu-darmstadt.de/nod/ via
http://tagesnetzwerk.de/
Acknowledgements
“Netzwerk des Tages” (Network of the Day) is funded by BMBF via a grant
from Hochschulwettbewerb 2014⁸.
⁸ http://www.hochschulwettbewerb2014.de/
References
Chris Biemann, Karsten Böhm, Gerhard Heyer, and Ronny Melz. 2004.
Automatically building concept structures and displaying concept trails
for the use in brainstorming sessions and content management systems.
In Proceedings of I2CS, Guadalajara, Mexico. Springer LNCS.
Gerlof Bouma. 2009. Normalized (Pointwise) Mutual Information in
Collocation Extraction. In Proceedings of the International Conference
of the German Society for Computational Linguistics and Language
Technology, pages 31–40, Potsdam, Germany.
Ted Dunning. 1993. Accurate methods for the statistics of surprise and
coincidence. Computational Linguistics, 19(1):61–74.
Uli Fahrer. 2014. Contrastive Co-occurrence Analysis on Twitter for the
German Election 2013. In GI-Edition: Lecture Notes in Informatics,
pages 257–260, Potsdam, Germany.
Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a
German named entity recognizer with semantic generalization. In
Proceedings of KONVENS, pages 129–133, Saarbrücken, Germany.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005.
Incorporating non-local information into information extraction systems
by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on
Association for Computational Linguistics, pages 363–370, Michigan, USA.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel
Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey
Flanigan, and Noah A. Smith. 2011. Part-of-speech Tagging for Twitter:
Annotation, Features, and Experiments. In Proc. ACL-HLT-2011, pages
42–47, Portland, OR, USA.
Artjom Kochtchi, Tatiana von Landesberger, and Chris Biemann. 2014.
Networks of Names: Visual Exploration and Semi-Automatic Tagging of
Social Networks from Newspaper Articles. Computer Graphics Forum, 33(3).
Miriam Meckel and Katarina Stanoevska-Slabeva. 2009. Auch Zwitschern
muss man üben: Wie Politiker im deutschen Bundestagswahlkampf
“twitterten”. Neue Zürcher Zeitung.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999.
The PageRank citation ranking: Bringing order to the web. Technical
Report 1999-66, Stanford InfoLab, November. Previous number =
SIDL-WP-1999-0120.
Bo Pang and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis.
Found. Trends Inf. Retr., 2(1-2):1–135.
Uwe Quasthoff, Matthias Richter, and Christian Wolff. 2002. “Wörter des
Tages” - Tagesaktuelle wissensbasierte Analyse und Visualisierung von
Zeitungen und Newsdiensten. In ISI, pages 369–372.
Matthias Richter, Uwe Quasthoff, Erla Hallsteinsdóttir, and Chris
Biemann. 2006. Exploiting the Leipzig Corpora Collection. In
Proceedings of the IS-LTC 2006, pages 68–73, Ljubljana, Slovenia.
Stijn van Dongen. 2000. Graph Clustering by Flow Simulation. Ph.D.
thesis, University of Utrecht, Utrecht.
Frank Van Ham, Martin Wattenberg, and Fernanda B Viégas. 2009. Mapping
text with phrase nets. Visualization and Computer Graphics, IEEE
Transactions on, 15(6):1169–1176.
Panpan Xu, Yingcai Wu, Enxun Wei, Tai-Quan Peng, Shixia Liu, Jonathan
JH Zhu, and Huamin Qu. 2013. Visual analysis of topic competition on
social media. Visualization and Computer Graphics, IEEE Transactions
on, 19(12):2012–2021.