Mapping the Blogosphere with RSS-Feeds
Justus Bross, Matthias Quasthoff, Philipp Berger, Patrick Hennig, Christoph Meinel
Hasso-Plattner Institute, University of Potsdam
Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
{justus.bross, matthias.quasthoff, office-meinel}@hpi.uni-potsdam.de
{philipp.berger, patrick.hennig}@student.hpi.uni-potsdam.de
Abstract— The massive adoption of social media has provided new ways for individuals to express their opinions online. The blogosphere, an inherent part of this trend, contains a vast array of information about a variety of topics. It is thus a huge think tank that creates an enormous and ever-changing archive of open source intelligence. Modeling and mining this vast pool of data to extract, exploit and describe meaningful knowledge in order to leverage (content-related) structures and dynamics of emerging networks within the blogosphere is the higher-level aim of the research presented here. This paper focuses on this project's initial phase, in which the above-mentioned data of interest needs to be collected and made available offline for further analyses. Our proprietary development of a tailor-made feed-crawler meets exactly this need. The main concept, the techniques and the implementation details of the crawler thus form the main interest of this paper and furthermore provide the basis for future project phases.
Keywords – weblogs, RSS feeds, network analysis, content analysis
I. INTRODUCTION
Since the end of the 1990s, weblogs have evolved into an integral part of the worldwide cyber culture [8]. By 2008, the worldwide number of weblogs had grown to more than 133 million [16][18]. Compared to around 60 million blogs in 2006, this underscores the increasing importance of weblogs in today's internet society on a global scale.
Technically, a weblog is an easy-to-use, web-enabled Content Management System (CMS) in which dated articles ("postings"), as well as comments on these postings, are presented in reverse chronological order [3]. Their potential fields of application are numerous, ranging from personal diaries over knowledge and activity management platforms to content-related and journalistic web offerings [9][13]. This diversity makes their point of origin hard to pin down.
One single weblog is embedded into a much bigger
picture: a segmented and independent public that
dynamically evolves and functions according to its own rules
and with ever-changing protagonists, a network also known
as the “blogosphere” [19]. A single weblog is embedded into
this network through its trackbacks, the usage of hyperlinks
as well as its so-called “blogroll” – a blogosphere-internal
referencing system.
This huge think tank creates an enormous and ever-changing archive of open source intelligence [14]. However, the greatest asset of the blogosphere – the absence of any centralized control mechanism – is at the same time its biggest shortcoming: modeling and mining the vast pool of data generated by the blogosphere to extract, exploit and represent meaningful knowledge, and thereby to leverage (content-related) structures and dynamics of the emerging social networks residing in the blogosphere, has so far seemed virtually impossible.
II. PROJECT SCOPE
Facing this unique challenge, we initiated a project with the objective to map, and ultimately reveal, content-, topic- or network-related structures of the blogosphere by employing an intelligent RSS-feed crawler. A crawler, also known as an ant, automatic indexer, worm, spider or robot, is a program that browses the World Wide Web (WWW) in an automated, methodical manner [10].
A feed is a standardized format, provided as RSS or ATOM by almost all content providers on the internet, for easily distributing content information or news about a website [17]. In the blogosphere, RSS feeds are usually published whenever a new post or comment appears in a weblog. Due to their standardized format, machines or program routines can analyze feeds automatically and consequently provide subscribers with the feeds' updated and current content. The sum of all feeds can thus be said to represent the network's entire structure.
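For illustration, a minimal RSS 2.0 document of the kind the crawler consumes could look as follows (all names and values are invented):

    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>Example Blog</title>
        <link>http://blog.example.org/</link>
        <description>A fictitious weblog used for illustration.</description>
        <item>
          <title>First post</title>
          <link>http://blog.example.org/2009/10/first-post</link>
          <pubDate>Mon, 05 Oct 2009 12:00:00 GMT</pubDate>
          <description>Post summary, possibly containing links to other blogs.</description>
        </item>
      </channel>
    </rss>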
To allow the processing of the enormous amount of
content in the blogosphere, it is necessary to make that
content available offline for further analysis. Our feed-
crawler completes this assignment as described in the
following sections.
While section three focuses on the crawler's functionality and its corresponding workflows, section four digs deeper into the technical realization of the crawler. The time between notification of acceptance and the final submission of this research paper was used for further enhancements of the crawler, which are presented in section five. Section six is dedicated to related academic work describing distinct approaches of how and for what purpose the blogosphere's content can be mapped. Based on these insights, section seven provides an outlook as well as recommendations for further research, with the overall objective to further enhance our crawler and to ultimately use the data collected to reveal interesting patterns and insights of segmented blogosphere networks. Section eight discusses the limitations of the current implementation. A conclusion is given in section nine, followed by the list of references and an appendix with the attendant figures of the research discussed in this paper.
III. ORIGINAL CRAWLER FUNCTIONALITY
AND WORKFLOW
A. Action-Sequence of the Crawler
The crawler starts its assignment with a predefined, arbitrary list of blog URLs. It downloads all available post and comment feeds of each blog and stores them in a database. It then scans the feeds' content for links to other resources on the web, which are also crawled and downloaded in case they point to another blog. Again, the crawler scans the content of the additional blog's feeds for links to further weblogs. The crawler repeats this iterative process until it encounters a link that was already scanned, or until it comes across so-called "isles" – smaller networks of blogs that only link to each other and have no connection to the rest of the blogosphere. To prevent the crawler from getting stuck on one of these isles, the arbitrary starting list includes blog URLs from different geographical regions, as well as blogs covering diverse and unrelated topics. This furthermore increases the odds that the crawler covers the whole of the blogosphere within a minimum of time. With maximal content-related diversity in the starting list, data can also be meaningfully analyzed at an early stage. By following up links, it is guaranteed that any of the worldwide top weblogs will sooner or later be crawled; representing the most influential opinion leaders is therefore feasible.
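This action sequence can be condensed into the following sketch (Groovy, the language the crawler is written in, cf. section IV; fetchFeeds, extractLinks, isBlog and store are placeholders for the recognition and storage components described in the remainder of this section):

    // Simplified crawl loop: breadth-first traversal of the blogosphere,
    // starting from the predefined list of blog URLs.
    def queue   = seedUrls as LinkedList
    def visited = [] as Set

    while (!queue.isEmpty()) {
        def url = queue.poll()
        if (!visited.add(url) || !isBlog(url))   // skip known URLs and non-blogs
            continue
        def feeds = fetchFeeds(url)              // post- and comment-feeds
        store(feeds)                             // persist via the database layer
        extractLinks(feeds).each { link ->       // follow links found in the content
            if (!(link in visited)) queue << link
        }
    }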
B. Recognizing Weblogs
Whenever a link is analyzed, we first of all need to assess whether it points to a weblog, and with which software that blog was created. Usually this information can be obtained from attributes in the metadata of a weblog's header. It cannot, however, be guaranteed that every blog provides this vital information in the way described before. Instead, there is a multitude of archetypes across the whole HTML page of a blog that can be used to identify a certain class of weblog software. By classifying different blog archetypes beforehand on the basis of predefined patterns, the crawler is then able to identify at which locations of a webpage the required identification patterns can be found and how this information needs to be processed subsequently.
Originally the crawler knew how to process the identification patterns of the three most prevalent weblog systems1 around: MovableType, Blogger.com and WordPress.com [11]. In the course of the project, identification patterns of further blog systems will follow. In a nutshell, the crawler is able to identify any blog software whose identification patterns were provided beforehand.
1 http://wordpress.org, http://blogger.com, http://www.movabletype.org
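One such identification pattern is the generator meta tag that many blog systems emit in their HTML header. A simplified detection routine could look like the following sketch (the pattern list is illustrative and far from a complete rule set, and real pages may order the tag's attributes differently):

    // Sketch: identify the blog software from the generator meta tag, e.g.
    // <meta name="generator" content="WordPress 2.8" />
    String detectBlogSystem(String html) {
        def m = (html =~ /(?i)<meta[^>]+name=["']generator["'][^>]+content=["']([^"']+)["']/)
        if (!m) return null
        def generator = m[0][1].toLowerCase()
        ['wordpress', 'blogger', 'movable type'].find { generator.contains(it) }
    }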
C. Recognizing Feeds
The recognition of feeds can, like any other recognition mechanism, be configured individually for every blog software there is. Usually, a web service provider that wants to offer its content information in the form of feeds provides an alternative view in the header of its HTML pages, defined with a link tag. This link tag carries an attribute (rel) specifying the role of the link (usually "alternate", i.e. an alternate view of the page). Additionally, the link tag contains attributes specifying the location of the alternate view and its content type. The feed crawler checks the whole HTML page for exactly this type of information. In doing so, the diversity of feed formats employed on the web is a particular challenge for our crawler, since besides the current RSS 2.0 version, RSS 0.9, RSS 1.0 and ATOM, among others, are still used by some web service providers. Some weblogs moreover encode lots of additional information into the standard feed. At the moment, our crawler only supports standard, well-formed RSS 2.0 feeds, from which all the information of our currently employed object model is read out. It is the aim of our project team to support as many RSS formats as possible in the future.
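A sketch of this discovery step follows; it deliberately works with regular expressions rather than an XML parser since, as noted above, many real-world pages are not well-formed (the method name is ours):

    // Collect the href of every <link> tag announcing an alternate view
    // with a feed content type.
    List<String> findFeedUrls(String html) {
        def feedTypes = ['application/rss+xml', 'application/atom+xml']
        def urls = []
        (html =~ /(?i)<link[^>]*>/).each { tag ->
            if (tag.contains('alternate') && feedTypes.any { tag.contains(it) }) {
                def href = (tag =~ /(?i)href=["']([^"']+)["']/)
                if (href) urls << href[0][1]
            }
        }
        urls
    }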
D. Storing Crawled Data
Whenever the crawler identifies an adequate (valid) RSS feed, it downloads the entire corresponding data set. The content of a feed incorporates all the information necessary to give a meaningful summary of a post or comment – and thus of a whole weblog and ultimately of the entire blogosphere. General information like title, description and categories, as well as the timestamp indicating when the crawler accessed a certain resource, is downloaded first. The single items inside the feed represent the individual posts of a weblog. These items are also downloaded and stored in our database using object-relational mapping2 (refer to figure two in the appendix). The corresponding attributes are unambiguously defined by the standardized feed formats and by the patterns that define a certain blog software.
2 https://www.hibernate.org/
On top of the general information of a post, a link to its corresponding HTML representation is downloaded and stored as well. If certain information is not provided in a blog's feed, we are thus still able to use this link at a later point for extended analyses that would otherwise not be possible.
Comments are, next to posts, the most important form of content in blogs, and they are usually provided in the form of feeds as well. However, we need to take into account that a comment's feed information is not provided in the same form by all blog software systems. This again explains why we pre-defined distinct blog-software classes in order to provide the crawler with the necessary identification patterns of a blog system. Comments can either be found in the HTML header representation or in an additional XML attribute within a post feed. Moreover, comment feeds are not provided by every blogging system at all. With the predefined identification patterns, our crawler is nevertheless able to download the essential information of a comment and store it in our database.
Another important issue is the handling of links that are usually provided within the posts and comments of weblogs. In order to identify network characteristics and interrelations of blogs within the whole of the blogosphere, it is essential not only to store these links in the database, but also to record in which post or comment each link was embedded.
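A condensed sketch of this download-and-store step is given below. XmlSlurper is the parser our crawler actually uses (cf. section V); the Post and Link entities and the extractLinks helper are illustrative stand-ins for our Hibernate-mapped object model:

    // Parse a well-formed RSS 2.0 feed and persist each item, remembering
    // in which post every outgoing link was embedded.
    def rss = new XmlSlurper().parseText(new URL(feedUrl).text)
    rss.channel.item.each { item ->
        def post = new Post(
            title      : item.title.text(),
            url        : item.link.text(),
            published  : item.pubDate.text(),
            description: item.description.text(),
            crawledAt  : new Date())
        extractLinks(item.description.text()).each { target ->
            post.links << new Link(post: post, target: target)   // link provenance
        }
        session.save(post)   // Hibernate session
    }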
E. Refreshing Period of Crawled Data
How often a single blog is scanned by our crawler depends on its cross-linking and networking with other blogs. Blogs that are referenced by other blogs via trackbacks, links, pingbacks or referrers are visited with a higher priority than others. Well-known blogs that are referenced often within the blogosphere are consequently also revisited and updated more often by our original algorithm. It is quite possible that under this algorithm blogs of minor importance are visited only rarely – a side effect that we do not consider limiting at this time.
It would be fairly easy to put a different algorithm into action, for instance one making use of a ranking score. With such an "importance" score, it could be guaranteed that blog feeds considered less important by the community would be updated less often, but at least regularly. A different algorithm can at any time be put in place by substituting the so-called "scheduler" of our crawler (refer to figure one in the appendix). It might prove necessary at a later time to apply certain heuristics to the blogs in order to decide upon the order in which they are revisited or eventually dropped.
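Expressed in code, the current policy amounts to a priority queue ordered by reference counts; substituting the comparator substitutes the entire scheduling policy (a sketch; the reference counts would come from the link table in our database):

    // Scheduling sketch: blogs referenced most often are refreshed first.
    def inboundLinks = [:].withDefault { 0 }    // blog URL -> reference count
    def byReferences = { a, b -> inboundLinks[b] <=> inboundLinks[a] } as Comparator
    def schedule     = new PriorityQueue(64, byReferences)

    inboundLinks['http://blog.example.org/'] += 5   // updated while parsing links
    schedule << 'http://blog.example.org/'
    def next = schedule.poll()                      // most-referenced blog first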
F. Practical Experience
It is only a matter of minutes until the crawler finds a considerable number of links that need to be analyzed. With a predefined starting list of only four weblog URLs holding a fairly good ranking within one of the major weblog ranking sites such as Technorati3, the crawler originally found up to several hundred links within 3-4 minutes. It should also be noted that with such a starting list of well-known blogs, the chance that the crawler ends up on one of the before-mentioned isles is minimal; as a matter of fact, we did not experience such a dead end in this setting during our testing phase. If one chooses four blogs of minor importance instead, the chance of ending up in a dead end is considerably higher, since unimportant blogs usually have a rather low degree of referencing and interlinking within the blogosphere.
3 http://technorati.com/
We also experienced time-critical limitations during our testing phase. Due to the enormous number of links awaiting further analysis after only a couple of minutes of crawler activity, it takes time until a blog with all its postings, comments and the corresponding links is completely covered and downloaded. We accept this time issue, since it clearly demonstrates how widely ramified the blogosphere actually is. We therefore consider it more crucial in the first project phase that the crawler collects links from as many distinct blogs as possible, rather than covering each blog's content in its entirety. Once the crawler has identified a reasonable share of a (national) blogosphere, it might in a second phase become more important to cover the entire content of blogs (refer to section V regarding ongoing optimization efforts), and it might then make sense to adapt the crawling algorithm accordingly. It is in any case essential to allow the crawler a minimum operational time to actually deliver meaningful data for further analyses.
IV. IMPLEMENTATION DETAILS
The feed crawler software is implemented in Groovy4, a
dynamic programming language for the Java Virtual
Machine (JVM) [7]. Built on top of the Java programming
language, Groovy provides excellent support for accessing
resources over HTTP, parsing XML documents, and storing
information to relational databases. Features like inheritance
of the object-oriented programming language are used to
model the specifics of different weblog systems. Both the specific implementation of the feed crawler on top of the JVM, as well as its general architecture separating the crawling process into retrieval scheduling, retrieval, and analysis as described in section three, allow for a distributed operation of the crawler in the future.
inevitable once the crawler is operated in long-term
production mode. Both the structured nature of RSS and
ATOM feeds and the best practices developed for weblog
systems make the crawling of the blogosphere a task
different from regular web crawling. Instead of indexing
“documents” only, the feed crawler is aware of different
types of documents in a weblog system, i.e., postings and
comments, and can handle semantic relations between these
different types of documents, such as a comment being a
reply to a posting or another comment. By passing this information from retrieval to analysis and on to the next round of retrieval scheduling, the crawler collects information that will be much more valuable for the analyses we have in mind than regular web-crawler data.
4 http://groovy.codehaus.org/
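To illustrate the use of inheritance mentioned above, each supported blog system can be modeled as a subclass overriding the pattern-specific steps; the class and method names below are ours and merely indicate the structure, not the production code:

    // Each weblog system contributes its own identification patterns and
    // feed locations by overriding a common base class.
    abstract class BlogSystem {
        abstract boolean matches(String html)          // identification patterns
        abstract List<String> feedUrls(String html)    // post-/comment-feed locations
    }

    class WordPressBlog extends BlogSystem {
        boolean matches(String html) {
            (html =~ /(?i)<meta[^>]+generator[^>]+WordPress/).find()
        }
        List<String> feedUrls(String html) {
            findFeedUrls(html)   // generic link-tag scan, cf. section III.C
        }
    }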
V. ONGOING OPTIMIZATION EFFORTS
We observed several critical issues during the crawler's testing phase that put a severe strain on its performance.
It proved to be a major problem that the crawler downloaded non-HTML content such as pictures or videos. Since the crawler follows any link on a webpage, it often downloaded multimedia files exceeding 5 MB that do not contain any relevant information for our project. Downloading such huge data elements results in higher network traffic and consequently in lower crawler performance. Our solution was the inclusion of a so-called "black list" of file extensions like avi, mov, or png. Whenever the crawler encounters a link pointing to such a file extension, the downloading process is aborted and a fixed placeholder is stored as the content of that link in the database.
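A minimal version of this check might look as follows (the extension list is abbreviated):

    // Skip multimedia resources before downloading them; a fixed placeholder
    // is stored as the link's content instead.
    boolean isBlacklisted(String url) {
        def blacklist = ['avi', 'mov', 'png', 'jpg', 'gif', 'mp3']
        def path = url.toLowerCase().replaceAll(/[?#].*/, '')   // drop query/fragment
        blacklist.any { path.endsWith('.' + it) }
    }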
Another major issue was that the crawler could only handle valid RSS 2.0 feeds, because of which a considerable number of blogs could not be downloaded and consequently not analyzed at all. We therefore tested several frameworks (e.g. Project ROME5) to circumvent this problem, but ultimately realized that all these frameworks impose too strict demands on the validity of XML documents. Since the share of blogs offering neither RSS 2.0 nor ATOM feeds is minimal, it was decided to simply add compatibility for ATOM feeds to our crawler implementation.
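Distinguishing the two supported formats is straightforward, since each declares itself in its root element; a minimal dispatch could look like this (feedXml holds the downloaded feed document, and parseRss20 and parseAtom stand in for the format-specific parsing routines):

    // RSS 2.0 declares an <rss> root element, ATOM a <feed> root element.
    def root = new XmlSlurper().parseText(feedXml)
    switch (root.name()) {
        case 'rss' : parseRss20(root); break   // items under root.channel.item
        case 'feed': parseAtom(root);  break   // entries under root.entry
        default    : println "unsupported feed format: ${root.name()}"
    }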
In contrast to this finding, the share of weblogs providing invalid feeds – due to non-valid or inaccurate tags or other illicit character strings – was exceptionally high. The crawler makes use of the XmlSlurper embedded in Groovy, which is built upon the SAX implementation of Java. Since we are dependent upon this internal implementation, we can at the moment only react to errors during the parsing process. A solution that allows us to analyze virtually all XML pages is therefore crucial (refer to section VIII). We also intend to optimize the crawler framework by possibly abandoning the currently employed Hibernate library. Even though this library reduces complexity by mapping Java objects onto the database, the framework does not always seem able to cope with the excessive data collection of our crawler. Another enhancement was realized by extending the known blog classes in our framework (refer to section III.B) to Serendipity and TypePad, as well as to all those blogs that support the XHTML Friends Network6.
We also sought to improve crawler performance by expanding the hardware resources employed. One approach we took was to rely on the synergies of distributed computing: our central server hosted the database while the crawler software was running on up to three different clients downloading and parsing data into the central database. As can be inferred from figure three (see appendix), an increase in performance could indeed be observed. However, this increase was minimal, since the clients wasted a considerable amount of time and internal resources waiting for the central database to process their information.
This finding called for investments in central computing power. The new project hardware features a rack server with 24 GB RAM and eight cores at 2.4 GHz each. Within the first five days of operation, this hardware boosted the number of processed jobs by a factor of 10. Indexing those database fields used within WHERE clauses doubled this performance again, to meanwhile 500 newly found blogs – as opposed to 24 in the beginning. In summary, we were able to increase the performance of the crawler by a factor of 20 since the start of the project (refer to figure three in the appendix).
5 https://rome.dev.java.net
6 http://gmpg.org/xfn/
VI. RELATED WORK
Certainly, the idea of crawling the blogosphere is not a
novelty. But the ultimate objectives and methods behind the
different research projects regarding automated and
methodical data collection and mining differ greatly as the
following examples suggest:
While Glance et al. employ a data collection method in the blogosphere similar to ours, their subset of data is limited to 100,000 weblogs, and their aim is to develop an automated trend discovery method in order to tap into the collective consciousness of the blogosphere [6]. Song et al. in turn try to identify opinion leaders in the blogosphere by employing a special algorithm that ranks blogs according not only to how important they are to other blogs, but also to how novel the information is that they contribute [15]. Bansal and Koudas employ a similar but more general approach than Song et al., extracting useful and actionable insights with their BlogScope crawler about the "public opinion" of all blogs hosted on blogspot.com [1]. Extracting geographic location information from weblogs and indexing it to city units is the approach chosen by Lin and Halavais [12]. Bruns tries to map the interconnections of individual blogs with his IssueCrawler research tool [4]. His approach comes closest to our own project's objective of leveraging (content-related) structures and dynamics of emerging networks within the blogosphere. His data set and project scope are, however, not as extensive as ours, since he focuses on the part of the Australian blogosphere concerned with debating news and politics.
VII. OUTLOOK AND FURTHER RESEARCH
The feed crawling framework presented in this paper will allow for a variety of promising future applications. The most basic analysis will focus on the social structure exposed by the graph of interlinked weblogs. Per blog author or per region, interesting variables can be measured, e.g. the dominance of uni-directional links (e.g., followers linking to an opinion leader's blog) vs. balanced hyperlinks (e.g., authors mutually quoting each other), the frequency of newly appearing postings, clustering coefficients, etc. For this analysis, weblogs and postings can also be grouped by the categories and user annotations [5] assigned by their authors, and the same variables can be measured per category or tag.
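As an illustration of the most basic of these measurements, link reciprocity can be computed directly from the stored link relation (a sketch; edges stands in for the directed blog-to-blog links extracted by the crawler):

    // Share of balanced links in the blog graph: a link a -> b counts as
    // balanced if the inverse link b -> a exists as well.
    def reciprocity(Set edges) {
        def balanced = edges.count { def (a, b) = it; [b, a] in edges }
        edges ? balanced / edges.size() : 0
    }

    assert reciprocity([['a','b'], ['b','a'], ['a','c']] as Set) == 2/3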
More advanced studies of the material gathered will
involve temporal analyses. Special focus will be put on event
detection: It is a frequently recurring phenomenon in the
blogosphere that either an author starts a controversy in one
posting, or several authors pick up topics from the general
media, e.g. on politics, marketing, or other society-related
topics. Afterwards, myriads of other webloggers start
quoting these initial postings, comment on them, even do
serious research work underpinning either side of the
controversial debate, which frequently leads to the traditional
media picking up the topic again. It has not yet been
completely understood why and how the interest in certain
topics grows, while other, probably equally important topics
do not reach beyond occasional discussions in single
weblogs. To investigate such questions, the data gathered by
our crawler can be used to track topics (i.e. keywords such as
categories and tags) across hyperlinks, and also across
inverse hyperlinks which cannot be accessed directly
otherwise, as they are insufficiently mimicked by track-
backs, link-backs etc. A large part of these studies will deal
with an appropriate and meaningful visualization of these
data. The visualization efforts are in a first step based upon the open source visualization tool flare7, which will be configured to fit the needs of content representation in single weblogs and in larger subsets like national blogospheres. We will make use of an interactive visualization technique based on the 'Eigenfactor™ Metrics'8 called 'Well-Formed Eigenfactor', which is well suited to providing graphical overviews of citation networks.
The third major direction of research deals with making
the huge dataset created during the course of this project
available to researchers and the public. This will be
accomplished following the Linked Data principles [2]. The
Semantically Interlinked-Online Communities (SIOC)
project9 provides an RDF vocabulary to expose the structure
of online communities such as forums or blogs as RDF
graphs. Publishing these valuable data in a standardized format will help a wide range of researchers – from social scientists over computer network specialists to semantic web researchers – to investigate various aspects of web communities such as the blogosphere or the numerous national blogospheres.
7 http://flare.prefuse.org
8 http://eigenfactor.org/methods.htm
9 http://sioc-project.org
VIII. LIMITATIONS
RSS-feed information allows us to gain insights into the blogosphere that are not available elsewhere. Since the analysis of the data collected by our crawler is part of our project's second stage, it is out of scope for this paper; our focus here mainly lies on describing the functionality and working method of our crawler in detail. Our present crawler implementation constitutes a decent solution, sufficient for initial research on our topic and for beginning data analyses, but for the following reasons it is not fully optimized yet:
Firstly, the crawler can currently only handle well-formed RSS 2.0 and ATOM feeds. RSS 0.9 and RSS 1.0 feeds, as well as feeds that have additional information encoded into their standard information and are thus not well-formed, are not yet supported. Embedding a robust feed-parser framework therefore constitutes an essential element of future optimization efforts, as described in sections III.C and V.
An eventual reconfiguration of our crawling algorithm might pose an additional challenge to the project team. While a new crawling algorithm can be quickly put in place by substituting the "scheduler" of the crawler (refer to section III.E), it seems rather challenging to choose the right algorithm in the first place. The current setting of the crawler dictates gathering as many links from as many diverse blogs as possible, rather than downloading the entire content of one weblog before moving along to the next. It might therefore prove essential in a next step to make the crawling algorithm more intelligent, and thus more sensitive to our specific requirements regarding data collection, in order to analyze network characteristics in the blogosphere. It is indeed important for us to get as many links from as many different weblogs as possible, but once a critical mass is reached, the crawler should understand that it is more important at that point to reproduce a blog's entire content. Given indefinite time to collect the feeds of weblogs, the crawler would eventually reflect the whole blogosphere. Regrettably, this issue is highly time-critical, and the crawler may not always have that much time.
This is due to the fact that the number of entries visible in a feed is predefined and limited in every blog software. Even though this value can be changed after the software installation, hardly any weblog displays more than 10-20 entries in the corresponding list. This means that when a blog with a predefined feed size of 10, for instance, publishes its 11th post or comment, the entry for the corresponding first post would fall out of that list and would consequently no longer be attainable for our crawler. Fortunately, the widely deployed WordPress weblog software allows feeds to be queried arbitrarily far back in time, and the WordPress module of our crawler knows how to issue these queries. It proves to be a major challenge to find a fair balance between the minimum time a blog is not revisited and the objective to collect as many distinct blog feeds as possible.
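The corresponding crawler logic can be sketched as follows; the paged query parameter is WordPress' standard feed paging mechanism, while the surrounding names (blogUrl, storePost) are illustrative:

    // Walk a WordPress feed backwards in time until no older page exists.
    def page = 1
    while (true) {
        def xml
        try {
            xml = new URL("${blogUrl}/?feed=rss2&paged=${page}").text
        } catch (IOException e) {
            break                                   // e.g. HTTP 404: no more pages
        }
        def feed = new XmlSlurper().parseText(xml)
        if (!feed.channel.item.size()) break        // empty page: all posts covered
        feed.channel.item.each { storePost(it) }
        page++
    }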
IX. CONCLUSION
Generally, we investigate in what patterns, and to what extent, blogs are interconnected. In doing so we want to face the challenge of mapping the blogosphere on a global scale, possibly limited to national boundaries. The visualization of link patterns, a thorough social network analysis, and a quantitative as well as qualitative analysis of reciprocally linked blogs will to a large extent form the second part of our overall project, which will build upon the data collection method and technique described in this paper. We expect the so-called "A-list blogs" – the most influential blogs – to be overrepresented and central in the network, although other groupings of blogs can also be densely interconnected. We do, however, also expect a majority of blogs to link sparsely or not at all to other blogs – a notion we referred to as "isles" before – suggesting, contrary to common belief, that the blogosphere is only partially interconnected. Currently, the main objective of our project work is to implement a meaningful revisiting algorithm for our crawler that allows us to analyze daily updated content of (partial) blogospheres.
REFERENCES
[1] N. Bansal, N. Koudas, "Searching the blogosphere", Proc. 10th International Workshop on Web and Databases (WebDB 2007), June 15, Beijing, China, 2007.
[2] T. Berners-Lee, “Linked data”, 2006, Available:
http://www.w3.org/DesignIssues/LinkedData.html
[3] J. Bross, A. Acar, P. Schilf, C. Meinel, "Spurring Design Thinking through educational weblogging", Proc. 2009 IEEE International Conference on Social Computing, IEEE Press, vol. 4, Aug. 2009, pp. 903-908, doi: 10.1109/CSE.2009.207.
[4] A. Bruns, “Methodologies for Mapping the Political Blogosphere: An
Exploration Using the IssueCrawler Research tool”, First Monday,
vol. 12 no.5, 2007, available at:
http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/vie
w/1834/1718
[5] B. Gaiser, T. Hampel, S. Panke, Good Tags - Bad Tags: Social
Tagging in der Wissensorganisation, 1st ed. Waxmann, 2008
[6] N. S. Glance, M. Hurst, T. Tomokiyo, “BlogPulse: Automated Trend
Discovery for Weblogs”, WWW 2004 Workshop on the Weblogging
Ecosystem, ACM, New York, 2004,
http://www.blogpulse.com/papers/www2004glance.pdf
[7] J. Gosling, B. Joy, G. Steele, G. Bracha, The Java Language
Specification, 3rd ed. Amsterdam: Addison Wesley; 2005
[8] S. C. Herring, L. A. Scheidt, S. Bonus, E. Wright, "Bridging the Gap: A Genre Analysis of Weblogs", Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS), Track 4, 2004,
[9] H. Kircher, Web 2.0 - Plattform für Innovation. it - Information
Technology (49) 1, Oldenbourg Wissenschaftsverlag, pp. 63-6, 2007.
[10] M. Kobayashi, K. Takeda, "Information retrieval on the web". ACM
Computing Surveys (ACM Press) 32 (2): 144–173, 2000
[11] C. Leisegang, S. Mintert, "Sieben frei verfügbare Weblog-Systeme – Liebes Tagebuch...", iX 7/2008, p. 42: Blogging-Software.
[12] J. Lin, A. Halavais, “Mapping the blogosphere in America”,
Presented at the Workshop on the Weblogging Ecosystem at the 13th
International World Wide Web Conference (2004)
http://www.blogpulse.com/papers/www2004linhalavais.pdf
[13] M. Ojala, "Blogging for knowledge sharing, management and dissemination", Business Information Review, vol. 22, no. 4, pp. 269-276, 2005.
[14] J. Schmidt, Weblogs – Eine kommunikationssoziologische Studie, UVK Verlagsgesellschaft mbH, 2006.
[15] X. Song, Y. Chi, K. Hino, B. L. Tseng, “Identifying Opinion Leaders
in the Blogosphere”, Proceedings of the sixteenth ACM conference
on information and knowledge management (CIKM '07), pp. 971-
974, ACM, New York, USA, 2007
[16] Technorati, State of the blogosphere, 2008,
http://technorati.com/blogging/state-of-the-blogosphere.
[17] S. Thies, Content-Interaktionsbeziehungen im Internet. Ausgestaltung
und Erfolg, 1st ed., Gabler, 2005
[18] Universal McCann, International Social Media Research Wave 3, 2008, http://www.universalmccann.com/Assets/2413%20-%20Wave%203%20complete%20document%20AW%203_20080418124523.pdf
[19] D. Whelan, "In a fog about blogs", American Demographics, vol. 25, no. 6, July/August 2003, pp. 22-23.
APPENDIX
Figure 1. Action Sequence of RSS-Feed Crawler
Figure 2. Data Structure
Figure 3. Performance Development of RSS-Feed Crawler (Note: Logarithmic Scale on y-Axis)