christian strippel
Log File Analysis as a Method for Automated
Measurement of Internet Usage
1. Introduction
Within the last two decades, the technical and societal changes commonly
summarized under the term ›digitization‹ have led to numerous new forms
of media usage. In particular, the Internet has brought forth a multitude
of new and complex situations of usage that are increasingly challenging
the research methods of media and communication studies. In a time when
more and more people use the Internet to read the news, watch movies,
share pictures, chat with friends, discuss with strangers, or play games
against algorithms – to mention a few –, traditional research methods such
as online or telephone surveys reach their limits in terms of measurement
accuracy and validity (see, e.g., savage/burrows 2007). Particularly, when
it comes to the examination of individual usage patterns, distortions of
self-disclosure caused by reactive effects and the limited memory capacity
of respondents are a serious concern (see, e.g., ansolabehere/iyengar
1995). Accord-
ingly, there are systematic measurement errors in self-report data, for ex-
ample, when estimating media exposure (see, e.g., prior 2009; scharkow
2018). Observational methods such as media diaries (see, e.g., berg/düvel
2012) are considerably more accurate; however, they are time-consuming,
costly, and also have problems with regard to reactivity and reliability (see,
e.g., david/seo/german 2010).
The systematic and detailed investigation of these new and complex
forms of Internet use calls for alternative methods of data collection that
meet the corresponding challenges. In recent years, such methods have been
discussed under keywords such as ›digital methods‹ (rogers 2013), ›big
data‹ (mahrt/scharkow 2013), or ›computational social science‹ (lazer
et al. 2009). These approaches have in common that they aim to investigate
social phenomena through the analysis of computer-mediated communi-
cation, which manifests in so-called ›digital footprints‹ (thatcher 2014)
or ›digital traces‹ (freelon 2014) such as texts, pictures, videos, or meta-
data. In contrast to the concept of ›virtual methods‹ (hine 2005), which
basically adapts traditional methods to investigate communication in the
›virtual‹ environment of the Internet, the idea behind these approaches is
to also use the technical infrastructure of the Internet as an instrument for
(academic) research. In the context of Internet use studies, such computa-
tional methods can be useful to adequately capture the complexity of the
new situations of usage mentioned above: Automated procedures of data
collection are precise over long periods of time, cheap, and only partially
reactive or not reactive at all (see, e.g., janetzko 2008).
On the following pages, this article provides an overview of the ›log file
analysis‹ method as one example of computational methods in the context
of Internet use studies. Against the background of the increasing relevance
of interdisciplinary cooperation in the field of Internet research (such as web
science; see berners-lee et al. 2006; hendler et al. 2008), particularly be-
tween social and computer scientists (see, e.g., kinder-kurlanda/weller
2014), it is important that scholars with limited technical background know
more about the possibilities and limitations of this method. Accordingly,
this article aims to introduce the basic and most important aspects of log
file analyses for the purpose of Internet use studies to everyone interested
in working with it in future research projects. Although it cannot provide
a complete report and discussion of the current state of knowledge in the
field, it should nonetheless serve as a first starting point, referencing helpful
texts for closer reading and deeper examination.1
This text is structured as follows: In section 2, we will explain what log
files are and define them as technical metadata and digital traces of Internet
usage. For a better understanding of the technical basics, the process of a
prototypical user request on the Internet is described and three types of log
files are distinguished: client log files, server log files, and proxy log files.
1 I want to thank the editors as well as Salwa Aleryani, Laura Laugwitz and Roland Toth for
their helpful comments on an earlier version of this manuscript.
Section 3 gives an overview of the possibilities and limitations of log file
analyses on these three levels with regard to the requirements of academic
Internet use studies. Finally, the findings are summarized in section 4.
2. Log files as traces of Internet usage
In general, log files – or simply ›logs‹ – are automatically generated text
files that record specific technical information – such as the date, time,
and type of an event or executed action – about a broad range of events
taking place in a computer system or software application (for more back-
ground see chuvakin/schmidt/phillips 2013). This data is distinct from
the concrete meanings or contents of these events or actions; it is rather
their metadata. When we
open a document on our computer, for example, the respective log data
does not tell us whether this action was intended or not, or what this docu-
ment contains; but it tells us what type of command we gave, what kind
of document we opened, and when this action took place. In relation to
Internet use, those events or actions are data exchanges between comput-
ers; and log files are the metadata of these data exchanges. Whenever we
refer to ›log files‹ or ›log data‹ below, we mean this metadata of Internet
traffic. Other types of log files, such as application or message logs, as well
as keyboard, mouse or screen logging data, are not considered.
The analysis of log files originates from technical hardware and software
monitoring. In recent years, however, Internet-related log data in par-
ticular has increasingly gained a political dimension as it is used by com-
panies and states for extensive tracking of Internet users on a large scale
(»Mass Internet Metadata Surveillance«, ni loideain 2015). In the context
of Internet use studies, log data can be seen as traces of Internet usage. In
order to better understand where, when, and how log files are generated,
it is helpful to understand the technical processes in the background of
data traffic on the Internet. These processes are therefore described in more
detail below. We distinguish three types of log files: client log files, server
log files and proxy server log files.
Client and server log files
Whenever a user opens a web browser and visits a website on his or her com-
puter, two technically differentiated steps take place in the background:
First, the computer – or the respective web browser – sends a request via
the Internet to the computer on which the requested document is stored.
This second computer – or rather the software application running on it – re-
ceives this request and then sends the requested files back to the computer
from which the request came in the first place, so that the user can view the
corresponding website (see Figure 1). This two-sided exchange is based on
the so-called ›client-server model‹ of the Internet (see, e.g., berson 1992),
according to which the two computers take on different roles: The request-
ing computer is called ›client‹, whereas the computer providing the doc-
uments is called ›server‹.2 Both on the client and on the server side, each
request and response are recorded in dedicated log files: On the client side,
web browsers log Internet usage in complex databases to provide a ›brows-
ing history‹ (or other usability enhancements) based on this data. On the
other side, server log files (also ›server logs‹ or ›web logs‹) are broadly used
for technical performance monitoring and error analysis of the respective
server, which is why standardized formats have been developed (see, e.g.,
rice/borgman 1983). Furthermore, server log data is used to produce ac-
cess statistics in order to evaluate the impact of a published website or to
improve its usability or personalization (see section 3.1).
figure 1
Data retrieval with client- and server-side log files
Explanation: The client sends a request to the server, which in turn responds by providing
the requested website or files. This process is recorded on both sides in locally stored log
files (L).
2 It is important to note that, in this client-server model, clients are not only able to request and
receive data but also to send data, for example, when posting a tweet or uploading a video
on YouTube. In return, the corresponding servers are not only able to host and provide data
but also to receive and store the data sent by clients – at least in the context of dynamic web
applications and websites.
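
To make this two-step exchange concrete, the following minimal Python
sketch – using only the standard library and example.org as an assumed
stand-in for any web server – sends a request and receives the response;
the printed values correspond to the kind of information both sides can
write into their log files:

import http.client

# example.org is just an assumed placeholder for any web server
connection = http.client.HTTPConnection("example.org", 80)

# step 1: the client sends a request for a document
connection.request("GET", "/")

# step 2: the server sends the response back; both sides can log
# the date, time, requested resource, status, and size of this exchange
response = connection.getresponse()
print(response.status, response.reason, len(response.read()))

connection.close()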
Proxy server log files
The client-server model with clients on the one hand and servers on the
other is incomplete as long as we do not consider a third type of computer
that is essential to understand the structure and the principles of data
exchange on the Internet: proxy servers (or simply ›proxies‹). Generally
speaking, proxy servers can be seen as intermediaries between clients and
servers (see, e.g., luotonen/altis 1994): They are interposed between them
during the two-way exchange of request and response described above. In
such cases, the data traffic is redirected, possibly modified (according to
their function and programming), and finally relayed to the other address.
In the context of this redirected data retrieval, the proxy server also records
the respective events of data exchange in separate log files (see Figure 2).
Although proxies are servers in the client-server model, they have a dual
role: they act as servers to clients and as clients to servers. In this
intermediary position, proxy servers can have quite diverse functions. For
example, they can be used to anonymize the data trac routed through
them or to act as a firewall against spam and viruses, as an ad-blocker, or
as a ›buffer‹ or ›cache‹ for frequently requested files in order to make the
Internet’s data traffic more efficient (see, e.g., glassman 1994). For exam-
ple, many universities offer proxy servers with certain licenses to give their
employees and students access to paid databases.
figure 2
Data retrieval via an intermediary proxy server
Explanation: The client sends a request to the server via a proxy. The server responds
to the request by sending the requested page or files to the proxy, which relays it to the
client. This process is recorded in locally stored log files (L) for all three parties involved.
Technical standards and log file formats
To enable and enhance the interoperability of computer systems involved
in data traffic on the Internet, certain technical standards were defined in
the past by institutions such as the International Organization for Stand-
ardization (iso) or the World Wide Web Consortium (W3C). For example,
to enable clients and servers to communicate by sending requests and re-
sponses (with or without proxy servers involved), a sort of common lan-
guage is needed. This language is standardized in so-called ›protocols‹,
which can be understood as specific sets of technical communication rules
(see, e.g., hall 2009). Based on the Transmission Control Protocol (tcp),
for example, they can build up reliable connections; based on the Internet
Protocol (ip), they coordinate how to route their communication through
the network; and on the basis of the Hypertext Transfer Protocol (http)
they exchange the respective datagrams for providing and receiving docu-
ments such as websites on the ›World Wide Web‹ (berners-lee et al. 1994).
Other well-known and widely used protocols are the Simple Mail Transfer
Protocol (smtp), the Post Office Protocol (pop) and the Internet Message
Access Protocol (imap) to send and receive emails, or the File Transfer Pro-
tocol (ftp) to exchange computer files of any type.
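
As a brief illustration of this layering, the following sketch opens a tcp
connection and writes a plain http request onto it; example.org is again
only an assumed placeholder host:

import socket

# transport layer: a reliable tcp connection to the (assumed) host
sock = socket.create_connection(("example.org", 80))

# application layer: a plain http request written onto the tcp stream
sock.sendall(b"GET / HTTP/1.0\r\nHost: example.org\r\n\r\n")

# the first bytes of the response contain the status line and headers
print(sock.recv(200).decode(errors="replace"))
sock.close()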
In the case of log files, such technical standards were defined as well. The
most common log file standards are the Common Log Format set by the Na-
tional Center for Supercomputing Applications (ncsa) and the Extended
Log Format defined by the W3C (see ›web log file‹ in ince 2009). Both stand-
ards apply to server log files, because this is where standardization is most needed:
As shown at the beginning of this section, the function of server logs is to
record basic technical information about every client request that servers
receive. The standardization of this metadata serves its follow-up analysis
in the context of technical evaluation or the creation of access statistics. In
contrast to client log files, which are usually produced and analyzed by the
same software, namely the web browser, various software packages exist
for the analysis of server and proxy log files. In the process of their creation,
the standardization of their format already prepares server-sided log files
for further data processing of any kind. However, these standards also cre-
ate certain limitations for respective log file analyses, which is particularly
relevant for academic Internet use studies. Therefore, it is important to un-
derstand which technical information is being recorded. As an example, the
Common Log Format is described below.
Common Log Format
The Common Log Format is one of the oldest web log standards as it is
based on the ncsa HTTPd server, which was one of the earliest web serv-
ers on the World Wide Web at the beginning of the 1990s (see, e.g., kwan/
mcgrath/reed 1995). For that reason, this standard is also called ›ncsa
Common Log Format‹ or ›ncsa Access Log‹. As it is designed for http
servers, it is limited to the http protocol, which means that only requests
of documents on the basis of this protocol, such as websites, are recorded.
The logging works as follows: For each client http request received by a
server, this server adds a separate entry to its log file, which is basically a
chronological list of all requests received by this server so far. In the case of
the Common Log Format, every log file entry is a single text line contain-
ing seven fields of information, each dedicated to a specific aspect of the
respective client request. The syntax of this text line, i.e., the sequence of
these seven information fields, is structured as follows (see luotonen 1994):
remotehost rfc931 authuser [date] "request" status bytes
The technical information recorded in each of the seven fields is de-
scribed with examples in Table 1. Besides three fields for the technical iden-
tification of the requesting client (remotehost, rfc931 and authuser),
the date and time of each request ([date]) as well as the concrete request
line ("request"), the result of a request (status) and the size of the
requested files (bytes) are recorded (for more background see markov/
larose 2007: 143-155). Personalized data about the users behind the clients
or content-related information is not collected – at least as long as the re-
quested web address in the request line (usually a Uniform Resource Lo-
cator, url) does not contain meaningful terms that describe the requested
content. However, there are ways to identify users even on the basis of this
depersonalized data (see section 3.1).
Following the fictional examples in Table 1, a log file entry as it would
be shown in a text editor would look as follows:
192.168.178.35 - johndoe [25/Apr/2018:05:11:19 +2000]
"GET /index.html HTTP/1.0" 200 4150
The format and sequence of these fields are fixed and cannot be custom-
ized. Therefore, analyses that work with this log format are limited to this
metadata. That is why other formats emerged over time: One example is
the W3C Extended Log Format, which »permits customized logfiles to be
recorded in a format readable by generic analysis tools« (hallam-baker/
behlendorf 1996). Another example is the ncsa Combined Log Format,
which is an extension of the Common Log Format by three additional fields:
the referrer field with the url of the website that linked to the server,
the user_agent field with information about the web browser and the
operating system used by the client, and the cookie field for an advanced
user identification.3 The basic function and structure of the different log
file formats are nevertheless similar to the Common Log Format and there-
fore do not need to be explained in more detail.
3 Furthermore, depending on the server system, the different server log data is often also stored
in separate log files such as access logs, agent logs, error logs, and referrer logs (see, e.g., ber-
tot et al. 1997).
table 1
Common Log Format

Field: remotehost
Description: IP address of the client that made the request.
Example: 192.168.178.35

Field: rfc931 (or rfc1413)
Description: User identifier based on the RFC 931 standard, which should
»determine the identity of a user of a particular TCP connection«
(St. Johns 1985). This standard was later obsoleted by RFC 1413
(St. Johns 1993). In both cases, »this information is highly unreliable«
and only works »on tightly controlled internal networks« (Apache 2018),
which is why a hyphen usually marks the missing information.
Example: -

Field: authuser
Description: In case of password-protected documents and/or a required
user authentication, this is the name or pseudonym of the user. In most
cases, a hyphen here also marks missing information or an anonymous user.
Example: johndoe

Field: [date]
Description: Time stamp of when the request was received, containing the
local date and time and the offset from GMT, in the format
dd/MMM/yyyy:hh:mm:ss +-hhmm.
Example: [25/Apr/2018:05:11:19 +2000]

Field: "request"
Description: Request line as it came from the client, containing the HTTP
method, the requested resource and the HTTP protocol version.
Example: "GET /index.html HTTP/1.1"

Field: status
Description: HTTP status code, which marks the request result and is sent
back to the client (for an overview of all status codes for HTTP 1.1 see
Fielding/Reschke 2014).
Example: 200

Field: bytes
Description: Size of the requested file sent by the server, in bytes.
Example: 4150
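
Since every entry follows this fixed syntax, Common Log Format files can
be parsed line by line, for example with a single regular expression. The
following minimal Python sketch parses the fictional entry from above:

import re

# one named group per field of the Common Log Format
CLF = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

entry = ('192.168.178.35 - johndoe [25/Apr/2018:05:11:19 +2000] '
         '"GET /index.html HTTP/1.1" 200 4150')

match = CLF.match(entry)
if match:
    fields = match.groupdict()
    print(fields["remotehost"], fields["date"], fields["status"])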
Cookies
Since the three fields remotehost, rfc931 and authuser of the Common
Log Format are not sufficient for the server to identify a client behind an
arriving request, the Netscape Communications Corporation, which also
provided the Netscape web browser at that time, introduced so-called
›cookies‹ as a more advanced way of user identification. Similar to the rfc931
field of the Common Log Format, a cookie is also an identifier field, but
it is based on the rfc 6265 standard, which allows »http servers to store
state [...] at http user agents, letting the servers maintain a stateful session
over the mostly stateless http protocol« (barth 2011). This means that a
server can – in response to a client request – fill the cookie field with an
identification string and store it on the computer of the requesting client
(or user) to recognize them in later requests. This is helpful, for example, in
the context of online shopping, where a server has to constantly associate
clients with specific items in their virtual shopping carts (see, e.g., pitkow
1997: 1344). Another important function of cookies is to automate the
authentication in a login process when a website is revisited. In the
context of analyzing server log files, cookies can be used to better identify
individual clients who (repeatedly) visit a website (see section 3.1). Mainly,
two different types of cookies can be distinguished: First-party cookies
are set by the server where the requested website is stored, whereas third-
party cookies are set by other servers that provide external content items
such as banner advertisements.
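
The mechanism itself is simple: The server adds a Set-Cookie header to its
response, and the client returns the stored value with every later request,
which allows the server to link these requests together. A minimal Python
sketch of the client side, with an arbitrary assumed identifier value:

from http.cookies import SimpleCookie

# header a server might send with its response; ›abc123‹ is an
# arbitrary, assumed identification string
set_cookie_header = "session_id=abc123; Path=/; HttpOnly"

# the client stores the cookie and sends the value back with every
# later request, which lets the server recognize the client again
cookie = SimpleCookie()
cookie.load(set_cookie_header)
print(cookie["session_id"].value)  # abc123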
Against the background of this basic knowledge about log files and
their role in the technical infrastructure of the Internet, the possibilities
and limitations of the different forms of the log file analysis method for
the automated measurement of Internet usage will be introduced and dis-
cussed in the following section.
3. Log file analysis
As mentioned above, the log file analysis method originates from the au-
tomated procedures of monitoring web server performance that became
necessary with the client-server Internet in the early 1990s. Therefore, it
would be an exaggeration to call it a new or innovative method; after all,
it has a history of more than 20 years. However, its application in empiri-
cal research has been scarce, primarily when it comes to the measurement
of Internet usage (see, e.g., de vreese/neijens 2016: 70). Although some
anthologies such as The sage Handbook of Online Research Methods (janetzko
2008) or The International Encyclopedia of Communication (jansen/spink 2008)
contain introductions to this method, there is still a big knowledge gap
about the possibilities and limitations of the three different variants of log
file analysis for Internet use studies, especially amongst social scientists.
Whereas the analysis of server log files is, for several reasons, quite popular
and often used in the field of market research, both client and proxy log
file analyses are – despite their potential for Internet use research – still
in their infancy. This is problematic as no methodological standards have
yet emerged. Researchers working with these methods have to develop in-
dividual solutions that are often opaque, difficult to understand, and not
verifiable for follow-up research. Against this background, this section pro-
vides an overview of the three variants of log file analysis. It discusses their
possibilities and limitations for academic Internet use studies on the basis
of recent studies and addresses open questions for future work with these
method variants. The main criteria to evaluate the approaches are: meas-
urement accuracy, data validity, representativeness, reactivity, and costs.
3.1 Server log file analysis
With regard to the client-server model of the Internet introduced in the
previous section, the server log file analysis can be defined as a method
of evaluation and interpretation of the log file entries that are generated
by a server with every client request it receives. In the context of Internet
use studies, these entries can be read as traces left by Internet users in the
process of using websites or other http resources that are stored on this
respective server.
In contrast to the other two variants, the analysis of server log files is
quite an easy undertaking. On the one hand, it is based on already defined
technical standards such as the Common Log Format, which ensures a cer-
tain interoperability of the collected server log files with already existing
analysis software. On the other hand, its background as a technical proce-
dure of monitoring the server performance as well as the huge interest of
server providers and website owners to analyze and evaluate the demand
and usage of their products or services have led to serious efforts to im-
prove this method. As a result, in the last two decades a lot of knowledge
in the use of this method has been gained in the fields of information
technology and computer science as well as in commercial market and
audience research. Under the umbrella term ›web usage mining‹ (srivas-
tava et al. 2000), a huge amount of literature can be found on cleaning,
preprocessing, and clustering server log data in order to identify users and
their usage patterns on a website in further analyses (see, e.g., cooley/
tan/srivastava 1997; markov/larose 2007: 143-212; liu 2011: 449-483).
In the context of these analyses, metrics such as clicks, hits, page im-
pressions, and visits are used to describe usage behavior on a website more
specifically (see Table 2; for more metrics see, e.g., burton/walther 2001).
The basic challenge of web usage mining in this regard is the identifica-
tion of individual users and their usage behavior (e.g., cooley/mobasher/
srivastava 1999; suneetha/krishnamoorthi 2009). This is relevant for
being able to analyze and predict user navigation through a website, for
example, to improve its usability and personalization (see, e.g., mobasher/
cooley/srivastava 2000; benbunan-fich 2001; pierrakos et al. 2003),
but also to measure the reach of a website (see, e.g., stout 1997).
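
As an illustration, the ›visit‹ metric described in Table 2 is usually recon-
structed by grouping the requests of one client and starting a new visit
after a longer period of inactivity. The following minimal Python sketch
uses fictional data and a 30-minute threshold, which is a common heuristic
rather than a fixed standard:

from datetime import datetime, timedelta

# fictional, already parsed requests: (client identifier, time stamp)
requests = [
    ("192.168.178.35", datetime(2018, 4, 25, 5, 11, 19)),
    ("192.168.178.35", datetime(2018, 4, 25, 5, 14, 2)),
    ("192.168.178.35", datetime(2018, 4, 25, 7, 30, 0)),
]

TIMEOUT = timedelta(minutes=30)  # assumed visit boundary

visits = {}     # number of visits per client
last_seen = {}  # time stamp of each client's previous request
for client, ts in sorted(requests, key=lambda r: r[1]):
    # a new visit starts when the client was idle longer than TIMEOUT
    if client not in last_seen or ts - last_seen[client] > TIMEOUT:
        visits[client] = visits.get(client, 0) + 1
    last_seen[client] = ts

print(visits)  # {'192.168.178.35': 2}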
Popular freeware and open-source tools for the analysis of server log
files include, amongst others, Analog, AWStats, http-Analyze, Open Web
Analytics, and Webalizer. Additionally, Leiner, Scherr, and Bartsch (2016)
provide a software plug-in particularly for researchers to enrich survey data
with observational data. Furthermore, Audit Bureaus of Circulations, such
as the North American Alliance for Audited Media (aam) and the Informa-
tion Community for the Assessment of the Circulation of Media (ivw) in
Germany, and commercial web analytics services, such as Adobe Analytics,
Google Analytics, Webtrends, and SimilarWeb, offer external analyses of
website usage. For this purpose, they work with tracking code added to
the web pages of a website where it collects the respective usage data and
sends it to the analytics service (e.g., google 2018).
Due to the infrastructural support of cost-free software, the analysis of
server log files has become quite popular in academic research that focuses
on individual websites and platforms: It is cheap, non-reactive, and accurate
in its measurements even over a long time-span (see, e.g., jansen/spink/
taksa 2009). Accordingly, it is often used in consumer and marketing re-
search, which is interested in user behavior on individual websites (see, e.g.,
chatterjee/hoffman/novak 2003; sismeiro/bucklin 2004). However,
the literature also provides server log file analyses of online news sites (see,
e.g., aikat 1998; nicholas et al. 2000), websites of political parties (see, e.g.,
hooghe/teepe 2007), social networks (see, e.g., benevenuto et al. 2009;
burke/kraut 2016), or for the usage of audiovisual content (see, e.g., costa
et al. 2004; yu et al. 2006).
table 2
Metrics of server log file analysis

Click: Single user action that leads to a client request (e.g., a mouse click
on a hyperlink). Since one click can execute more than one client request
at a time (see ›Hit‹), it is not possible to directly deduce user actions from
the client requests.

Hit: Successful client request for a single item from a web page (e.g., a
single image). Since the request of a complete web page (see ›Page impres-
sion‹) produces a varying number of hits, this metric is not helpful to
measure usage behavior. It is mainly used to measure the load of the
respective server.

Page impression: Visual contact with a web page by a user: When all items
of a web page are successfully requested and sent to the client by the
server, the user who requested this web page is able to see it completely
in his or her browser. He or she gets a full impression of it. This metric is
also called ›page view‹. It is used to measure the exposure to the single
pages of a website, which is especially relevant in the context of commer-
cial audience research. When a user requests additional web pages of a
website by following hyperlinks, for example, this leads to multiple page
impressions (see ›Visit‹).

Visit: Coherent usage process of a website that starts at the moment a user
enters the website (with the first ›page impression‹) and ends when this
user leaves (which is actually not measurable but estimated): When a user
only requests one web page, his or her visit is limited to this single page
impression. But since a website usually consists of numerous pages, users
also follow hyperlinks to these other pages of the website. With the help
of the referrer field, cookies, and tracking code, it is – to a certain extent –
possible to connect the client requests made by a user and group them
into a visit. In this case, one visit contains more than one page impression.
A metric similar to the ›visit‹ (and recently also more often used) is the
›session‹ (meiss et al. 2009). Furthermore, in consumer and marketing
research, the path a consumer takes through a website is called a ›click-
stream‹ (see, e.g., park/fader 2004; montgomery et al. 2004).

Another important field of server log file analysis is research on the use
of information retrieval systems such as web search engines (e.g., silver-
stein et al. 1999; spink et al. 2001; jansen/spink 2006; for mobile web
search see, e.g., kamvar et al. 2009), digital libraries (e.g., jones et al. 2000;
nicholas/huntington/watkinson 2005), or information and commu-
nication systems (e.g., deane/podd/henderson 1998; dirksen/huizing/
smit 2010).4 In political science and communication studies, for example,
search engine log files are used to identify search trends as an indicator for
the public agenda in agenda-setting research (see, e.g., weeks/southwell
2010; scharkow/vogelgesang 2011; maurer/holbach 2016). Additionally,
server log files are used as so-called ›paradata‹ (see, e.g., kreuter 2013) for
the automated observation of participants in the context of psychological
experiments, which is helpful to better understand the experimental out-
comes (e.g., lee 2008; garrett 2009; bulkow/urban/schweiger 2013).
In addition, they are combined with other methods such as ethnography
(e.g., clark et al. 2006), network analysis (see, e.g., dirksen/huizing/
smit 2010), and surveys (see, e.g., goulet/hampton 2012).
Limitations
In the literature, three main limitations of server log file analyses are dis-
cussed regarding the measurement of Internet usage (see, e.g., pitkow
1997; yun et al. 2006; clay/barber/shook 2013):
First, server log analyses are inherently limited to the websites the
respective researchers have access to. This is why most of the previously
mentioned studies are also case studies for single websites or servers. Of
course, it is possible to collect log data from various web servers and then
compare it, but in any case the findings of such server-sided log file analy-
sis are very limited and strongly dependent on the selection of the servers
examined. Representative studies on Internet usage are therefore only
theoretically conceivable but practically impossible as it is unimaginable
to gain access to all existing web servers.
4 In the context of information retrieval systems, log files are called ›transaction logs‹ or ›query
logs‹. The use of these specific terms highlights which aspects of the server (or database) logs
are important to the researcher: Whereas the analysis of transaction logs investigates the
user behavior during the search engine use, the analysis of query logs provides information
on the concrete search terms used (for more background see, e.g., peters 1993; kurth 1993;
borgman/hirsh/hiller 1996; jansen 2006; agosti/crivellari/di nunzio 2012).
Secondly, the validity of server log data as digital traces for the usage of
an analyzed website or server is limited. As already mentioned earlier, the
basic challenge of web usage mining in the form of server log file analyses
is the identification of individual users and their usage behavior. Although
a great deal of effort has already been invested to meet this challenge, some
problems still remain: For example, different cache systems (e.g., in web
browsers and proxy servers) store frequently requested files in order to
make the Internet’s data traffic more efficient, which can lead to the prob-
lem that a server does not always receive every client request directed to it. In-
stead, the cache system itself already provides the stored file without re-
laying the request to the server. Another problem arose with the ›Do Not
Track‹ initiative of the u.s. Federal Trade Commission that aims to curb
user tracking (for example via cookies). In response to this initiative, the
W3C is working on a specified »http mechanism for expressing a user’s
preference regarding tracking« (fielding/singer 2017). When more and
more users choose to prevent servers from tracking them, this negatively
affects the validity of server log file analyses. The same applies to the Gen-
eral Data Protection Regulation (gdpr) of the European Union.
Thirdly, server log file analyses are also limited in their range, because
they only allow statements about the actual usage of a website but not
about the individual users – as long as no other measures such as
personalized logins are involved, which in turn carry other problems (see
section 3.2). In these analyses, users usually remain almost completely
anonymous. Therefore, without additional background information about
the individuals behind the usage traces, it is hardly possible to explain
differences in their behaviors.
To sum up, server log file analyses are particularly suitable for case
studies that aim to investigate the demand, outreach, usage, and usability
of individual websites or for comparisons between two or more different
websites. In addition, experimental studies are conceivable in which certain
characteristics of a website are altered and the corresponding usage patterns
are then compared. For these purposes, this method provides some
important strengths: It is cheap and non-reactive, and its measurements
are accurate even over a long time-span. However, its most important
limitations are its low data validity and the impossibility of conducting
representative Internet use studies. Particularly for research that seeks to
investigate the usage behavior of individual users in detail rather than the
dispersed usage of individual websites, server-sided log file analyses are
not helpful. In these cases, the two other variants of the log file analysis
method are much more appropriate.
3.2 Client log file analysis
The client-sided or user-centric log file analysis can be defined as a method
of evaluation and interpretation of log data, which is generated by soft-
ware applications on devices that individuals use for Internet access.5 In
the context of Internet use studies, this data can be taken as traces left by
Internet users in the process of using such applications. As this method is
as ›close‹ to the user as possible, and as the usage is observed immediately
where it happens, the measurement is highly accurate.
Since the web browser has been the most important software applica-
tion to request and display http websites from early on, the first academic
analyses of client-sided log files used log data from web browsers such as
ncsa Mosaic (the client equivalent to the ncsa httpd server) or Netscape
in order to examine patterns of Internet usage on a larger scale (e.g., cat-
ledge/pitkow 1995; crovella/bestavros 1996; tauscher/greenberg
1997; huberman et al. 1998; cockburn/mckenzie 2001). Since then, this
method has been used, for example, to investigate patterns of web navi-
gation and browsing (e.g., obendorf et al. 2007; adar/teevan/dumais
2008; kumar/tomkins 2010), and of news consumption and selective exposure
to political websites (e.g., munson/lee/resnick 2013; flaxman/goel/rao
2016). In this context, one important technical challenge is to make the web
browser logs readable for data analysis software. As mentioned above, web
browsers usually log Internet usage and store it into complex databases
using their own formats. Consequently, researchers cannot fall back on
standardized log file entries as in the case of server log files. In the literature,
this problem is mainly solved in three ways: (1) The web browser log data
is modified and converted into analyzable data, which requires a certain
technical expertise; (2) the history files of web browsers are simply copied,
which considerably reduces the available information; or (3) specialized
tracking software is developed (e.g., web browser add-ons or extensions)
and installed on the participants’ devices, which requires even higher tech-
nical skills or high financial resources (for a discussion of such software
see fourie/bothma 2007).
5 Since log data is not only generated in the context of Internet usage but also by operating sys-
tems and software applications that are not related to the Internet, the literature on log file
analysis is not limited to Internet-related applications. In this text, however, we focus only
on Internet use studies.
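
To illustrate the first two approaches, the following sketch reads a copied
browser history database with Python's standard library; it assumes a
copy of Chrome's ›History‹ file, an SQLite database whose location and
schema differ between browsers and versions:

import sqlite3

# assumed: a local copy of Chrome's ›History‹ file (the browser locks
# the original while running, so it has to be copied first)
connection = sqlite3.connect("History")

# Chrome keeps visited addresses in the ›urls‹ table; its time stamps
# are stored as microseconds since 1601 and would need conversion
query = "SELECT url, visit_count FROM urls ORDER BY visit_count DESC LIMIT 10"
for url, visit_count in connection.execute(query):
    print(visit_count, url)

connection.close()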
Besides such studies of web browser log data, we can find analyses of log
files generated by other Internet-related applications: Kraut and colleagues
(1996), for example, developed an operating system called ›HomeNet‹ to
investigate Internet usage at home on the basis of behavioral log data and
implemented questionnaires; Orthmann (2000) analyzed the chat log files
of Internet Relay Chats (irc) of a client-sided chat program; and Heerwegh
(2003) used client-sided paradata in web surveys and experiments. In all
these cases, researchers have to deal with similar technical problems and
have to find appropriate solutions as in the analyses of web browser log files.
All studies mentioned so far focus on Internet usage of individuals on
desktop computers or laptops. In the past ten years, however, this research
has been extended onto mobile devices such as smartphones. Based on the
assumption that mobile Internet usage fundamentally differs from that
on stationary devices (e.g., chae/kim 2003), different mobile Internet ac-
tivities6 were investigated using client-sided log data: Do and Gatica-Perez
(2010), for example, logged the smartphone app usage of 111 volunteers;
Kobayashi and Boase (2012) tracked data about voice calls, sms messages, and
Gmail activity to compare it with the self-reports of their participants in
additional pop-up questionnaires; Wieland and colleagues (2018) combined
client-sided usage tracking on desktop computers and smartphones, expe-
rience sampling, and daily surveys in order to obtain a broad impression of
the media use of their participants; and Masur (2019) uses a similar method
triangulation to pursue his theory of situational privacy and self-disclosure.
In all these studies, the log data was collected by self-developed software
installed on the participants’ smartphones. According to Bouwman and
colleagues (2013), the wide-ranging possibilities and opportunities of data
collection are offset by a number of unanswered questions and challenges,
for example, concerning privacy, representativeness, and the evaluation of
applied tracking solutions.
6 As smartphones are multi-functional devices, people can use them for purposes other than
using the Internet. Accordingly, we find many studies that are interested in smartphone us-
age independently of mobile Internet (for an overview see, e.g., raento/oulasvirta/eagle
2009; kwok 2009). However, due to the limited space, we cannot go into detail here.
Finally, user-centric analyses of log files from various devices are also
used by commercial market research institutes such as comScore (Media
Metrix), Nielsen (Digital Content Ratings), GfK (Media Measurement),
YouGov (Pulse/Profile), and Kantar (tns, Millward Brown), by software
companies such as nipo and Wakoopa as well as by web analytics services
such as Amazon’s Alexa Internet (for more background see webster/pha-
len/lichty 2014). Although their solutions are primarily focused on the
commercial interests of their customers, their data is increasingly used for
scientific research. For example, such data is used to analyze news consumption
and selective exposure (e.g., tewksbury 2003; gentzkow/shapiro 2011;
kalogeropoulos/fletcher/nielsen 2018), online search for information
and products (e.g., johnson et al. 2004; moe/fader 2004), browsing be-
havior across different platforms and websites (e.g., lee/zufryden/drèze
2003; goel/hofman/sirer 2012), audience fragmentation (e.g., webster/
ksiazek 2012; mahrt 2019), and the accuracy and biases of self-reports in
Internet use studies (e.g., scharkow 2016; araujo et al. 2017). The advan-
tages of using – which in most cases corresponds to buying – such data
are the technological expertise and financial capability of such companies
to conduct client-sided log file analyses on a very large scale. However, the
tracking procedures lack a certain transparency, even for the academic re-
searchers themselves, which calls their reliability into question. Additionally, as
the datasets are mostly aggregated on the level of media outlets, they are
limited to certain analyses. In his critical reflection on commercial audi-
ence measurement data, Taneja (2016) concludes: »In interpreting findings
from commercial audience measurement data, we must be mindful of their
design biases. However, despite their ›flaws‹, these are the most reliable
measures of exposure available today« (pp. 177-178).
Limitations
According to the literature (see also pitkow 1997; yun et al. 2006; clay/barber/
shook 2013), client log file analyses come with three major limitations in
relation to the measurement of Internet usage:
First, the method requires high technical expertise and considerable
financial resources. Only a few researchers are capable of developing and
maintaining a suitable tracking software (or system) for their studies (which
is applicable to different devices, operating systems, and browsers), and
even in those cases the recruiting process for volunteering participants
and incentives for their efforts are highly costly. This is why many research-
ers fall back on the data of commercial market research institutes, which
comes with certain limitations for the reliability of the corresponding use
studies. A possible solution to these challenges could be an academically
organized tracking panel that brings together interdisciplinary expertise
and addresses the specific needs and requirements of academic Internet
use studies.
Secondly, client log file analyses are highly invasive for the research
participants. As such studies can only be conducted with volunteers, who
have to consent to their observation and also take part in installing
the software (and maybe in delivering the data as well), there is a high
potential for reactivity and artificial behavior, which can limit the validity
of the observations. This problem can be addressed by a detailed screening
of how the observed usage of each participant evolves over time. More re-
search is needed on these issues. Additionally, challenges of privacy issues,
research ethics, and data security remain and have to be addressed as well.
Thirdly, the generalizability of the results of client-sided tracking studies strongly
depends on their actual sample of participants. Besides the common prob-
lem of possible biases through voluntary participation, another relevant
limitation for representative results lies in the fact that not all devices
allow usage tracking to the same extent. The closed operating systems
of many smartphones would particularly require innovative solutions of
client-sided usage analysis.
To sum up, client log file analyses are very suitable for studies that aim
to investigate usage behavior on an individual level. For this purpose, they
provide some important strengths: The measurement accuracy is high
even over a long time-span, the data is comparably valid (at least after a
certain time-span when the participants are not aware of the observation
anymore), and there is even a chance to conduct studies with a representa-
tive sample if the method is combined with a corresponding recruitment
process and possibly some panel maintenance. On the other hand, this method
is highly invasive and costly. As mentioned before, these challenges can
only be satisfactorily addressed by a professionally organized tracking
panel. As long as the scientific community is not capable of organizing
such a panel, researchers will have to fall back on the solutions and data
of commercial organizations.
3.3 Proxy log file analysis
The analysis of ›web proxy logs‹ (fei et al. 2006) is a hybrid of the two other
approaches and therefore can be helpful to address some of their limita-
tions. Generally speaking, the proxy log file analysis can be defined as the
method of evaluation and interpretation of log data that is generated by
proxy servers intermediating between clients and web servers. Due to
its dual role in the client-server model of the Internet (proxy servers are
servers to clients and clients to servers), a proxy can either be seen from
a user-centric perspective as an extension of clients (or ›second level cli-
ent‹) or from a server-centric perspective as a hub or gateway that collects
and relays client requests. In the context of Internet use studies, however,
proxy log data can be considered in both cases as traces of Internet usage
by clients who redirect their Internet trac through the given proxy server.
Early proxy log analyses were not interested in examining Internet
usage but rather in describing and evaluating the technical performance
of proxy servers (for an overview see pitkow 1999). The first study focus-
ing on questions of Internet usage was published in 2002, following the
server-centric approach: Berker (2002) analyzed the proxy server log files of
Frankfurt University’s computer center on an aggregated level in order to
examine which type of content its students and employees request online.
Ten years later, Taghavi and colleagues (2012) conducted a similar study to
investigate patterns of search behavior.7 In the meantime, Atterer and col-
leagues (2006) used proxy servers following a user-centric approach, making
the proxy server add JavaScript code to every html web page it relayed
to clients in order to track their individual activities (mouse movements,
scrolling, clicks, text input) on those web pages. The users (behind the cli-
ents) were recruited in advance and participated in this study voluntarily.
Weinreich and colleagues (2006, 2008) used a similar approach but addi-
tionally combined it with a client-sided tracking tool to gain even deeper
insights into web navigation patterns.
7 Besides these two studies, we found server-centric log data analyses of other intermediating
systems such as local university networks (meiss et al. 2009), a Wi-Fi network of a shopping
mall (ren et al. 2014), or a telecommunications services provider (lee/kim/kim 2005). Although
these studies are not proxy log file analyses in a strict sense, the methodological approaches
are nevertheless comparable.
In comparison to client log file analyses (section 3.2), client-centric
proxy log file studies are less invasive, possibly making it easier to recruit
study participants: Volunteer users only have to change a few settings of
their Internet connection to redirect their Internet traffic through the
proxy. Additionally, proxy logging works independently from the speci-
fic web browsers of the clients, which is why the costs for setting it up are
much lower than those for developing tracking software that runs on all
devices. Moreover, proxy logs often have a format similar to server log
files (see, e.g., fei et al. 2006: 250), which makes them easy to analyze.
Finally, the log data is immediately collected in one place (the proxy ser-
ver itself), which makes it more accessible for researchers (with no need
for subsequent collection of data from the participants’ computers) and
at the same time allows easier protection from manipulation or unautho-
rized external access.
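
To illustrate the principle, the following minimal Python sketch imple-
ments a logging forward proxy for unencrypted http traffic. It is a simpli-
fied assumption of such a setup rather than a production tool; in particu-
lar, it ignores encrypted connections, a limitation discussed below:

import datetime
import http.server
import socketserver
import urllib.request

class LoggingProxy(http.server.BaseHTTPRequestHandler):
    """Relays plain http GET requests and writes a log entry for each."""

    def do_GET(self):
        # a client configured to use a proxy sends the full target
        # address in the request line, so self.path is the complete url
        with open("proxy.log", "a") as log:
            log.write('%s [%s] "GET %s"\n' % (
                self.client_address[0], datetime.datetime.now(), self.path))
        try:
            with urllib.request.urlopen(self.path) as upstream:
                body = upstream.read()
                status = upstream.status
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except Exception as error:
            self.send_error(502, str(error))

# participants would enter localhost:8080 as the proxy in their settings
with socketserver.ThreadingTCPServer(("", 8080), LoggingProxy) as server:
    server.serve_forever()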
In comparison to server log file studies (section 3.1), server-centric proxy
log file analyses allow a broader perspective on Internet usage as they are
not limited to only one web server but theoretically consider all websites.
Although in both cases the log data is analyzed on an aggregated level,
in proxy log studies researchers have more knowledge about the clients behind their data
(as they know, for example, where the proxy server is located, who has ac-
cess to it, and for which purposes it can be used). To a certain extent, this
allows comparative designs, for example, proxy log file analyses of com-
parable proxy servers in different places. At the same time, server-centric
proxy log file analyses carry all advantages of server log file analyses, i.e.,
they are non-reactive, cheap, and precise.
Limitations
Both the client-centric and the server-centric approach of proxy log file
analyses come with two major limitations, mainly in terms of data validity:
First, many proxy servers only capture the data traffic transmitted via
the Hypertext Transfer Protocol (http). Although http is one of the most
important protocols for data transfers on the Internet, it only covers specific
Internet activities. Accordingly, proxy log file analyses do not capture the
full breadth of Internet use. For example, encrypted communication via
the Hypertext Transfer Protocol Secure (https) cannot be logged without
additional measures. Thus, any Internet use in secure areas, such as online
banking or the use of social networks, remains invisible. This becomes
even more relevant, as websites are increasingly encrypted – particularly
because of gdpr regulations.
Secondly, mobile Internet use (e.g., on smartphones) is difficult to track
with proxy servers. The reason for this lies less with the proxy serv-
ers themselves and more with the mobile devices and their technical in-
frastructure. Some of these devices do not offer the possibility to redirect
Internet use via a proxy server, which means that another important area
of Internet use remains out of sight.
To sum up, proxy log file analyses are suitable for studies that aim
to investigate usage behavior of individuals (client-centric) or of certain
groups and institutions (server-centric). In the latter case, it is question-
able, however, whether such studies are (still) justifiable ethically and/or in
terms of data protection law. Nevertheless, the method provides important
strengths: Similar to both other approaches, the measurement accuracy is
high even over a long time-span, the log data has the potential to be highly
valid, and there is a possibility to conduct studies with a representative
sample (in the same way as client log file analyses). However, it is still
invasive and costly to a certain extent, and without additional measures
(which then introduce their own problems) it is limited to certain online
content and certain devices as well.
4. Conclusion
The aim of this article was to introduce the basic and most important as-
pects of log file analyses for the purpose of Internet use studies and to dis-
cuss and evaluate the three variants of this method – server, client, and
proxy log file analysis – concerning measurement accuracy, data validity,
representativeness, reactivity, and costs. Additionally, we provided short
overviews of the main limitations of these three variants. Table 3 gives an
overview of the main results of this discussion and compares the strengths
and weaknesses of the three methods.
To conclude, it can be said that log file analyses on all three levels can
be helpful for Internet use studies. Depending on the research interest,
the approaches introduced in this article come with certain strengths and
weaknesses. Server log file analyses are suitable for studies that focus on
specific websites and the usage behavior of their audiences, whereas client
and proxy log file analyses are more suitable for user-centric studies that
are interested in the usage behavior of (known) individuals.
Against the background of the limitations presented and discussed
in each case, innovative solutions and method tests for investigating and
addressing these limitations are needed. Since the majority of research
in this field is to a certain extent still opaque for many colleagues – and
even some researchers themselves (if the data was collected by a third
party) – these methods need to be tested and developed for the specific
purposes of academic Internet use studies. On the one hand, this could
be achieved with method experiments and methodological triangulation,
which we can already find in the literature. On the other hand, there is (still)
high potential particularly for more qualitative research in this field: Al-
most all presented studies in this article were based on standardized analy-
ses. However, studies by Orthmann (2000) and by Ørmen and Thorhauge
(2015) show that qualitative research, too, can benefit from this method of
automated measurement of Internet usage – and vice versa.
table 3
Comparison of log file analysis methods for Internet usage studies

Accuracy: server-level high; client-level high; proxy-level high
Validity: server-level low; client-level high; proxy-level medium
(server-centric) or high (client-centric)
Representativeness: server-level limited to accessible websites/servers;
client-level limited to voluntary participants; proxy-level limited to
accessible proxies or voluntary participants
Invasiveness: server-level low; client-level high; proxy-level low
(server-centric) or medium (client-centric)
Costs: server-level low; client-level high; proxy-level low (server-centric)
or medium (client-centric)
Application: server-level website case studies and experiments; client-level
usage studies on an individual level; proxy-level usage studies on an
individual or aggregate level
References
adar, e.; teevan, j.; dumais, s. t.: Large Scale Analysis of Web
Revisitation Patterns. In: Proceedings of the sigchi Conference
on Human Factors in Computing Systems, 2008, pp. 1197-1206.
doi: 10.1145/1357054.1357241
agosti, m.; crivellari, f.; di nunzio, g.: Web log analysis: a review
of a decade of studies about information acquisition, inspection
and interpretation of user interaction. In: Data Mining and Knowledge
Discovery, 24(3), 2012, pp. 663-696. doi: 10.1007/s10618-011-0228-8
aikat, d. d.: News on the Web: Usage Trends of an On-Line Newspaper. In:
Convergence, 4(4), 1998, pp. 94-110. doi: 10.1177/135485659800400409
ansolabehere, s.; iyengar, s.: Messages forgotten: Misreporting in surveys
and the bias toward minimal effects. 1995. http://pcl.stanford.edu/
common/docs/research/ansolabehere/1995/messages.pdf
apache: Log files. Apache http Server Version 2.5 Documentation. 2018.
https://httpd.apache.org/docs/trunk/en/logs.html
araujo, t.; wonneberger, a.; neijens, p.; de vreese, c.: How
Much Time Do You Spend Online? Understanding and Improving
the Accuracy of Self-Reported Measures of Internet Use. In:
Communication Methods and Measures, 11(3), 2017, pp. 173-190.
doi: 10.1080/19312458.2017.1317337
atterer, r.; wnuk, m.; schmidt, a.: Knowing the user’s every move:
user activity tracking for website usability evaluation and implicit
interaction. In: Proceedings of the 15th international conference on World
Wide Web, 2006, pp. 203-212. doi: 10.1145/1135777.1135811
barth, a.: rfc 6265 – http State Management Mechanism. Internet
Engineering Task Force. 2011. https://tools.ietf.org/html/rfc6265
benbunan-fich, r.: Using protocol analysis to evaluate the usability
of a commercial web site. In: Information and Management, 39(2), 2001,
pp. 151-163. doi: 10.1016/S0378-7206(01)00085-4
benevenuto, f.; rodrigues, t.; cha, m.; almeida, v.: Characterizing
user behavior in online social networks. In: Proceedings of the 9th acm
sigcomm conference on Internet measurement conference, 2009, pp. 49-62.
doi: 10.1145/1644893.1644900
berg, m.; düvel, c.: Qualitative media diaries: an instrument for
doing research from a mobile media ethnographic perspective. In:
Interactions: Studies in Communication & Culture, 3(1), 2012, pp. 71-89.
doi: 10.1386/iscc.3.1.71_1
berker, t.: World Wide Web Use at a German University – Computers,
Sex, and Imported Names. Results of a Logfile Analysis. In: batinic,
b.; reips, u.-d.; bosnjak, m. (Eds.): Online Social Sciences. Göttingen
[Hogrefe & Huber Publishers] 2002, pp. 365-381
berners-lee, t.; cailliau, r.; luotonen, a.; nielsen, h. f.; secret,
a.: The World-Wide Web. In: Communications of the acm, 37(8), 1994,
pp. 76-82. doi: 10.1145/179606.179671
berners-lee, t.; hall, w.; hendler, j.; shadbolt, n.; weitzner, d. j.:
Creating a Science of the Web. In: Science, 313(5788), 2006, pp. 769-771.
doi: 10.1126/science.1126902
berson, a.: Client/Server Architecture. New York, ny [McGraw-Hill] 1992
bertot, j. c.; mcclure, c. r.; moen, w. e.; rubin, j.: Web usage
statistics: Measurement issues and analytical techniques. In:
Government Information Quarterly, 14(4), 1997, pp. 373-395. doi: 10.1016/
S0740-624X(97)90034-4
borgman, c. l.; hirsh, s. g.; hiller, j.: Rethinking online monitoring
methods for information retrieval systems: From search product to
search process. In: Journal of the American Society for Information Science,
47(7), 1996, pp. 568-583
bouwman, h.; de reuver, m.; heerschap, n.; verkasalo, h.:
Opportunities and problems with automated data collection via
smartphones. In: Mobile Media & Communication, 1(1), 2013, pp. 63-68.
doi: 10.1177/2050157912464492
bulkow, k.; urban, j.; schweiger, w.: The Duality of Agenda-Setting:
The Role of Information Processing. In: International Journal of Public
Opinion Research, 25(1), 2013, pp. 43-63. doi: 10.1093/ijpor/eds003
burke, m.; kraut, r. e.: The Relationship Between Facebook Use and
Well-Being Depends on Communication Type and Tie Strength. In:
Journal of Computer-Mediated Communication, 21(4), 2016, pp. 265-281.
doi: 10.1111/jcc4.12162
burton, m. c.; walther, j. b.: The Value of Web Log Data in Use-Based
Design and Testing. In: Journal of Computer-Mediated Communication,
6(3), 2001. doi: 10.1111/j.1083-6101.2001.tb00121.x
catledge, l. d.; pitkow, j. e.: Characterizing browsing behaviors on
the world-wide web. In: Computer Networks and isdn Systems, 27(6),
1995, pp. 1065-1073. doi: 10.1016/0169-7552(95)00043-7
chae, m.; kim, j.: What’s So Different About the Mobile
Internet? In: Communications of the acm, 46(12), 2003, pp. 240-247.
doi: 10.1145/953460.953506
chatterjee, p.; hoffman, d. l.; novak, t. p.: Modeling the Clickstream:
Implications for Web-Based Advertising Efforts. In: Marketing Science,
22(4), 2003, pp. 520-541. doi: 10.1287/mksc.22.4.520.24906
chuvakin, a.; schmidt, k.; phillips, c.: Logging and Log Management.
Waltham, ma [Syngress/Elsevier] 2013
clark, l.; ting, i.-h.; kimble, c.; wright, p.; kudenko, d.: Combining
ethnographic and clickstream data to identify user Web browsing
strategies. In: Information Research, 11(2), 2006, paper 249
clay, r.; barber, j. m.; shook, n. j.: Techniques for Measuring Selective
Exposure: A Critical Review. In: Communication Methods and Measures,
7(3-4), 2013, pp. 147-171. doi: 10.1080/19312458.2013.813925
cockburn, a.; mckenzie, b.: What do web users do? An empirical
analysis of web use. In: International Journal of Human-Computer Studies,
54(5), 2001, pp. 903-922. doi: 10.1006/ijhc.2001.0459
cooley, r.; tan, p.-n.; srivastava, j.: Web Mining: Information and
Pattern Discovery on the World Wide Web. In: Proceedings of the Ninth
ieee International Conference on Tools with Artificial Intelligence, 1997,
pp. 558-567. doi: 10.1109/TAI.1997.632303
cooley, r.; mobasher, b.; srivastava, j.: Data Preparation for Mining
World Wide Web Browsing Patterns. In: Knowledge and Information
Systems, 1(1), 1999, pp. 5-32. doi: 10.1007/BF03325089
costa, c. p.; cunha, i. s.; borges, a.; ramos, c. v.; rocha, m. m.;
almeida, j. m.; ribeiro-neto, b.: Analyzing client interactivity in
streaming media. In: Proceedings of the 13th International Conference on
World Wide Web, 2004, pp. 534-543. doi: 10.1145/988672.988744
crovella, m. e.; bestavros, a.: Self-similarity in World Wide Web
traffic: Evidence and possible causes. In: ieee/acm Transactions on
Networking, 5(6), 1996, pp. 835-846. doi: 10.1109/90.650143
david, p.; seo, m.; german, t.: Demand Characteristics and Biases in
Self-Reports of Media Use Through an Online Diary. In: American
Journal of Media Psychology, 3(1-2), 2010, pp. 54-72
deane, f. p.; podd, j.; henderson, r. d.: Relationship between self-
report and log data estimates of information system usage. In:
Computers in Human Behavior, 14(4), 1998, pp. 621-636. doi: 10.1016/
S0747-5632(98)00027-2
de vreese, c.; neijens, p.: Measuring Media Exposure in a Changing
Communications Environment. In: Communication Methods and
Measures, 10(2-3), 2016, pp. 69-80. doi: 10.1080/19312458.2016.1150441
dirksen, v.; huizing, a.; smit, b.: ›Piling on layers of understanding‹:
the use of connective ethnography for the study of (online)
work practices. In: New Media & Society, 12(7), 2010, pp. 1045-1063.
doi: 10.1177/1461444809341437
do, t.-m.-t.; gatica-perez, d.: By their apps you shall understand
them: mining large-scale patterns of mobile phone usage. In:
Proceedings of the 9th International Conference on Mobile and Ubiquitous
Multimedia, 2010, Article 27. doi: 10.1145/1899475.1899502
fei, b.; eloff, j.; olivier, m.; venter, h.: Analysis of Web Proxy Logs.
In: olivier, m. s.; shenoi, s. (Eds.): Advances in Digital Forensics ii.
New York, ny [Springer] 2006, pp. 247-258
fielding, r. t.; reschke, j. (Eds.): rfc 7231 – Hypertext Transfer Protocol
(HTTP/1.1): Semantics and Content. Internet Engineering Task Force
(ietf) 2014. https://tools.ietf.org/html/rfc7231
fielding, r. t.; singer, d. (Eds.): Tracking Preference Expression
(dnt). W3C Candidate Recommendation, 19 October 2017. https://
www.w3.org/TR/tracking-dnt/
flaxman, s.; goel, s.; rao, j. m.: Filter Bubbles, Echo Chambers, and
Online News Consumption. In: Public Opinion Quarterly, 80(S1), 2016,
pp. 298-320. doi: 10.1093/poq/nfw006
fourie, i.; bothma, t.: Information seeking: an overview of web
tracking and the criteria for tracking software. In: Aslib Proceedings,
59(3), 2007, pp. 264-284. doi: 10.1108/00012530710752052
freelon, d.: On the Interpretation of Digital Trace Data in
Communication and Social Computing Research. In: Journal
of Broadcasting & Electronic Media, 58(1), 2014, pp. 59-75.
doi: 10.1080/08838151.2013.875018
garrett, r. k.: Echo chambers online? Politically motivated
selective exposure among Internet news users. In: Journal of
Computer-Mediated Communication, 14(2), 2009, pp. 265-285.
doi: 10.1111/j.1083-6101.2009.01440.x
gentzkow, m.; shapiro, j. m.: Ideological Segregation Online and
Offline. In: The Quarterly Journal of Economics, 126(4), 2011, pp. 1799-
1839. doi: 10.1093/qje/qjr044
glassman, s.: A caching relay for the World Wide Web. In: Computer
Networks and isdn Systems, 27(2), 1994, pp. 165-173.
doi: 10.1016/0169-7552(94)90130-9
goel, s.; hofman, j. m.; sirer, m. i.: Who Does What on the Web:
A Large-Scale Study of Browsing Behavior. In: Proceedings of the
International aaai Conference on Web and Social Media, 2012, pp. 130-137
google: Google Analytics Help Center: Glossary – Tracking Code. 2018. https://
support.google.com/analytics
goulet, l. s.; hampton, k. n.: The accuracy of self-reports of social network
site use: Comparing survey responses to server logs. Paper presented at the
Annual conference of the International Communication Association,
Phoenix, AZ. 2012
hall, e.: Internet Core Protocols: The Definitive Guide. Beijing et al. [O’Reilly
Media] 2009
hallam-baker, p. m.; behlendorf, b.: Extended Log File Format – W3C
Working Draft WD-logfile-960323. World Wide Web Consortium. 1996.
https://www.w3.org/TR/WD-logfile.html
heerwegh, d.: Explaining Response Latencies and Changing
Answers Using Client-Side Paradata from a Web Survey.
In: Social Science Computer Review, 21(4), 2003, pp. 360-373.
doi: 10.1177/0894439303253985
hendler, j.; shadbolt, n.; hall, w.; berners-lee, t.; weitzner,
d.: Web Science: An Interdisciplinary Approach to Understanding
the Web. In: Communications of the acm, 51(7), 2008, pp. 60-69.
doi: 10.1145/1364782.1364798
hine, c. (Ed.): Virtual Methods: Issues in Social Research on the Internet.
Oxford [Berg] 2005
hooghe, m.; teepe, w.: Party profiles on the web: an analysis of the
logfiles of non-partisan interactive political internet sites in the 2003
and 2004 election campaigns in Belgium. In: New Media & Society, 9(6),
2007, pp. 965-985. doi: 10.1177/1461444807082726
huberman, b. a.; pirolli, p. l. t.; pitkow, j. e.; lukose, r. m.: Strong
Regularities in World Wide Web Surfing. In: Science, 280(5360), 1998,
pp. 95-97. doi: 10.1126/science.280.5360.95
ince, d.: A Dictionary of the Internet (2nd edition). Oxford [Oxford
University Press] 2009
janetzko, d.: Nonreactive Data Collection on the Internet. In:
fielding, n.; lee, r. m.; blank, g. (Eds.): The sage Handbook of Online
Research Methods. Thousand Oaks, ca [sage] 2008, pp. 161-175
jansen, b. j.: Search log analysis: What it is, what’s been done, how to
do it. In: Library & Information Science Research, 28(3), 2006, pp. 407-432.
doi: 10.1016/j.lisr.2006.06.005
jansen, b. j.; spink, a.: How Are We Searching the World Wide Web? A
Comparison of Nine Search Engine Transaction Logs. In: Information
Processing and Management, 42(1), 2006, pp. 248-263. doi: 10.1016/j.
ipm.2004.10.007
jansen, b. j.; spink, a. h.: Log-File Analysis. In: donsbach, w. (Ed.):
The International Encyclopedia of Communication. Blackwell 2008.
doi: 10.1111/b.9781405131995.2008.x
jansen, b. j.; spink, a.; taksa, i. (Eds.): Handbook of Research on Web Log
Analysis. Hershey, pa [igi Global] 2009
johnson, e. j.; moe, w. w.; fader, p. s.; bellman, s.; lohse, g. l.: On
the Depth and Dynamics of Online Search Behavior. In: Management
Science, 50(3), 2004, pp. 299-308. doi: 10.1287/mnsc.1040.0194
jones, s.; cunningham, s. j.; mcnab, r.; boddie, s.: A transaction log
analysis of a digital library. In: International Journal on Digital Libraries,
3(2), 2000, pp. 152-169
kalogeropoulos, a.; fletcher, r.; nielsen, r. k.: News brand
attribution in distributed environments: Do people know where
they get their news? In: New Media & Society, online first, 2018.
doi: 10.1177/1461444818801313
kamvar, m.; kellar, m.; patel, r.; xu, y.: Computers and iPhones and
Mobile Phones, oh my! A logs-based comparison of search users on
different devices. In: Proceedings of the 18th International Conference on
World Wide Web, 2009, pp. 801-810. doi: 10.1145/1526709.1526817
kinder-kurlanda, k.; weller, k.: »I always feel it must be great to
be a hacker!«: The role of interdisciplinary work in social media
research. In: Proceedings of the 2014 acm conference on Web science, 2014,
pp. 91-98. doi: 10.1145/2615569.2615685
kobayashi, t.; boase, j.: No Such Effect? The Implications
of Measurement Error in Self-Report Measures of Mobile
Communication Use. In: Communication Methods and Measures, 6(2),
2012, pp. 126-143. doi: 10.1080/19312458.2012.679243
kraut, r.; scherlis, w.; mukhopadhyay, t.; manning, j.;
kiesler, s.: The HomeNet Field Trial of Residential Internet
Services. In: Communications of the acm, 39(12), 1996, pp. 55-63.
doi: 10.1145/240483.240493
kreuter, f.: Improving Surveys with Paradata: Analytic Use of Process
Information. Hoboken, nj [Wiley] 2013
kumar, r.; tomkins, a.: A Characterization of Online Browsing
Behavior. In: Proceedings of the 19th international conference on World Wide
Web, 2010, pp. 561-570. doi: 10.1145/1772690.1772748
kurth, m.: The limits and limitations of transaction log analysis. In:
Library Hi Tech, 11(2), 1993, pp. 98-104. doi: 10.1108/eb047888
kwan, t. t.; mcgrath, r. e.; reed, d. a.: ncsa’s World Wide Web
server: design and performance. In: Computer, 28(11), 1995, pp. 68-74.
doi: 10.1109/2.471181
kwok, r.: Personal technology: Phoning in data. In: Nature, 458(7241),
2009, pp. 959-961. doi: 10.1038/458959a
lazer, d.; pentland, a.; adamic, l.; aral, s.; barabasi, a. l.; brewer,
d.; christakis, n.; contractor, n.; fowler, j.; gutmann,
m.; jebara, t.; king, g.; macy, m.; roy, d.; van alstyne, m.:
Computational social science. In: Science, 323(5915), 2009, pp. 721-723.
doi: 10.1126/science.1167742
lee, i.; kim, j.; kim, j.: Use Contexts for the Mobile Internet: A
Longitudinal Study Monitoring Actual Use of Mobile Internet
Services. In: International Journal of Human-Computer Interaction, 18(3),
2005, pp. 269-292. doi: 10.1207/s15327590ijhc1803_2
lee, j. h.: Effects of News Deviance and Personal Involvement
on Audience Story Selection: A Web-Tracking Analysis.
In: Journalism & Mass Communication Quarterly, 85(1), 2008, pp. 41-60.
doi: 10.1177/107769900808500104
lee, s.; zufryden, f.; drèze, x.: A Study of Consumer Switching
Behavior across Internet Portal Web Sites. In: International Journal of
Electronic Commerce, 7(3), 2003, pp. 39-63.
leiner, d. j.; scherr, s.; bartsch, a.: Using Open-Source Tools
to Measure Online Selective Exposure in Naturalistic Settings.
In: Communication Methods and Measures, 10(4), 2016, pp. 199-216.
doi: 10.1080/19312458.2016.1224825
liu, b.: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data.
Berlin [Springer] 2011
luotonen, a.: Logging Control In W3C httpd. World Wide Web Consortium.
1994. https://www.w3.org/Daemon/User/Config/Logging.
html#common-logfile-format
luotonen, a.; altis, k.: World-Wide Web proxies. In:
Computer Networks and isdn Systems, 27(2), 1994, pp. 147-154.
doi: 10.1016/0169-7552(94)90128-7
mahrt, m.: Beyond Filter Bubbles and Echo Chambers: The Integrative Potential
of the Internet. Berlin [Digital Communication Research] 2019
mahrt, m.; scharkow, m.: The Value of Big Data in Digital Media
Research. In: Journal of Broadcasting & Electronic Media, 57(1), 2013,
pp. 20-33. doi: 10.1080/08838151.2012.761700
markov, z.; larose, d. t.: Data Mining the Web: Uncovering Patterns in Web
Content, Structure, and Usage. Hoboken, nj [Wiley] 2007
masur, p. k.: Situational Privacy and Self-Disclosure. Communication Processes
in Online Environments. Cham [Springer] 2019
maurer, m.; holbach, t.: Taking Online Search Queries as an
Indicator of the Public Agenda: The Role of Public Uncertainty. In:
Journalism & Mass Communication Quarterly, 93(3), 2016, pp. 572-586.
doi: 10.1177/1077699015610072
meiss, m.; duncan, j.; gonçalves, b.; ramasco, j. j.; menczer, f.:
What’s in a session: tracking individual behavior on the web. In:
Proceedings of the 20th acm conference on Hypertext and hypermedia, 2009,
pp. 173-182. doi: 10.1145/1557914.1557946
mobasher, b.; cooley, r.; srivastava, j.: Automatic personalization
based on Web usage mining. In: Communications of the acm, 43(8),
2000, pp. 142-151. doi: 10.1145/345124.345169
moe, w. w.; fader, p. s.: Dynamic Conversion Behavior at E-Commerce
Sites. In: Management Science, 50(3), 2004, pp. 326-335. doi: 10.1287/
mnsc.1040.0153
montgomery, a. l.; li, s.; srinivasan, k.; liechty, j. c.: Modeling
Online Browsing and Path Analysis Using Clickstream Data.
In: Marketing Science, 23(4), 2004, pp. 579-595. doi: 10.1287/
mksc.1040.0073
munson, s.; lee, s. y.; resnick, p.: Encouraging reading of diverse political
viewpoints with a browser widget. International Conference on Weblogs
and Social Media. 2013. https://dub.washington.edu/djangosite/
media/papers/balancer-icwsm-v4.pdf
ni loideain, n.: eu Law and Mass Internet Metadata Surveillance in the
Post-Snowden Era. In: Media and Communication, 3(2), 2015, pp. 53-62.
doi: 10.17645/mac.v3i2.297
nicholas, d.; huntington, p.; lievesley, n.; wasti, a.: Evaluating
consumer website logs: a case study of The Times/The Sunday Times
website. In: Journal of Information Science, 26(6), 2000, pp. 399-411.
doi: 10.1177/016555150002600603
nicholas, d.; huntington, p.; watkinson, a.: Scholarly journal
usage: the results of deep log analysis. In: Journal of Documentation,
61(2), 2005, pp. 246-280. doi: 10.1108/00220410510585214
obendorf, h.; weinreich, h.; herder, e.; mayer, m.: Web Page
Revisitation Revisited: Implications of a Long-term Click-
stream Study of Browser Usage. In: Proceedings of the sigchi
Conference on Human Factors in Computing Systems, 2007, pp. 597-606.
doi: 10.1145/1240624.1240719
ørmen, j.; thorhauge, a. m.: Smartphone log data in a qualitative
perspective. In: Mobile Media & Communication, 3(3), 2015, pp. 1-17.
doi: 10.1177/2050157914565845
orthmann, c.: Analyzing the Communication in Chat
Rooms – Problems of Data Collection. In: Forum Qualitative Social
Research, 1(3), 2000, Article 36
park, y.-h.; fader, p. s.: Modeling Browsing Behavior at Multiple
Websites. In: Marketing Science, 23(3), 2004, pp. 280-303. doi: 10.1287/
mksc.1040.0050
peters, t. a.: The history and development of transaction log analysis.
In: Library Hi Tech, 11(2), 1993, pp. 41-66. doi: 10.1108/eb047884
pierrakos, d.; paliouras, g.; papatheodorou, c.; spyropoulos, c.
d.: Web Usage Mining as a Tool for Personalization: A Survey. In:
User Modeling and User-Adapted Interaction, 13(4), 2003, pp. 311-372.
doi: 10.1023/A:1026238916441
pitkow, j.: In search of reliable usage data on the www. In: Computer
Networks and isdn Systems, 29(8-13), 1997, pp. 1343-1355. doi: 10.1016/
S0169-7552(97)00021-4
pitkow, j.: Summary of www characterizations. In: World Wide Web,
2(1-2), 1999, pp. 3-13. doi: 10.1023/A:1019284202914
prior, m.: The Immensely Inflated News Audience: Assessing Bias in
Self-Reported News Exposure. In: Public Opinion Quarterly, 73(1), 2009,
pp. 130-143. doi: 10.1093/poq/nfp002
raento, m.; oulasvirta, a.; eagle, n.: Smartphones: An Emerging
Tool for Social Scientists. In: Sociological Methods & Research, 37(3), 2009,
pp. 426-454. doi: 10.1177/0049124108330005
ren, y.; tomko, m.; ong, k.; sanderson, m.: How People Use the Web
in Large Indoor Spaces. In: Proceedings of the 23rd acm International
Conference on Information and Knowledge Management, 2014, pp. 1879-
1882. doi: 10.1145/2661829.2661929
rice, r. e.; borgman, c. l.: The use of computer-monitored data in
information science and communication research. In: Journal of
the American Society for Information Science, 34(4), 1983,
pp. 247-256. doi: 10.1002/asi.4630340404
rogers, r.: Digital Methods. Cambridge [mit Press] 2013
savage, m.; burrows, r.: The coming crisis of empirical sociology. In:
Sociology, 41(5), 2007, pp. 885-899. doi: 10.1177/0038038507080443
scharkow, m.: The Accuracy of Self-Reported Internet Use – A
Validation Study Using Client Log Data. In: Communication Methods
and Measures, 10(1), 2016, pp. 13-27. doi: 10.1080/19312458.2015.1118446
scharkow, m.: The reliability and temporal stability of self-reported media
exposure – a meta-analysis (preprint). 2018. https://osf.io/96mn2/
scharkow, m.; vogelgesang, j.: Measuring the Public Agenda Using
Search Engine Queries. In: International Journal of Public Opinion
Research, 23(1), 2011, pp. 104-113. doi: 10.1093/ijpor/edq048
silverstein, c.; marais, h.; henzinger, m.; moricz, m.: Analysis of
a very large web search engine query log. In: acm sigir Forum, 33(1),
1999, pp. 6-12. doi: 10.1145/331403.331405
sismeiro, c.; bucklin, r. e.: Modeling Purchase Behavior at an
E-Commerce Web Site: A Task-Completion Approach. In: Journal
of Marketing Research, 41(3), 2004, pp. 306-323. doi: 10.1509/
jmkr.41.3.306.35985
spink, a.; wolfram, d.; jansen, b. j.; saracevic, t.: Searching the
web: The public and their queries. In: Journal of the American Society for
Information Science and Technology, 52(3), 2001, pp. 226-234
srivastava, j.; cooley, r.; deshpande, m.; tan, p.-n.: Web usage
mining: discovery and applications of usage patterns from Web
data. In: acm sigkdd Explorations Newsletter, 1(2), 2000, pp. 12-23.
doi: 10.1145/846183.846188
st. johns, m.: rfc 931 – Authentication Server. Network Working Group.
1985. https://tools.ietf.org/html/rfc931
st. johns, m.: rfc 1413 – Identification Protocol. Network Working Group.
1993. https://tools.ietf.org/html/rfc1413
stout, r.: Web Site Stats: Tracking Hits and Analyzing Web Traffic. Berkeley,
ca [Osborne McGraw-Hill] 1997
suneetha, k. r.; krishnamoorthi, r.: Identifying User Behavior
by Analyzing Web Server Access Log File. In: International Journal of
Computer Science and Network Security, 9(4), 2009, pp. 327-332
taghavi, m.; patel, a.; schmidt, n.; wills, c.; tew, y.: An analysis of
web proxy logs with query distribution pattern approach for search
engines. In: Computer Standards & Interfaces, 34(1), 2012, pp. 162-170.
doi: 10.1016/j.csi.2011.07.001
taneja, h.: Using Commercial Audience Measurement Data in
Academic Research. In: Communication Methods and Measures, 10(2-3),
2016, pp. 176-178. doi: 10.1080/19312458.2016.1150971
tauscher, l.; greenberg, s.: How people revisit web pages: empirical
findings and implications for the design of history systems. In:
International Journal of Human-Computer Studies, 47(1), 1997, pp. 97-137.
doi: 10.1006/ijhc.1997.0125
tewksbury, d.: What Do Americans Really Want to Know? Tracking
the Behavior of News Readers on the Internet. In: Journal of
Communication, 53(4), 2003, pp. 694-710. doi: 10.1111/j.1460-2466.2003.
tb02918.x
thatcher, j.: Living on fumes: Digital footprints, data fumes, and
the limitations of spatial big data. In: International Journal of
Communication, 8, 2014, pp. 1765-1783
webster, j. g.; ksiazek, t. b.: The Dynamics of Audience
Fragmentation: Public Attention in an Age of Digital
Media. In: Journal of Communication, 62(1), 2012, pp. 39-56.
doi: 10.1111/j.1460-2466.2011.01616.x
webster, j. g.; phalen, p. f.; lichty, l.: Ratings analysis: Audience
measurement and analytics. New York, ny [Routledge] 2014
weeks, b.; southwell, b.: The Symbiosis of News Coverage and
Aggregate Online Search Behavior: Obama, Rumors, and Presidential
Politics. In: Mass Communication and Society, 13(4), 2010, pp. 341-360.
doi: 10.1080/15205430903470532
weinreich, h.; obendorf, h.; herder, e.; mayer, m.: Off the beaten
tracks: Exploring three aspects of web navigation. In: Proceedings of
the 15th International Conference on World Wide Web, 2006, pp. 133-142.
doi: 10.1145/1135777.1135802
weinreich, h.; obendorf, h.; herder, e.; mayer, m.: Not Quite the
Average: An Empirical Study of Web Use. In: acm Transactions on the
Web, 2(1), 2008, Article 5. doi: 10.1145/1326561.1326566
wieland, m.; in der au, a.-m.; keller, c.; brunk, s.; bettermann,
t.; hagen, l.; schlegel, t.: Online Behavior Tracking in Social
Sciences: Quality Criteria and Technical Implementation. In:
stützer, c. m.; welker, m. (Eds.): Computational Social Science in the
Age of Big Data. Concepts, Methodologies, Tools, and Applications. Cologne
[Herbert von Halem] 2018, pp. 131-160
yu, h.; zheng, d.; zhao, b. y.; zheng, w.: Understanding User Behavior
in Large-Scale Video-on-Demand Systems. In: acm sigops Operating
Systems Review, 40(4), 2006, pp. 333-344. doi: 10.1145/1218063.1217968
yun, g. w.; ford, j.; hawkins, r. p.; pingree, s.; mctavish, f.;
gustafson, d.; berhe, h.: On the validity of client-side vs server-
side web log data analysis. In: Internet Research, 16(5), 2006, pp. 537-
552. doi: 10.1108/10662240610711003