Chapter

Big Data Approaches to the Study of Digital Media

... Already, journalists are equating tweets with public opinion when reporting the news (Anstead and O'Loughlin 2015; Beckers and Harder 2016; Dubois, Gruzd, and Jacobson 2018; McGregor 2019), and data scientists have started to do the same by using the phrase interchangeably with opinion mining and sentiment analysis, often with little thought given to what those indicators actually measure, or how results might align with existing theories (Bail 2014; Poorthuis and Zook 2015; Schroeder and Cowls 2018; Ledford 2020). Given the intoxicating pull of technology, these trends are likely to continue, and even accelerate as we move further into the digital age, so the time is ripe to reconsider an obvious but easily overlooked question: What constitutes public opinion? ...
... Revisiting the classical tradition offers two pathways forward. First, for those who believe that scholarship based on digital trace data has been poorly theorized, the classical tradition provides a strong justification for the work that data scientists do in text mining and sentiment analysis (Bail 2014; Schroeder and Cowls 2018; Ledford 2020). And second, it offers clues to how emerging technologies might be used most effectively in the future. ...
Article
Digital trace data have the potential to offer rich insight into complex behaviors that were once out of reach, but their use has raised vital and unresolved questions about what is—or is not—public opinion. Building on the work of James Bryce, Lindsay Rogers, Herbert Blumer, Paul Lazarsfeld, and more, this essay revisits the discipline’s historical roots and draws parallels between past theory and present practice. Today, scholars treat public opinion as the summation of individual attitudes, weighted equally and expressed anonymously at static points in time through polls, yet prior to the advent of survey research, it was conceived as something intrinsically social and dynamic. In an era dominated by online discussion boards and social media platforms, the insights of this earlier “classical tradition” offer two pathways forward. First, for those who criticize computational social science as poorly theorized, it provides a strong justification for the work that data scientists do in text mining and sentiment analysis. And second, it offers clues for how emerging technologies might be leveraged effectively for the study of public opinion in the future.
Article
In the article, we argue that the advent of data mining techniques and big data in media and communication studies presents problems that involve fundamental methodological questions, requiring us to revisit existing ways in which the link between theory, operationalization and data is explained and justified. We note that the discourse of instrumental optimization that surrounds big data clouds epistemic debates about their appropriate integration in scholarly explanations, and argue that a discussion of these problems can usefully depart from a distinction between the two main types of data mining models (supervised and unsupervised). We argue that both types pose specific challenges and give examples of ways they have been productively overcome. In particular, we argue that while big data approaches have introduced novel opportunities for research, they have fundamentally been incorporated into media and communication studies in ways that comply with existing, prototypical explanatory schemes. Our examples link specific empirical studies to general strategies of scientific explanation, focusing on neo-positivist, critical realist and interpretivist explanations.
Article
Full-text available
Feminist STS has long established that science’s provenance as a male domain continues to define what counts as knowledge and expertise. Wikipedia, arguably one of the most powerful sources of information today, was initially lauded as providing the opportunity to rebuild knowledge institutions by providing greater representation of multiple groups. However, less than ten percent of Wikipedia editors are women. At one level, this imbalance in contributions and therefore content is yet another case of the masculine culture of technoscience. This is an important argument and, in this article, we examine the empirical research that highlights these issues. Our main objective, however, is to extend current accounts by demonstrating that Wikipedia’s infrastructure introduces new and less visible sources of gender disparity. In sum, our aim here is to present a consolidated analysis of the gendering of Wikipedia.
Article
Full-text available
The recent Facebook study about emotional contagion has generated a high-profile debate about the ethical and social issues in Big Data research. These issues are not unprecedented, but the debate highlighted that, in focusing on research ethics and the legal issues about this type of research, an important larger picture is overlooked about the extent to which free will is compatible with the growth of deterministic scientific knowledge, and how Big Data research has become central to this growth of knowledge. After discussing the ‘emotional contagion study’ as an illustration, these larger issues about Big Data and scientific knowledge are addressed by providing definitions of data, Big Data and of how scientific knowledge changes the human-made environment. Against this background, it will be possible to examine why the uses of data-driven analyses of human behaviour in particular have recently experienced rapid growth. The essay then goes on to discuss the distinction between basic scientific research as against applied research, a distinction which, it is argued, is necessary to understand the quite different implications in the context of scientific as opposed to applied research. Further, it is important to recognize that Big Data analyses are both enabled and constrained by the nature of data sources available. Big Data research is bound to become more widespread, and this will require more awareness on the part of data scientists, policymakers and a wider public about its contexts and often unintended consequences.
Article
Full-text available
The recent interest in Big Data has generated a broad range of new academic, corporate, and policy practices along with an evolving debate among its proponents, detractors, and skeptics. While the practices draw on a common set of tools, techniques, and technologies, most contributions to the debate come either from a particular disciplinary perspective or with a focus on a domain-specific issue. A close examination of these contributions reveals a set of common problematics that arise in various guises and in different places. It also demonstrates the need for a critical synthesis of the conceptual and practical dilemmas surrounding Big Data. The purpose of this article is to provide such a synthesis by drawing on relevant writings in the sciences, humanities, policy, and trade literature. In bringing these diverse literatures together, we aim to shed light on the common underlying issues that concern and affect all of these areas. By contextualizing the phenomenon of Big Data within larger socioeconomic developments, we also seek to provide a broader understanding of its drivers, barriers, and challenges. This approach allows us to identify attributes of Big Data that require more attention—autonomy, opacity, generativity, disparity, and futurity—leading to questions and ideas for moving beyond dilemmas.
Article
Full-text available
Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
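The core idea above — fitting a linear model that maps lagged Wikipedia page-view counts to official case counts — can be sketched in a few lines. This is an illustrative reconstruction with synthetic numbers, not the study's actual pipeline or data:

```python
# Illustrative sketch (synthetic numbers, not the study's data): regress
# weekly case counts on the previous week's Wikipedia page views with
# one-variable ordinary least squares, then report the fit as r-squared.

def fit_ols(x, y):
    """Return (slope, intercept) minimizing squared error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(x, y, slope, intercept):
    my = sum(y) / len(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2
                 for xi, yi in zip(x, y))
    return 1 - ss_res / ss_tot

# Synthetic example in which page views lead case counts by one week.
views = [120, 340, 560, 810, 640, 420, 250, 150]
cases = [5, 13, 33, 57, 80, 65, 41, 26]

lagged_views = views[:-1]      # views at week t
observed_cases = cases[1:]     # cases at week t + 1

slope, intercept = fit_ols(lagged_views, observed_cases)
r2 = r_squared(lagged_views, observed_cases, slope, intercept)
print(round(slope, 3), round(r2, 3))
```

The one-week lag stands in for the forecasting horizon the paper tests (up to 28 days); in practice the article selection step and per-language aggregation do most of the work before any regression is fitted.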
Article
Full-text available
Significance: We show, via a massive (N = 689,003) experiment on Facebook, that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. We provide experimental evidence that emotional contagion occurs without direct interaction between people (exposure to a friend expressing an emotion is sufficient), and in the complete absence of nonverbal cues.
Conference Paper
Full-text available
While there has been a substantial amount of research into the editorial and organizational processes within Wikipedia, little is known about how Wikipedia editors (Wikipedians) relate to the online world in general. We attempt to shed light on this issue by using aggregated log data from Yahoo!'s browser toolbar in order to analyze Wikipedians' editing behavior in the context of their online lives beyond Wikipedia. We broadly characterize editors by investigating how their online behavior differs from that of other users; e.g., we find that Wikipedia editors search more, read more news, play more games, and, perhaps surprisingly, are more immersed in popular culture. Then we inspect how editors' general interests relate to the articles to which they contribute; e.g., we confirm the intuition that editors are more familiar with their active domains than average users. Finally, we analyze the data from a temporal perspective; e.g., we demonstrate that a user's interest in the edited topic peaks immediately before the edit. Our results are relevant as they illuminate novel aspects of what has become many Web users' prevalent source of information.
Article
Full-text available
We consider the sampling bias introduced in the study of online networks when collecting data through publicly available APIs (application programming interfaces). We assess differences between three samples of Twitter activity; the empirical context is given by political protests taking place in May 2012. We track online communication around these protests for the period of one month, and reconstruct the network of mentions and re-tweets according to the search and the streaming APIs, and to different filtering parameters. We find that smaller samples do not offer an accurate picture of peripheral activity; we also find that the bias is greater for the network of mentions, partly because of the higher influence of snowballing in identifying relevant nodes. We discuss the implications of this bias for the study of diffusion dynamics and political communication through social media, and advocate the need for more uniform sampling procedures to study online communication.
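The comparison at the heart of this study — how much of the full network a smaller API sample recovers, and whether peripheral nodes are hit hardest — can be illustrated on toy edge lists. The data below are invented for demonstration, not drawn from the Twitter samples in the article:

```python
# Toy illustration of API sampling bias: compare node recall overall
# against recall for peripheral (degree-1) nodes. Edge lists are
# invented; "full" plays the role of the most complete API sample.

full = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
        ("e", "a"), ("f", "b")}
sample = {("a", "b"), ("a", "c"), ("b", "c"), ("e", "a")}

def nodes(edges):
    return {n for e in edges for n in e}

def degree(edges):
    d = {}
    for u, v in edges:
        d[u] = d.get(u, 0) + 1
        d[v] = d.get(v, 0) + 1
    return d

peripheral = {n for n, k in degree(full).items() if k == 1}

node_recall = len(nodes(sample) & nodes(full)) / len(nodes(full))
peri_recall = len(peripheral & nodes(sample)) / len(peripheral)
print(round(node_recall, 2), round(peri_recall, 2))
```

In this toy case peripheral recall falls below overall recall, the same direction of bias the authors report for peripheral protest activity.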
Article
Full-text available
Use of socially generated "big data" to access information about collective states of mind in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be the prediction of society's reaction to a new product in terms of popularity and adoption rate. However, bridging the gap between "real-time monitoring" and "early prediction" remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the movie's corresponding entry in Wikipedia, the well-known online encyclopedia.
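A "minimalistic" pre-release predictor of the kind described can be reduced to a single learned threshold on prior activity. The sketch below uses invented editor counts and a deliberately simple rule (midpoint between class means), purely to make the idea concrete; the study itself fits models on richer activity features:

```python
# Toy pre-release predictor (invented numbers, not the study's model):
# label a movie a likely "hit" when its count of distinct Wikipedia
# editors in the month before release exceeds a threshold learned as
# the midpoint between the class means of past films.

past = [  # (distinct_editors_before_release, was_hit)
    (5, False), (9, False), (40, True),
    (12, False), (65, True), (33, True),
]

hit_mean = sum(e for e, h in past if h) / sum(1 for _, h in past if h)
flop_mean = (sum(e for e, h in past if not h)
             / sum(1 for _, h in past if not h))
threshold = (hit_mean + flop_mean) / 2

def predict_hit(editors):
    return editors > threshold

print(round(threshold, 1), predict_hit(50), predict_hit(8))
```

The point of such a minimal baseline is that any more elaborate model must beat it before its extra complexity is justified.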
Article
Full-text available
The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people. Significant questions emerge. Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used to track protesters and suppress speech? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Given the rise of Big Data as a socio-technical phenomenon, we argue that it is necessary to critically interrogate its assumptions and biases. In this article, we offer six provocations to spark conversations about the issues of Big Data: a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.
Article
Full-text available
We respond to the two comments on our article `The Coming Crisis of Empirical Sociology' from Rosemary Crompton (2008) and Richard Webber (2009) which have been published in Sociology , as well as issues arising from the wider debate generated by our article. We urge sociologists to recognize the gravity of the challenges posed by the proliferation of social data and to become more vociferous in contributing to political debates over method and data.
Article
Full-text available
Opt-in surveys are the most widespread method used to study participation in online communities, but produce biased results in the absence of adjustments for non-response. A 2008 survey conducted by the Wikimedia Foundation and United Nations University at Maastricht is the source of a frequently cited statistic that less than 13% of Wikipedia contributors are female. However, the same study suggested that only 39.9% of Wikipedia readers in the US were female - a finding contradicted by a representative survey of American adults by the Pew Research Center conducted less than two months later. Combining these two datasets through an application and extension of a propensity score estimation technique used to model survey non-response bias, we construct revised estimates, contingent on explicit assumptions, for several of the Wikimedia Foundation and United Nations University at Maastricht claims about Wikipedia editors. We estimate that the proportion of female US adult editors was 27.5% higher than the original study reported (22.7%, versus 17.8%), and that the total proportion of female editors was 26.8% higher (16.1%, versus 12.7%).
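The direction of the correction reported above can be reproduced with a minimal reweighting sketch. The strata, shares, and counts below are invented for illustration; the study's actual adjustment uses a propensity score model fit across many covariates, not two hand-picked strata:

```python
# Minimal post-stratification sketch (invented numbers): reweight an
# opt-in sample so each stratum matches its share in a representative
# reference survey, then recompute the proportion of female respondents.

# stratum -> (share in reference survey,
#             opt-in respondents, female among them)
strata = {
    "heavy_users": (0.30, 800, 120),  # overrepresented in the opt-in sample
    "light_users": (0.70, 200, 60),
}

n_optin = sum(n for _, n, _ in strata.values())

# Unadjusted estimate: pool everyone in the opt-in sample.
raw = sum(f for _, _, f in strata.values()) / n_optin

# Adjusted estimate: weight each stratum's female rate by its
# population share from the reference survey.
weighted = sum(share * (f / n) for share, n, f in strata.values())

print(round(raw, 3), round(weighted, 3))
```

Because the overrepresented stratum has a lower female rate, the weighted estimate comes out above the raw one, mirroring the upward revision the authors report for Wikipedia's female editor share.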
Article
Full-text available
With over 800 million active users, Facebook is changing the way hundreds of millions of people relate to one another and share information. A rapidly growing body of research has accompanied the meteoric rise of Facebook as social scientists assess the impact of Facebook on social life. In addition, researchers have recognized the utility of Facebook as a novel tool to observe behavior in a naturalistic setting, test hypotheses, and recruit participants. However, research on Facebook emanates from a wide variety of disciplines, with results being published in a broad range of journals and conference proceedings, making it difficult to keep track of various findings. And because Facebook is a relatively recent phenomenon, uncertainty still exists about the most effective ways to do Facebook research. To address these issues, the authors conducted a comprehensive literature search, identifying 412 relevant articles, which were sorted into 5 categories: descriptive analysis of users, motivations for using Facebook, identity presentation, the role of Facebook in social interactions, and privacy and information disclosure. The literature review serves as the foundation from which to assess current findings and offer recommendations to the field for future research on Facebook and online social networks more broadly.
Article
Full-text available
Human behaviour is thought to spread through face-to-face social networks, but it is difficult to identify social influence effects in observational studies, and it is unknown whether online social networks operate in the same way. Here we report results from a randomized controlled trial of political mobilization messages delivered to 61 million Facebook users during the 2010 US congressional elections. The results show that the messages directly influenced political self-expression, information seeking and real-world voting behaviour of millions of people. Furthermore, the messages not only influenced the users who received them but also the users' friends, and friends of friends. The effect of social transmission on real-world voting was greater than the direct effect of the messages themselves, and nearly all the transmission occurred between 'close friends' who were more likely to have a face-to-face relationship. These results suggest that strong ties are instrumental for spreading both online and real-world behaviour in human social networks.
Article
Full-text available
In this work we study the dynamical features of editorial wars in Wikipedia (WP). Based on our previously established algorithm, we build up samples of controversial and peaceful articles and analyze the temporal characteristics of the activity in these samples. On short time scales, we show that there is a clear correspondence between conflict and burstiness of activity patterns, and that memory effects play an important role in controversies. On long time scales, we identify three distinct developmental patterns for the overall behavior of the articles. We are able to distinguish cases eventually leading to consensus from those cases where a compromise is far from achievable. Finally, we analyze discussion networks and conclude that edit wars are mainly fought by few editors only.
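The "burstiness of activity patterns" invoked above is commonly quantified with the Goh–Barabási coefficient B = (σ − μ)/(σ + μ) over inter-event times, which runs from −1 (perfectly regular) to +1 (maximally bursty). A sketch with synthetic edit timestamps, not Wikipedia data:

```python
# Burstiness coefficient B = (sigma - mu) / (sigma + mu) over the gaps
# between consecutive events (Goh & Barabasi). Timestamps are synthetic.
from statistics import mean, pstdev

def burstiness(timestamps):
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mu, sigma = mean(gaps), pstdev(gaps)
    return (sigma - mu) / (sigma + mu)

regular = [0, 10, 20, 30, 40, 50]  # evenly spaced edits -> B = -1
bursty = [0, 1, 2, 3, 40, 41]      # a burst, then a long silence -> B > 0

print(round(burstiness(regular), 2), round(burstiness(bursty), 2))
```

Under this measure, the finding that conflict correlates with burstiness means controversial articles show edit gaps far more dispersed than a regular schedule would produce.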
Article
Full-text available
In 2008, a group of researchers publicly released profile data collected from the Facebook accounts of an entire cohort of college students from a US university. While good-faith attempts were made to hide the identity of the institution and protect the privacy of the data subjects, the source of the data was quickly identified, placing the privacy of the students at risk. Using this incident as a case study, this paper articulates a set of ethical concerns that must be addressed before embarking on future research in social networking sites, including the nature of consent, properly identifying and respecting expectations of privacy on social network sites, strategies for data anonymization prior to public release, and the relative expertise of institutional review boards when confronted with research projects based on data gleaned from social media. Keywords: research ethics, social networks, Facebook, privacy, anonymity.
Article
Full-text available
It is not easy to initiate a new language version of Wikipedia. Although anyone can propose a new language version without financial cost, certain Wikipedia policies for establishing a new language version must be followed. Once approved and created, the new language version needs tools to facilitate writing and reading in the new language. Even if a team tackles these technical and linguistic issues, a nascent community then has to develop its own editorial and administrative policies and guidelines, sometimes by translating and ratifying the policies of another language version (usually English). Given that Wikipedia does not impose a universal set of editorial and administrative policies and guidelines, the cultural and political nature of such communities remains open-ended.
Conference Paper
Full-text available
We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system. We examine characteristics and patterns that emerge from the collective dynamics of large numbers of people, rather than the actions and characteristics of individuals. The dataset contains summary properties of 30 billion conversations among 240 million people. From the data, we construct a communication graph with 180 million nodes and 1.3 billion undirected edges, creating the largest social network constructed and analyzed to date. We report on multiple aspects of the dataset and synthesized graph. We find that the graph is well-connected and robust to node removal. We investigate on a planetary-scale the oft-cited report that people are separated by "six degrees of separation" and find that the average path length among Messenger users is 6.6. We find that people tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are both more frequent and of longer duration than conversations with the same gender.
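The 6.6 average path length reported above comes from shortest-path computations over the synthesized graph. The measurement itself is straightforward breadth-first search averaged over reachable pairs; here is a stdlib sketch on a four-node toy graph (the study's graph had 180 million nodes, so it relies on sampling and heavy engineering rather than exhaustive BFS):

```python
# Average shortest-path length via breadth-first search, on a toy
# undirected graph (adjacency lists). Illustrates the computation
# behind "six degrees" measurements, not the study's implementation.
from collections import deque

def bfs_lengths(graph, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def average_path_length(graph):
    total, pairs = 0, 0
    for s in graph:
        for t, d in bfs_lengths(graph, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

# Path A-B-C-D plus a chord A-C.
g = {"A": ["B", "C"], "B": ["A", "C"],
     "C": ["A", "B", "D"], "D": ["C"]}
print(round(average_path_length(g), 3))
```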
Article
Full-text available
In this article, the authors analyze the popular search queries used in Google and Yahoo! over a 24-month period, January 2004-December 2005. They develop and employ a new methodology and metrics to examine and assess the digital divide in information uses, looking at the extent of political searches and their accuracy and variety. The findings indicate that some countries, particularly Germany, Russia, and Ireland, display greater accuracy of search terms, diversity of information uses, and sociopolitical concern. Also, in many English-speaking and Western countries most popular searches were about entertainment, implying a certain gap within these countries between the few who search for economic and political information and the many who do not.
Article
Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others — a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveals very little about the influence of a user.
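The paper's first observation — that high indegree does not guarantee retweet influence — is the kind of claim one checks with a rank correlation between the two measures. A self-contained sketch with invented counts (the actual study uses the full Twitter follower graph):

```python
# Spearman rank correlation between two influence measures, using
# invented per-user counts rather than the Twitter dataset. Assumes
# no tied values, so no tie-averaging is needed.

def ranks(values):
    """Rank values in descending order; 1 = highest."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

indegree = [10_000, 5_000, 800, 400, 90]  # followers per user
retweets = [120, 900, 40, 300, 10]        # retweets received per user

rho = spearman(indegree, retweets)
print(round(rho, 2))
```

A moderate rho, as in this toy example, is exactly the situation the authors describe: the two rankings overlap but the most-followed users are not the most retweeted.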
Book
How Wikipedia collaboration addresses the challenges of openness, consensus, and leadership in a historical pursuit for a universal encyclopedia. Wikipedia, the online encyclopedia, is built by a community—a community of Wikipedians who are expected to “assume good faith” when interacting with one another. In Good Faith Collaboration, Joseph Reagle examines this unique collaborative culture. Wikipedia, says Reagle, is not the first effort to create a freely shared, universal encyclopedia; its early twentieth-century ancestors include Paul Otlet's Universal Repository and H. G. Wells's proposal for a World Brain. Both these projects, like Wikipedia, were fuelled by new technology—which at the time included index cards and microfilm. What distinguishes Wikipedia from these and other more recent ventures is Wikipedia's good-faith collaborative culture, as seen not only in the writing and editing of articles but also in their discussion pages and edit histories. Keeping an open perspective on both knowledge claims and other contributors, Reagle argues, creates an extraordinary collaborative potential. Wikipedia's style of collaborative production has been imitated, analyzed, and satirized. Despite the social unease over its implications for individual autonomy, institutional authority, and the character (and quality) of cultural products, Wikipedia's good-faith collaborative culture has brought us closer than ever to a realization of the century-old pursuit of a universal encyclopedia.
Book
This 2003 book assesses the consequences of new information technologies for American democracy in a way that is theoretical and also historically grounded. The author argues that new technologies have produced the fourth in a series of 'information revolutions' in the US, stretching back to the founding. Each of these, he argues, led to important structural changes in politics. After re-interpreting historical American political development from the perspective of evolving characteristics of information and political communications, the author evaluates effects of the Internet and related new media. The analysis shows that the use of new technologies is contributing to 'post-bureaucratic' political organization and fundamental changes in the structure of political interests. The author's conclusions tie together scholarship on parties, interest groups, bureaucracy, collective action, and political behavior with new theory and evidence about politics in the information age.
Book
An examination of the ways that digital and networked technologies have fundamentally changed research practices in disciplines from astronomy to literary analysis. In Knowledge Machines, Eric Meyer and Ralph Schroeder argue that digital technologies have fundamentally changed research practices in the sciences, social sciences, and humanities. Meyer and Schroeder show that digital tools and data, used collectively and in distributed mode—which they term e-research—have transformed not just the consumption of knowledge but also the production of knowledge. Digital technologies for research are reshaping how knowledge advances in disciplines that range from physics to literary analysis. Meyer and Schroeder map the rise of digital research and offer case studies from many fields, including biomedicine, social science uses of the Web, astronomy, and large-scale textual analysis in the humanities. They consider such topics as the challenges of sharing research data and of big data approaches, disciplinary differences and new forms of interdisciplinary collaboration, the shifting boundaries between researchers and their publics, and the ways that digital tools promote openness in science. This book considers the transformations of research from a number of perspectives, drawing especially on the sociology of science and technology and social informatics. It shows that the use of digital tools and data is not just a technical issue; it affects research practices, collaboration models, publishing choices, and even the kinds of research and research questions scholars choose to pursue. Knowledge Machines examines the nature and implications of these transformations for scholarly research.
Article
In this article, we examine the growth of the Internet as a research topic across the disciplines and the embedding of the Internet into the very fabric of research. While this is a trend that ‘everyone knows’, prior to this study, no work had quantified the extent to which this common sense knowledge was true or how the embedding actually took place. Using scientometric data extracted from Scopus, we explore how the Internet has become a powerful knowledge machine which forms part of the scientific infrastructure across not just technology fields, but also right across the social sciences, sciences and humanities.
Chapter
Statistics assumed its recognizably modern disciplinary form during the period from about 1890 to 1930. These dates are comparable to those for the formation of disciplines in the leading fields of social science. Statistics, however, changed during this period from an empirical science of society, as it had been during the nineteenth century, into a mathematical and methodological field. Although it disappeared as a social science per se, as an area of applied mathematics it became an important source of tools, concepts, and research strategies throughout the social sciences. It also provided legitimacy for, and contributed to a redefinition of what would count as, social knowledge. In its nineteenth-century incarnation, as itself a social science, statistics was guided by a different set of ideals - not academic detachment, but active involvement in administration and social reform. The social science of statistics was practically indistinguishable from government collection of numbers about population, health, crime, commerce, poverty, and labor. Even its most self-consciously scientific advocates, such as the prominent Belgian astronomer and statistician Adolphe Quetelet (1796-1874), often had administrative responsibility for the organization of official statistics. This alliance of scientific and bureaucratic statistics did not disappear abruptly. But it was gradually subordinated to a new order in which statisticians assumed consulting roles, offering their expertise to statistical agencies but also to many others. At the end of the nineteenth century, it still appeared possible that statistics might succeed in the universities as a quantitative social science. Instead, it was recreated as a mathematical field.
Article
The media environment is changing. Today in the United States, the average viewer can choose from hundreds of channels, including several twenty-four hour news channels. News is on cell phones, on iPods, and online; it has become a ubiquitous and unavoidable reality in modern society. The purpose of this book is to examine systematically, how these differences in access and form of media affect political behaviour. Using experiments and new survey data, it shows how changes in the media environment reverberate through the political system, affecting news exposure, political learning, turnout, and voting behavior.
Article
This book offers a probing account of the erosion of privacy in American society, which shows that we are often unwitting, if willing, accomplices, providing personal data in exchange for security or convenience. The book reveals that in today's "information society," the personal data that we make available to virtually any organization for virtually any purpose is apt to surface elsewhere, applied to utterly different purposes. The mass collection and processing of personal information produces such tremendous efficiencies that both the public and private sector feel justified in pushing as far as they can into our private lives. There is no easy cure. Indeed, there are many cases where privacy invasion is both hurtful to the individual and indispensable to an organization's quest for efficiency. As long as we willingly accept the pursuit of profit, or the reduction of crime, or cutting government costs as sufficient reason for intensified scrutiny over our lives, then privacy will remain endangered.
Book
This book studies the rise of social media in the first decade of the twenty-first century, up until 2012. It provides both a historical and a critical analysis of the emergence of networking services in the context of a changing ecosystem of connective media. Such history is needed to understand how the intricate constellation of platforms profoundly affects our experience of online sociality. In a short period of time, services like Facebook, YouTube and many others have come to deeply penetrate our daily habits of communication and creative production. While most sites started out as amateur-driven community platforms, half a decade later they have turned into large corporations that do not just facilitate user connectedness, but have become global information and data mining companies extracting and exploiting user connectivity. Offering a dual analytical prism to examine techno-cultural as well as socio-economic aspects of social media, the author dissects five major platforms: Facebook, Twitter, Flickr, YouTube, and Wikipedia. Each of these microsystems occupies a distinct position in the larger ecosystem of connective media, and yet, their underlying mechanisms for coding interfaces, steering users, filtering content, governance and business models rely on shared ideological principles. Reconstructing the premises on which these platforms are built, this study highlights how norms for online interaction and communication gradually changed. "Sharing," "friending," "liking," "following," "trending," and "favoriting" have come to denote online practices imbued with specific technological and economic meanings. This process of normalization is part of a larger political and ideological battle over information control in an online world where everything is bound to become "social."
Article
During the course of several natural disasters in recent years, Twitter has been found to play an important role as an additional medium for many-to-many crisis communication. Emergency services are successfully using Twitter to inform the public about current developments, and are increasingly also attempting to source first-hand situational information from Twitter feeds (such as relevant hashtags). The further study of the uses of Twitter during natural disasters relies, however, on the development of flexible and reliable research infrastructure for tracking and analysing Twitter feeds at scale and in close to real time. This article outlines two approaches to the development of such infrastructure: one which builds on the readily available open source platform yourTwapperkeeper to provide a low-cost, simple, and basic solution; and one which establishes a more powerful and flexible framework by drawing on highly scalable, state-of-the-art technology.
Article
This paper examines research about Wikipedia that has been undertaken using big data approaches. The aim is to gauge the coherence, as against the disparateness, of studies from different disciplines, how these studies relate to each other, and how they relate to research about Wikipedia and new social media in general. The paper is partly based on interviews with big data researchers, and discusses a number of themes and implications of Wikipedia research, including the workings of online collaboration, the way that contributions mirror (or do not mirror) aspects of real-world geographies, and how contributions can be used to predict offline social and economic trends. Among the findings is that in some areas of research, studies build on and extend each other's results. However, most of the studies stay within disciplinary silos and could be better integrated with other research on Wikipedia and with research about new media. Wikipedia is among the few sources in big data research where the data are openly available, unlike many studies where data are proprietary. Thus, it has lent itself to a burgeoning and promising body of research. The paper concludes that in order to fulfil this promise, this research must pay more attention to theories and research from other disciplines, and also go beyond questions based narrowly on the availability of data and towards a more powerful analytical grasp of the phenomenon being investigated.
Article
Online interaction is now a regular part of daily life for a demographically diverse population of hundreds of millions of people worldwide. These interactions generate fine-grained time-stamped records of human behavior and social interaction at the level of individual events, yet are global in scale, allowing researchers to address fundamental questions about social identity, status, conflict, cooperation, collective action, and diffusion, both by using observational data and by conducting in vivo field experiments. This unprecedented opportunity comes with a number of methodological challenges, including generalizing observations to the offline world, protecting individual privacy, and solving the logistical challenges posed by "big data" and web-based experiments. We review current advances in online social research and critically assess the theoretical and methodological opportunities and limitations. [J]ust as the invention of the telescope revolutionized the study of the heavens, so too by rendering the unmeasurable measurable, the technological revolution in mobile, Web, and Internet communications has the potential to revolutionize our understanding of ourselves and how we interact…. [T]hree hundred years after Alexander Pope argued that the proper study of mankind should lie not in the heavens but in ourselves, we have finally found our telescope. Let the revolution begin. —Duncan Watts (2011, p. 266)
Article
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.
Article
Wikipedia has become one of the ten most visited sites on the Web, and the world's leading source of Web reference information. Its rapid success has inspired hundreds of scholars from various disciplines to study its content, communication and community dynamics from various perspectives. This article presents a systematic review of scholarly research on Wikipedia. We describe our detailed, rigorous methodology for identifying over 450 scholarly studies of Wikipedia. We present the WikiLit website (http://wikilit.referata.com), where most of the papers reviewed here are described in detail. In the major section of this article, we then categorize and summarize the studies. An appendix features an extensive list of resources useful for Wikipedia researchers.
Article
Analytical table of contents:
Preface
Introduction: Rationality
Part I. Representing:
1. What is scientific realism?
2. Building and causing
3. Positivism
4. Pragmatism
5. Incommensurability
6. Reference
7. Internal realism
8. A surrogate for truth
Part II. Intervening:
9. Experiment
10. Observation
11. Microscopes
12. Speculation, calculation, models, approximations
13. The creation of phenomena
14. Measurement
15. Baconian topics
16. Experimentation and scientific realism
Further reading
Index
Article
From e-mails to social networks, the digital traces left by life in the modern world are transforming social science.
Article
The literature on the private provision of public goods suggests an inverse relationship between incentives to contribute and group size. We find, however, that after an exogenous reduction of group size at Chinese Wikipedia, the nonblocked contributors decrease their contributions by 42.8 percent on average. We attribute the cause to social effects: contributors receive social benefits that increase with both the amount of their contributions and group size, and the shrinking group size weakens these social benefits. Consistent with our explanation, we find that the more contributors value social benefits, the more they reduce their contributions after the block. (JEL H41, L17, L82)
Article
A research front of rapid discovery, leaving a trail of cognitive consensus behind it, is characteristic of natural sciences since about the 17th century in Europe. The basis of this high-consensus, rapid-discovery science is not empiricism, since empirical research existed in the natural sciences before the 17th century. The key is appropriation of genealogies of research technologies, which are pragmatically manipulated and modified to produce new phenomena; high consensus results because there is higher social prestige in moving ahead to new research discoveries than by continuing to dispute the interpretation of older discoveries. The social sciences have not acquired this pattern of rapid discovery with high consensus behind the research front. Their fundamental disability is not lack of empirical research, nor failure to adhere to a scientific epistemology, nor the greater ideological controversy that surrounds social topics. What is fundamentally lacking in the social sciences is a genealogy of research technology, whose manipulation reliably produces new phenomena and a rapidly moving research front. Unless the social sciences invent new research hardware, they will likely never acquire much consensus or rapid discovery. Possibilities may exist for such development stemming from research technologies in microsociology and in artificial intelligence.
Article
Scholars have long recognized the potential of Internet-based communication technologies for improving network research—potential that, to date, remains largely underexploited. In the first half of this paper, we introduce a new public dataset based on manipulations and embellishments of a popular social network site, Facebook.com. We emphasize five distinctive features of this dataset and highlight its advantages and limitations vis-à-vis other kinds of network data. In the second half of this paper, we present descriptive findings from our first wave of data. Subgroups defined by gender, race/ethnicity, and socioeconomic status are characterized by distinct network behaviors, and students sharing social relationships as well as demographic traits tend to share a significant number of cultural preferences. These findings exemplify the scientific and pedagogical potential of this new network resource and provide a starting point for future analyses.
Article
This paper reports on a transaction log analysis of the type and topic of search queries entered into the search engine Google (Australia). Two aspects, in particular, set this apart from previous studies: the sampling and analysis take account of the distribution of search queries, and lifestyle information about the searcher was matched with each search query. A surprising finding was that there was no observed statistically significant difference in search type or topics for different segments of the online population. It was found that queries about popular culture and e-commerce accounted for almost half of all search engine queries and that half of the queries were entered with a particular Website in mind. The findings of this study also suggest that the Internet search engine is not only an interface to information or a shortcut to Websites; it is equally a site of leisure. This study has implications for the design and evaluation of search engines as well as for our understanding of search engine use.