
Abstract

The term big data encompasses a wide range of approaches to collecting and analyzing data in ways that were not possible before the era of modern personal computing. One approach to big data of great potential to psychologists is web scraping, which involves the automated collection of information from webpages. Although web scraping can create massive datasets with tens of thousands of variables, it can also be used to create modestly sized, more manageable datasets with tens of variables but hundreds of thousands of cases, well within the skillset of most psychologists to analyze, in a matter of hours. In this article, we demystify web scraping methods as currently used to examine research questions of interest to psychologists. First, we introduce an approach called theory-driven web scraping in which the choice to use web-based big data must follow substantive theory. Second, we introduce data source theories, a term used to describe the assumptions a researcher must make about a prospective big data source in order to meaningfully scrape data from it. Critically, researchers must derive specific hypotheses to be tested based upon their data source theory, and if these hypotheses are not empirically supported, plans to use that data source should be changed or eliminated. Third, we provide a case study and sample code in Python demonstrating how web scraping can be conducted to collect big data along with links to a web tutorial designed for psychologists. Fourth, we describe a 4-step process to be followed in web scraping projects. Fifth and finally, we discuss legal, practical, and ethical concerns faced when conducting web scraping projects.
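To make the idea of turning webpages into cases and variables concrete, the following is a minimal sketch and not the article's own sample code. It uses only Python's standard-library `html.parser`; the HTML fragment, the class names `post`, `author`, and `body`, and the helper `scrape_posts` are all hypothetical, and a real project would download pages from the target site instead of parsing an inline string.

```python
from html.parser import HTMLParser

# Hypothetical page fragment (assumption); a real project would fetch HTML from a site.
SAMPLE_HTML = """
<div class="post"><span class="author">user_a</span><p class="body">First post text.</p></div>
<div class="post"><span class="author">user_b</span><p class="body">Second post text.</p></div>
"""


class PostParser(HTMLParser):
    """Collect one {author, body} case per <div class="post"> element."""

    def __init__(self):
        super().__init__()
        self.cases = []
        self._field = None  # name of the variable currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "post":
            # A new post starts a new case (row) in the dataset.
            self.cases.append({"author": "", "body": ""})
        elif css_class in ("author", "body") and self.cases:
            self._field = css_class

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.cases[-1][self._field] += data.strip()


def scrape_posts(html):
    """Return a list of cases (dicts) extracted from the given HTML string."""
    parser = PostParser()
    parser.feed(html)
    return parser.cases


posts = scrape_posts(SAMPLE_HTML)
```

Each dictionary in `posts` corresponds to one case, and each key to one variable, which mirrors the "tens of variables, many cases" datasets described above.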
... The rise of social media use in conjunction with advances in machine learning, text mining and modelling have created novel opportunities for the use of social media "big data" to address research questions in psychology (Conway & O'Connor, 2016; Landers, Brusso, Cavanaugh, & Collmus, 2016). One area that has attracted particular attention concerns the association between mental health problems and social media behaviour, e.g., with respect to the type of content posted on social media sites (Chancellor & De Choudhury, 2020; Merchant et al., 2019). ...
... This can yield important insights into the specific issues and themes that are discussed in a given community. However, due to technological advances, very large data sets can now be automatically retrieved (scraped) from the web (Landers et al., 2016). For such data, manual content analyses are clearly not feasible. ...
... Eysenbach and Till (2001) emphasise the importance of "perceived privacy" in an online community. If, for instance, posts can only be viewed following registration, this suggests that users might perceive their contributions as occurring in a private space (Eysenbach & Till, 2001; Landers et al., 2016). Likewise, if providers put measures in place to restrict data access, researchers should not circumvent these measures (Landers et al., 2016). ...
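One common access restriction of this kind is a site's robots.txt file. As a hedged illustration only, the sketch below checks whether a URL may be fetched before any scraping is attempted, using Python's standard-library `urllib.robotparser`. The robots.txt content, the example URLs, the user agent name `research-bot`, and the helper `may_scrape` are assumptions made for this example; robots.txt is only one of several measures a provider might use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption); a real project would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def may_scrape(robots_txt, url, user_agent="research-bot"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = may_scrape(ROBOTS_TXT, "https://forum.example/public/thread/1")
members_ok = may_scrape(ROBOTS_TXT, "https://forum.example/members/profile/1")
```

Here `members_ok` is `False`, so a scraper following this check would skip the members-only area instead of working around the restriction.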
Preprint
The COVID-19 pandemic and the measures to prevent its spread have had a negative impact on substance use behaviour and posed a special threat for individuals at risk. Problem gambling is a major public health concern, and it is likely that the lockdown and social distancing measures have altered gambling behaviour, for instance shifting from land-based to online gambling. In this study, we used large-scale web scraping to analyse posting behaviour on a major German online gambling forum, gathering a database of more than 200k posts. We examined the relative usage of different subforums, i.e. terrestrial, online gambling and problem gambling sections, posting frequency, and changes in posting behaviour related to the casino closures that were part of the nationwide restrictions in Germany in 2020. There was a marked increase in the number of newly registered users during the first lockdown compared to the weeks prior to the lockdown, which may reflect a shift from terrestrial to online gambling. Further, there was an increase in the number of posts in the online gambling subforum with a concurrent decrease in the number of posts in the terrestrial gambling subforum. An analysis of user types revealed that a substantial number of users who posted in both the online and terrestrial forum contributed at least once to the problem gambling subforum. This subforum contained the longest posts, which were on average twice as long as the average post. Modelling the relationship between reply frequency and latency between initial posts and replies showed that the number of short-latency replies (i.e. replies posted within seven hours after the initial post) was substantially higher during the first lockdown compared to the preceding weeks. The increase during the first lockdown may reflect the general marked increase in screen time and/or usage of online platforms and media after the onset of the global COVID-19 pandemic.
The analyses may help to identify lockdown-related effects on gambling behaviour. These potentially detrimental effects on mental health, including addiction and problem gambling, may require monitoring and special public health measures.
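As a rough illustration of the short-latency measure mentioned in this abstract, the sketch below counts replies posted within seven hours of a thread's initial post. It is a simple count, not the study's model; the data layout, the timestamps, the inclusive seven-hour boundary, and the function name `count_short_latency_replies` are all assumptions made for the example.

```python
from datetime import datetime, timedelta

SHORT_LATENCY = timedelta(hours=7)  # threshold named in the abstract; inclusive here by assumption


def count_short_latency_replies(threads):
    """Count replies posted no later than SHORT_LATENCY after the initial post.

    threads: list of (initial_post_time, [reply_times]) tuples (hypothetical layout).
    """
    count = 0
    for initial_time, reply_times in threads:
        for reply_time in reply_times:
            latency = reply_time - initial_time
            if timedelta(0) <= latency <= SHORT_LATENCY:
                count += 1
    return count


# Made-up timestamps for illustration only.
threads = [
    (datetime(2020, 3, 20, 9, 0),
     [datetime(2020, 3, 20, 10, 30), datetime(2020, 3, 21, 9, 0)]),
    (datetime(2020, 3, 22, 18, 0),
     [datetime(2020, 3, 22, 23, 59)]),
]
short_replies = count_short_latency_replies(threads)
```

Comparing such counts between a lockdown period and the weeks before it would be one simple way to look at the kind of change the abstract reports.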
... Some researchers have begun to study psychology from the perspective of big data [22,23,24]. Table 1 summarizes related work on advanced information technologies in psychology. ...
... Chen et al. [23] provided practical guidance for psychology researchers conducting psychology-related research from the perspective of big data and discussed a general framework for big data processing, including data acquisition, data storage, data processing, data analysis, and data visualization. Landers et al. [24] proposed a new method called theory-driven web scraping, which psychologists can use to collect massive amounts of information from the internet, and also pointed out issues that need attention during data acquisition. ...
Article
Pubertal timing and social adaptability are important topics in research on adolescent mental health education. Traditional research methods mainly classify students by the total or average score on a scale; although this approach is simple and easy to conduct, it cannot support a more detailed analysis of students. In this paper, data mining methods such as association rules and clustering are used to analyze data from pubertal timing and social adaptability scales. The analysis yields novel and meaningful conclusions that cannot be obtained with previous methods, and the results are visualized to enhance readability. Association rule mining was performed on students' basic attribute information, pubertal timing group, and social adaptability level to explore the relationships among them. A fine-grained clustering analysis of social adaptability placed similar students into the same groups, which helps teachers gain a more in-depth, accurate, and detailed understanding of students and yields a better classification than traditional analysis approaches. This work provides effective guidance and a novel perspective on how to use data mining technologies to study pubertal timing and social adaptability.
... Colvin et al., 2016) or by scraping information such as an email or physical address from a website (e.g. Landers et al., 2016). Then, their location is linked to other data sources with the information that interests the researcher, such as SES (Oest et al., 2018). ...
Article
Identifying online users' contexts can help researchers understand their needs. However, the validity of different methods for identifying the location of online users has been underexplored. This paper proposes using multiple methods and examining their impact on different research questions to determine their validity. It then demonstrates this approach using data from six Massive Open Online Courses (MOOCs) by examining whether different methods produce different results regarding the relationship between SES and participation and engagement in MOOCs. We found that the choice of method impacted the estimated SES of the sample; IP geolocation placed participants in lower SES districts in comparison with their self-reported districts. Using all geolocation methods, we found that our MOOCs' learners tended to be located in high-SES districts, but the results were inconclusive regarding the relationship between SES and course engagement. Based on this case study, we suggest that using multiple methods produces more robust findings. However, when methods diverge in findings, researchers should consider which method is most suitable for their specific purposes.
... Fortunately, "big data" methods exist to conduct inquiries on a grander scale. For instance, it is possible to collect ("scrape") the contents of large volumes of tweets and process them for aggregate analysis (Landers et al., 2016). Automated procedures have been developed for summarizing patterns in those tweets (Kouloumpis et al., 2011). ...
Article
When Behavior Analysis in Practice (BAP) was founded 15 years ago, questions were raised about whether a practitioner-focused journal was really needed to complement our field's well-established applied research periodicals. Like research journals, BAP publishes primary research reports for which scholarly citations are one measure of impact. Unlike most research journals, it also was intended to achieve dissemination impact, which implies influence on people who may not conduct research or leave behind citations. Using altmetric data as an objective measure of dissemination impact, we present evidence that BAP is becoming a leader in this domain among applied behavior analysis journals, and thus appears to be accomplishing exactly what it was designed to. We recommend explicitly relying on dissemination impact data to help shape the journal's future development.
... Recently, with the increasing availability of novel data sources to psychologists (e.g., scraped data from websites, data captured from social media; cf. Landers et al., 2016), content analysis has been increasingly adopted to examine research questions that are difficult to address with traditional methods, such as identifying the major themes underlying scraped or captured data. Examples have included psychological themes in employees' work-related discussions on Twitter (van Zoonen et al., 2016), the general public's sentiment toward remote work during the recent COVID-19 pandemic (Zhang et al., 2021), the motivations behind online blog posts (Fullwood et al., 2009), and many others. ...
Article
Content analysis is a common and flexible technique to quantify and make sense of qualitative data in psychological research. However, the practical implementation of content analysis is extremely labor-intensive and subject to human coder errors. Applying natural language processing (NLP) techniques can help address these limitations. We explain and illustrate these techniques to psychological researchers. For this purpose, we first present a study exploring the creation of psychometrically meaningful predictions of human content codes. Using an existing database of human content codes, we build an NLP algorithm to validly predict those codes, at generally acceptable standards. We then conduct a Monte-Carlo simulation to model how four dataset characteristics (i.e., sample size, unlabeled proportion of cases, classification base rate, and human coder reliability) influence content classification performance. The simulation indicated that the influence of sample size and unlabeled proportion on model classification performance tended to be curvilinear. In addition, base rate and human coder reliability had a strong effect on classification performance. Finally, using these results, we offer practical recommendations to psychologists on the necessary dataset characteristics to achieve valid prediction of content codes to guide researchers on the use of NLP models to replace human coders in content analysis research. (PsycInfo Database Record (c) 2022 APA, all rights reserved).
Preprint
View full text online at https://rdcu.be/cVaws. Abstract: When Behavior Analysis in Practice was founded 15 years ago, questions were raised about whether a practitioner-specific journal was really needed to complement our field's well-established applied research periodicals. The journal's purpose hinges on the distinction between scholarly impact, which is measured via citations, and dissemination impact, which implies influence on people who do not leave behind citations. Using altmetric data as an objective measure of dissemination impact, we present evidence that Behavior Analysis in Practice is becoming a leader in this domain among applied behavior analysis journals, and thus appears to be achieving exactly what it was designed to. We recommend explicitly relying on dissemination impact data to shape the journal's future development.
... Together, these sources can provide researchers with a good starting point. We should note that researchers may consider "creating" their own database of available information, such as by web scraping (Landers et al., 2016). ...
Article
Despite ample access to large, archival datasets, the micro-organizational sciences field seems to consistently cast these datasets aside in favor of primary datasets collected by independent researchers. In the current GoMusing, we argue that these archival datasets should not be a secondary (or even last) choice for the micro-organizational sciences. In fact, large archival datasets can enable researchers to (a) investigate phenomena of interest across generalizable samples, (b) incorporate multiple levels of context into research, and (c) take advantage of several additional methodological benefits. In the hopes of spurring a paradigm shift in the micro-organizational sciences, we begin our article by discussing problems with the standard approach to data collection (i.e., independent researchers collecting their own datasets). We then discuss how archival datasets can remedy many of these issues and advance the range of research questions the field is able to answer. We conclude by providing a step-by-step process for incorporating these archival datasets into our literature and provide insights into addressing common challenges. We hope this GoMusing will serve as a call to action for researchers and editorial teams alike to move our research forward through greater use of large archival datasets.
... Other how-to guides that may be of use include topic modeling in leadership (Tonidandel et al., 2021) and the organizational sciences more broadly (Hannigan et al., 2019; Schmiedel, Müller, & vom Brocke, 2019). Guides for leveraging web-based data (Landers, Brusso, Cavanaugh, & Collmus, 2016) and video-based methods also exist (Christianson, 2018; Congdon, Novack, & Goldin-Meadow, 2018), as do methods to conduct computational literature reviews (Antons, Breidbach, Joshi, & Salge, 2022). ...
Article
Leadership as a social influence process has always involved a complex set of phenomena that demands an interdisciplinary lens. Leadership scholarship has now entered into a digital era. In a digital era, the overall phenomenon is changing, as are the tools through which we study it, demanding a new “lens” through which we view leadership. Yet, this raises the question, to what extent is leadership different in a digital era? In acknowledgement of this trend, a special issue was commissioned at The Leadership Quarterly that sought to stimulate the imagination of leadership scholars and practitioners. In the current work, we begin with a brief review of the who, what, when, where and why of digital leadership. We cover leadership in informal contexts (e.g., social media), generalization from face-to-face to virtual contexts, computational modeling, the leveraging of technology (e.g., machine learning; Big Data), as well as methodological how-to guides. We then plot a path forward for leadership scholars in the dawn of the digital era.
Article
The purpose of this study was to compare basketball performance markers one year prior to initial severe lower extremity injury, including ankle, knee, and hip injuries, to one and two years following injury during the regular NBA season. Publicly available data were extracted through a reproducible, computer-programmed extraction process. Eligible participants were NBA players with at least three seasons played between 2008 and 2019, with a time-loss injury reported during the study period. Basketball performance was evaluated for season minutes, points, and rebounds. Prevalence of return to performance and linear regressions were calculated. 285 athletes sustained a severe lower extremity injury. 196 (69%) played one year and 130 (45%) played two years following the injury. Time to return to sport was similar between groin/hip/thigh [227 (88)], knee [260 (160)], or ankle [260 (77)] (P = 0.289). 58 (30%) players participated in a similar number of games and 57 (29%) scored similar points one year following injury. 48 (37%) participated in a similar number of games and 55 (42%) scored a similar number of points two years following injury. Fewer than half of basketball players who suffered a severe lower extremity injury were participating at the NBA level two years following injury, with similar findings for groin/hip/thigh, knee, and ankle injuries. Fewer than half of players were performing at pre-injury levels two years following injury. Suffering a severe lower extremity injury may be a prognostic factor that can assist sports medicine professionals to educate and set performance expectations for NBA players.
Chapter
In 2016, a number of language applications released chatbots to complement their programmes. Used primarily in informal learning settings, chatbots enable language learners to engage in conversational speaking practice, which can be perceived as less threatening than face-to-face interactions with native speakers. This study takes a closer look at four second language (L2) chatbots—Duolingo, Eggbun, Memrise, and Mondly—and analyses the experiences which informal language learners expressed on various online platforms (e.g., Duolingo forum, Memrise community, Reddit). Results indicate a degree of curiosity and a willingness to engage in conversation with chatbots. However, learners expressed frustration if the dialogues did not correspond to their learning goals or if they were excluded from using the bots because of technical or payment issues, or discontinuation of services.
Article
The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamouring for access to the massive quantities of information produced by and about people, things, and their interactions. Significant questions emerge. Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used to track protesters and suppress speech? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what 'research' means? Given the rise of Big Data as a socio-technical phenomenon, we argue that it is necessary to critically interrogate its assumptions and biases. In this article, we offer six provocations to spark conversations about the issues of Big Data: a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.
Article
Various factors, including age and gender, influence which coping strategies are mobilized under specific circumstances. The present paper focuses on the interrelationships between the ways of coping and some health-related variables in adolescence. Data were collected among secondary school students (n = 1039) in Szeged, Hungary. Factor analysis of the shortened and adapted version of the Ways of Coping Questionnaire gave a four-factor solution: passive coping, problem-analyzing coping, risky coping, and support-seeking coping. Passive and support-seeking ways of coping were more common among girls; however, the latter proved to be a more significant correlate of psychosocial health among boys. Among both boys and girls, passive and risky coping factors played a negative role, and problem-analyzing and support-seeking coping factors played a positive role in psychosocial health. Findings suggest that maladaptive coping and psychosocial health problems might form a vicious circle in which risk-taking as a way of coping might play a central role in adolescence. When adolescents despair of their problems, they often use drugs, smoke, or drink alcohol. They perceive it, however, as a form of risk-taking or sensation-seeking rather than a way of coping. That is why they do not reckon with its harmfulness and future consequences.
Book
How to write your own machine learning algorithms in Python.
Article
Various aspects of honeypots, a security resource whose value lies in being probed, attacked or compromised, are discussed. It is found that production honeypots are used to secure the organization by either preventing, detecting or assisting in the response to an attack. The analysis showed that research honeypots are more complex and more risky than production honeypots.
Book
Despite illustrious origins dating to the 1920s, qualitative crime research has long been overshadowed by quantitative inquiry. After decades of limited use, there has been a notable resurgence in crime ethnography, naturalistic inquiry, and related forms of fieldwork addressing crime and related social control efforts. The Routledge Handbook of Qualitative Criminology signals this momentum as the first major reference work dedicated to crime ethnography and related fieldwork orientations. Synthesizing the foremost topics and issues in qualitative criminology into a single definitive work, the Handbook provides a "first-look" reference source for scholars and students alike. The collection features twenty original chapters on leading qualitative crime research strategies, the complexities of collecting and analyzing qualitative data, and the ethical propriety of researching active criminals and incarcerated offenders. Contributions from both established luminaries and talented emerging scholars highlight the traditions and emerging trends in qualitative criminology through authoritative overviews and "lived experience" examples. Comprehensive and current, The Routledge Handbook of Qualitative Criminology promises to be a sound reference source for academics, students and practitioners as ethnography and fieldwork realize continued growth throughout the 21st Century. © 2015 selection and editorial material, Heith Copes and J. Mitchell Miller; individual chapters, the contributors. All rights reserved.
Article
Issues common to both the process of building psychological theories and validating personnel decisions are examined. Inferences linking psychological constructs and operational measures of constructs are organized into a conceptual framework, and validation is characterized as the process of accumulating various forms of judgmental and empirical evidence to support these inferences. The traditional concepts of construct-, content-, and criterion-related validity are unified within this framework. This unified view of validity is then contrasted with more conventional views (e.g., Uniform Guidelines, 1978), and misconceptions about the validation of employment tests are examined. Next, the process of validating predictor constructs is extended to delineate the critical inferences unique to validating performance criteria. Finally, an agenda for programmatic personnel selection research is described, emphasizing a shift in the behavioral scientist's role in the personnel selection process.
Article
Interest in the problem of method biases has a long history in the behavioral sciences. Despite this, a comprehensive summary of the potential sources of method biases and how to control for them does not exist. Therefore, the purpose of this article is to examine the extent to which method biases influence behavioral research results, identify potential sources of method biases, discuss the cognitive processes through which method biases influence responses to measures, evaluate the many different procedural and statistical techniques that can be used to control method biases, and provide recommendations for how to select appropriate procedural and statistical remedies for different types of research settings.
Article
This article explores what happens to interpersonal and power dynamics when tutors use closed-group Facebook pages as a social networking tool in their tutorial groups with first and second year Bachelor of Education (BEd) students at the Wits School of Education (WSoE). It argues that this literacy practice creates an alternative pedagogical space that enables critical practices in relation to writing. These pages create a space that brings students' out-of-school literacy practices into a domain which normally promotes formal, academic literacy practices; a schooled space; a space where students feel safe enough to make their voices heard; a space where there are some interesting shifts in power relationships, identities, norms of communication and modes of learning. The research analyses the writing of tutors and students on these pages from a critical literacy perspective and makes use of the critical literacy model presented by Janks (2010) to see how this space changes issues of power, access, diversity and design by creating new relationships and new forms of interaction, language and texts.