Conference Paper

Dispensing with Humans in Human-Computer Interaction Research

Authors: Byun, Vasicek, and Seppi
... Although the deployment of LLMs in qualitative research is in 'its status nascendi' [in the state of being born] (De Paoli 2024, 3), there are ongoing debates about their role and effectiveness in processing and analyzing data. Some researchers (Byun, Vasicek, and Seppi 2023; Rietz and Maedche 2021) suggest that AI can match human capabilities in processing qualitative data and highlight the potential for LLMs to learn human coding practices and recognize linguistic patterns in textual analysis. This suggests that these models can mimic the subjective nature of qualitative analysis to a certain degree. ...
... This suggests that these models can mimic the subjective nature of qualitative analysis to a certain degree. Byun, Vasicek, and Seppi (2023) argue that LLMs could rival human capabilities in generating and organizing thematic structures, especially because of their ability to process large amounts of data and expedite the analysis. Researchers have also highlighted the potential of LLMs to mitigate certain limitations of traditional qualitative research particularly in relation to scalability, efficiency, and the ability to identify latent patterns across vast corpora that might otherwise be missed in manual analysis. ...
... The integration of LLMs in qualitative research presents both opportunities and challenges. As previous studies have suggested (Bano et al. 2024; Byun, Vasicek, and Seppi 2023), AI models can rapidly process large volumes of text, reducing the time required for initial coding and thematic categorization. However, our analysis highlights that GPT-4 lacks cultural consciousness (Shi-xu 2024) and frequently neutralizes or depoliticizes controversial discourse, avoiding explicit recognition of hate speech, cultural racism, and exclusionary rhetoric. ...
... Although the deployment of LLMs in qualitative research is in 'its status nascendi' [in the state of being born] (De Paoli, 2023:3), there are ongoing debates about their role and effectiveness in processing and analyzing data. Some researchers (Byun et al. 2023; Rietz and Maedche, 2021) suggest that AI can match human capabilities in processing qualitative data and highlight the potential for LLMs to learn human coding practices, which suggests that these models can adapt to the subjective nature of qualitative analysis. Byun et al. (2023) argue that LLMs could rival human capabilities in generating and analyzing qualitative content, especially because of their ability to process large amounts of data and expedite the analysis. ...
... Some researchers (Byun et al. 2023; Rietz and Maedche, 2021) suggest that AI can match human capabilities in processing qualitative data and highlight the potential for LLMs to learn human coding practices, which suggests that these models can adapt to the subjective nature of qualitative analysis. Byun et al. (2023) argue that LLMs could rival human capabilities in generating and analyzing qualitative content, especially because of their ability to process large amounts of data and expedite the analysis. Researchers also highlight the potential of LLMs to overcome typical limitations of qualitative research performed by human researchers, especially in relation to processing large datasets, generalizing results to larger contexts, and avoiding subjectivity. ...
... The integration of LLMs in qualitative research presents promising opportunities for enhancing research methodologies, but it also introduces significant challenges. As Byun et al. (2023) suggest, AI can match human capabilities in processing qualitative data, but it requires careful guidance to avoid discrepancies between AI and human reasoning. Our study demonstrates that while GPT-4 can generate useful initial categorizations, the depth and specificity provided by human researchers are crucial for accurate and meaningful analysis. ...
Preprint
Full-text available
In the dynamic field of artificial intelligence (AI), the development and application of Large Language Models (LLMs) for text analysis are of significant academic interest. Despite the promising capabilities of various LLMs in conducting qualitative analysis, their use in the humanities and social sciences has not been thoroughly examined. This article contributes to the emerging literature on LLMs in qualitative analysis by documenting an experimental study involving GPT-4. The study focuses on performing thematic analysis (TA) using a YouTube dataset derived from an EU-funded project, which was previously analyzed by other researchers. This dataset is about the representation of Roma migrants in Sweden during 2016, a period marked by the aftermath of the 2015 refugee crisis and preceding the Swedish national elections in 2017. Our study seeks to understand the potential of combining human intelligence with AI's scalability and efficiency, examining the advantages and limitations of employing LLMs in qualitative research within the humanities and social sciences. Additionally, we discuss future directions for applying LLMs in these fields.
... These advanced AI applications have been meticulously designed and trained on vast datasets, allowing them to generate human-like text to answer questions, write essays, summarise text, and even engage in conversations (Dergaa et al., 2023). The promise they offer is not just in their ability to process information but also in their potential to mimic human-like comprehension and generation of text (Byun, Vasicek, and Seppi, 2023). ...
... A strand of research has also evaluated AI's performance in specific tasks traditionally conducted by humans. Byun, Vasicek, and Seppi (2023) showed that AI can conduct qualitative analysis and generate nuanced results comparable to those of human researchers. In another task-specific study, Gilson et al. (2023) demonstrated that ChatGPT could answer medical examination questions at a level similar to a third-year medical student, underscoring its potential as an educational tool. ...
... Notably, a lack of qualitative research comparing human reasoning against LLMs is evident. Burger et al. (2023), and Byun, Vasicek, & Seppi (2023) have made an initial foray into this area, demonstrating that ChatGPT can perform certain research tasks traditionally undertaken by human researchers, producing complex and nuanced analyses of qualitative data with results arguably comparable to human-generated outputs. Despite these promising findings, these studies do not investigate AI and human reasoning within the qualitative research context. ...
Article
Full-text available
Context: The advent of AI-driven large language models (LLMs), such as ChatGPT 3.5 and GPT-4, has stirred discussions about their role in qualitative research. Some view these as tools to enrich human understanding, while others perceive them as threats to the core values of the discipline. Problem: A significant concern revolves around the disparity between AI-generated classifications and human comprehension, prompting questions about the reliability of AI-derived insights. An “AI echo chamber” could potentially risk the diversity inherent in qualitative research. A minimal overlap between AI and human interpretations amplifies concerns about the fading human element in research. Objective: This study aimed to compare and contrast the comprehension capabilities of humans and LLMs, specifically ChatGPT 3.5 and GPT-4. Methodology: We conducted an experiment with a small sample of Alexa app reviews, initially classified by a human analyst. ChatGPT 3.5 and GPT-4 were then asked to classify these reviews and provide the reasoning behind each classification. We compared the results with human classification and reasoning. Results: The research indicated a significant alignment between human and ChatGPT 3.5 classifications in one-third of cases, and a slightly lower alignment with GPT-4 in over a quarter of cases. The two AI models showed a higher alignment, observed in more than half of the instances. However, a consensus across all three methods was seen only in about one-fifth of the classifications. In the comparison of human and LLM reasoning, it appears that human analysts lean heavily on their individual experiences. As expected, LLMs, on the other hand, base their reasoning on the specific word choices found in app reviews and the functional components of the app itself. Conclusion: Our results highlight the potential for effective human-LLM collaboration, suggesting a synergistic rather than competitive relationship. Researchers must continuously evaluate LLMs’ role in their work, thereby fostering a future where AI and humans jointly enrich qualitative research.
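For readers unfamiliar with how such alignment figures are typically computed, a minimal sketch follows: count pairwise matches between the label each rater (human or model) assigned to the same review, plus three-way consensus. The reviews and labels below are invented placeholders, not the study's data.

```python
# Sketch of the agreement arithmetic behind the reported alignment figures.
# All labels here are made-up placeholders for illustration only.
from itertools import combinations

labels = {
    "human":   ["usability", "privacy", "feature_request", "usability", "bug", "privacy"],
    "gpt_3_5": ["usability", "privacy", "usability",        "usability", "bug", "feature_request"],
    "gpt_4":   ["feature_request", "privacy", "usability",  "usability", "bug", "privacy"],
}

n = len(labels["human"])

# Pairwise alignment: fraction of reviews where two raters chose the same label.
for a, b in combinations(labels, 2):
    matches = sum(x == y for x, y in zip(labels[a], labels[b]))
    print(f"{a} vs {b}: {matches}/{n} aligned")

# Three-way consensus: reviews where all raters agree.
consensus = sum(len({labels[k][i] for k in labels}) == 1 for i in range(n))
print(f"three-way consensus: {consensus}/{n}")
```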
... These advanced AI applications have been meticulously designed and trained on vast datasets, allowing them to generate human-like text that can answer questions, write essays, summarize text, and even engage in conversations (Dergaa et al. 2023). The promise they offer is not just in their ability to process information but also in their potential to mimic human-like comprehension and generation of text (Byun, Vasicek, and Seppi 2023). ...
... A strand of research has also evaluated AI's performance in specific tasks traditionally conducted by humans. Byun, Vasicek, and Seppi (2023) showed that AI can conduct qualitative analysis and generate nuanced results comparable to those of human researchers. In another task-specific study, Gilson et al. (2023) demonstrated that ChatGPT could answer medical examination questions at a level similar to a third-year medical student, underscoring its potential as an educational tool. ...
... Notably, a lack of qualitative research comparing human reasoning against LLMs is evident. Burger et al. (2023) and Byun et al. (2023) have made an initial foray into this area, demonstrating that ChatGPT can perform certain research tasks traditionally undertaken by human researchers, producing complex and nuanced analyses of qualitative data with results arguably comparable to human-generated outputs. Despite these promising findings, these studies do not investigate AI and human reasoning within the qualitative research context. ...
Preprint
The advent of AI-driven large language models (LLMs) has stirred discussions about their role in qualitative research. Some view these as tools to enrich human understanding, while others perceive them as threats to the core values of the discipline. This study aimed to compare and contrast the comprehension capabilities of humans and LLMs. We conducted an experiment with a small sample of Alexa app reviews, initially classified by a human analyst. LLMs were then asked to classify these reviews and provide the reasoning behind each classification. We compared the results with human classification and reasoning. The research indicated a significant alignment between human and ChatGPT 3.5 classifications in one-third of cases, and a slightly lower alignment with GPT-4 in over a quarter of cases. The two AI models showed a higher alignment, observed in more than half of the instances. However, a consensus across all three methods was seen only in about one-fifth of the classifications. In the comparison of human and LLM reasoning, it appears that human analysts lean heavily on their individual experiences. As expected, LLMs, on the other hand, base their reasoning on the specific word choices found in app reviews and the functional components of the app itself. Our results highlight the potential for effective human-LLM collaboration, suggesting a synergistic rather than competitive relationship. Researchers must continuously evaluate LLMs' role in their work, thereby fostering a future where AI and humans jointly enrich qualitative research.
... The work of Byun et al. [8], on the other hand, reports the use of LLMs as a substitute for humans in qualitative analysis, a notoriously time-consuming process. The authors argue that the technologies used can identify themes and produce analyses comparable to those conducted by humans. ...
... The work of Byun et al. [8], in turn, reports the use of LLMs as a substitute for humans in a qualitative analysis, a notoriously costly process. The authors argue that the technologies used can identify themes and produce analyses comparable to those carried out by humans. ...
... This aims to support AI practitioners during the initial phases of the AI design process, including reflexivity, brainstorming, and deliberation. While LLMs have demonstrated utility in diverse applications (Gilardi, Alizadeh, and Kubli 2023; Wu, Terry, and Cai 2022; Dowling and Lucey 2023; Byun, Vasicek, and Seppi 2023), their suitability for two specific tasks, identifying potential uses of a given AI technology and conducting legal risk assessments of its uses, remains an open question. Our aim is not to produce an exhaustive list of uses for a given AI technology, nor to provide a definitive risk classification. ...
... Moreover, even when the uses of AI are known, they can bring unanticipated challenges, from privacy and security issues (Ekambaranathan, Zhao, and Van Kleek 2021) to distorting human beliefs (Kidd and Birhane 2023), excessive dependence that could diminish crucial human skills (Byun, Vasicek, and Seppi 2023; Lu and Yin 2021), and negative environmental impacts (Rillig et al. 2023), as well as impacts on human rights and society (Mantelero 2022). Anticipating such challenges and broader, systemic impacts of technology remains a significant challenge for AI practitioners (Prunkl et al. 2021; Weidinger et al. 2023). ...
Article
Responsible AI design is increasingly seen as an imperative by both AI developers and AI compliance experts. One of the key tasks is envisioning AI technology uses and risks. Recent studies on the model and data cards reveal that AI practitioners struggle with this task due to its inherently challenging nature. Here, we demonstrate that leveraging a Large Language Model (LLM) can support AI practitioners in this task by enabling reflexivity, brainstorming, and deliberation, especially in the early design stages of the AI development process. We developed an LLM framework, ExploreGen, which generates realistic and varied uses of AI technology, including those overlooked by research, and classifies their risk level based on the EU AI Act regulation. We evaluated our framework using the case of Facial Recognition and Analysis technology in nine user studies with 25 AI practitioners. Our findings show that ExploreGen is helpful to both developers and compliance experts. They rated the uses as realistic and their risk classification as accurate (94.5%). Moreover, while unfamiliar with many of the uses, they rated them as having high adoption potential and transformational impact.
... These models are extremely efficient for tasks such as answering queries and generating content, including text annotation [21,45,63] and powering chatbots that assist with mental health concerns [17,104]. LLMs are known for providing insights that often exceed general public knowledge [45], playing a significant role in collaborative processes between humans and AI [21,63,111,122], and in some instances, their performance can match that of expert opinions [14,35,89]. To achieve the desired output from LLMs, a set of best practices for prompt engineering has emerged [100,106]. ...
... Uses that pose low risk primarily promote goals related to environmental protection, such as life below the water (goal 14), life on land (goal 15), and clean water and sanitation (goal 6) (navy bars in Figure 3). These uses are classified as low risk by the EU AI Act. ...
Preprint
Full-text available
Integrating Artificial Intelligence (AI) into mobile and wearables offers numerous benefits at individual, societal, and environmental levels. Yet, it also spotlights concerns over emerging risks. Traditional assessments of risks and benefits have been sporadic, and often require costly expert analysis. We developed a semi-automatic method that leverages Large Language Models (LLMs) to identify AI uses in mobile and wearables, classify their risks based on the EU AI Act, and determine their benefits that align with globally recognized long-term sustainable development goals; a manual validation of our method by two experts in mobile and wearable technologies, a legal and compliance expert, and a cohort of nine individuals with legal backgrounds who were recruited from Prolific, confirmed its accuracy to be over 85%. We uncovered that specific applications of mobile computing hold significant potential in improving well-being, safety, and social equality. However, these promising uses are linked to risks involving sensitive data, vulnerable groups, and automated decision-making. To avoid rejecting these risky yet impactful mobile and wearable uses, we propose a risk assessment checklist for the Mobile HCI community.
... Automated computational tools (e.g., LLMs and other AI tools) are increasingly being deployed to complete various tasks and workflows in professional settings, including research tasks [4,12,28]. This has already led to (sometimes tongue-in-cheek, sometimes earnest) speculation and exploration of whether computational tools could replace humans in the research process altogether [8,26]. This is generally motivated by the substantial impact it might have on time saved, especially in the context of secondary research, which is generally quite time-consuming [5,32,39]. ...
Preprint
Full-text available
Automation and semi-automation through computational tools like LLMs are also making their way to deployment in research synthesis and secondary research, such as systematic reviews. In some steps of research synthesis, this has the opportunity to provide substantial benefits by saving time that previously was spent on repetitive tasks. The screening stages in particular may benefit from carefully vetted computational support. However, this position paper argues for additional caution when bringing in such tools to the analysis and synthesis phases, where human judgement and expertise should be paramount throughout the process.
... In parallel to works in the machine learning community on LLM evaluation, there have been fantastic efforts in the HCI community on comparing generative model outputs as well as on using LLMs for qualitative analysis. Works like Torii et al. (2024) and Byun et al. (2023) use LLMs to generate discussions from qualitative research data to automate the data analysis process, but note the lack of comprehensive evaluation metrics. Automated data analysis on unstructured data has also been explored in Zhong et al. (2022; 2023) and Dunlap et al. (2024b), which use LLMs and VLMs to propose and validate candidate differences between two sets of text or images in the form of "set A contains more X", and Chiquier et al. (2024) employs an evolutionary algorithm to find text descriptions which best separate image classes to assist in zero-shot classification. ...
Preprint
Full-text available
Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These "vibes", such as tone, formatting, or writing style, influence user preferences, yet traditional evaluations focus primarily on the single axis of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model ("vibes") that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs, then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with llama-3-70b vs. GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks including summarization, math, and captioning to provide insight into differences in model behavior. Some of the vibes we find are that Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNGL, Llama-405b often over-explains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash.
... Trained across diverse fields like computing, literature, and psychology, LLMs offer versatility for various tasks, including natural language programming (code generation) and code analysis. This rise in LLM usage has spurred extensive research exploring their efficiency across different tasks [13,55,63,75,85]. ...
Preprint
Full-text available
Large Language Models (LLMs) are transforming programming practices, offering significant capabilities for code generation activities. While researchers have explored the potential of LLMs in various domains, this paper focuses on their use in programming tasks, drawing insights from user studies that assess the impact of LLMs on such tasks. We first examined the user interaction behaviors with LLMs observed in these studies, from the types of requests made to task completion strategies. Additionally, our analysis reveals both benefits and weaknesses of LLMs, showing mixed effects on the human and the task. Lastly, we looked into what factors, from the human, the LLM, or the interaction of both, affect the human's enhancement as well as task performance. Our findings highlight the variability in human-LLM interactions due to the non-deterministic nature of both parties (humans and LLMs), underscoring the need for a deeper understanding of these interaction patterns. We conclude by providing some practical suggestions for researchers as well as programmers.
... The increasing reliance on LLMs within research has sparked critical discourse about their limitations and the broader implications of using AI as proxies for human behavior. While proponents argue that LLMs can improve research efficiency, scale, and diversity (see, e.g., research papers such as [9,24,29] and products such as [54,55]), there is growing scholarship that explores the ethical implications of replacing human participants with AI-generated data [3]. LLMs' effectiveness in replacing human participants is (partly) contingent on their ability to represent the perspectives of different identities. ...
Preprint
Full-text available
The recent excitement around generative models has sparked a wave of proposals suggesting the replacement of human participation and labor in research and development (e.g., through surveys, experiments, and interviews) with synthetic research data generated by large language models (LLMs). We conducted interviews with 19 qualitative researchers to understand their perspectives on this paradigm shift. Initially skeptical, researchers were surprised to see similar narratives emerge in the LLM-generated data when using the interview probe. However, over several conversational turns, they went on to identify fundamental limitations, such as how LLMs foreclose participants' consent and agency, produce responses lacking in palpability and contextual depth, and risk delegitimizing qualitative research methods. We argue that the use of LLMs as proxies for participants enacts the surrogate effect, raising ethical and epistemological concerns that extend beyond the technical limitations of current models to the core of whether LLMs fit within qualitative ways of knowing.
... We considered using AI for (1) generating translations of questions, (2) translating participant responses, and (3) analyzing responses. However, given the ongoing debates about employing Large Language Models (LLMs) in research (e.g., Paxton 2023; Rastogi et al. 2023; Hosseini, Resnik, and Holmes 2023; Byun, Vasicek, and Seppi 2023), we realized that using AI without informing participants would be inappropriate. We took into account the following: ...
Preprint
Full-text available
Calls for engagement with the public in Artificial Intelligence (AI) research, development, and governance are increasing, leading to the use of surveys to capture people's values, perceptions, and experiences related to AI. In this paper, we critically examine the state of human participant surveys associated with these topics. Through both a reflexive analysis of a survey pilot spanning six countries and a systematic literature review of 44 papers featuring public surveys related to AI, we explore prominent perspectives and methodological nuances associated with surveys to date. We find that public surveys on AI topics are vulnerable to specific Western knowledge, values, and assumptions in their design, including in their positioning of ethical concepts and societal values, lack sufficient critical discourse surrounding deployment strategies, and demonstrate inconsistent forms of transparency in their reporting. Based on our findings, we distill provocations and heuristic questions for our community, to recognize the limitations of surveys for meeting the goals of engagement, and to cultivate shared principles to design, deploy, and interpret surveys cautiously and responsibly.
... A number of claims have been made about emergent properties arising from the massive increase in size and data reflected in LLMs. These include the potential to mimic human performance in a number of psycholinguistic tasks (Dillion, Tandon, Gu, & Gray, 2023; Trott, Jones, Chang, Michaelov, & Bergen, 2023; Cai, Haslett, Duan, Wang, & Pickering, 2023), understanding cognitive states (Trott et al., 2023), and being capable of generating passable (if bland) synthetic discourse (Byun, Vasicek, & Seppi, 2023). This performance is likely based on knowledge of statistical distributions of word usage, as has been suggested in prior work (Futrell & Hahn, 2022). ...
Conference Paper
Full-text available
Despite the long-standing theoretical importance of the concept of illocutionary force in communication (Austin, 1975), quantitative measurement of it has remained elusive. The following study seeks to measure the influence of illocutionary force on the degree to which subreddit community members maintain the concepts and ideas of previous community members' comments when they reply to each other's content. We leverage an information-theoretic framework implementing a measurement of linguistic convergence to capture how much of a previous comment can be recovered from its replies. To show the effect of illocutionary force, we then ask a large language model (LLM) to write a reply to the same previous comment as though it were a member of that subreddit community. Because LLMs inherently lack illocutionary intent but produce plausible utterances, they can function as a useful control to test the contribution of illocutionary intent and the effect it may have on the language in human-generated comments. We find that LLM replies indeed have statistically significantly lower entropy with prior comments than human replies to the same comments. While this says very little about LLMs on the basis of how they are trained, this difference offers a quantitative baseline to assess the effect of illocutionary force on the flow of information in online discourse.
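The paper's exact information-theoretic measure is not reproduced in this listing; as a rough illustration of the underlying idea (how much of a prior comment is recoverable from a reply), the sketch below scores a reply by the smoothed unigram cross-entropy of the prior comment's tokens under a model estimated from that reply, where lower values indicate stronger lexical convergence. The tokenizer, smoothing scheme, and example comments are assumptions for illustration only, not the authors' framework.

```python
# Minimal sketch: cross-entropy (bits per token) of a prior comment under an
# add-alpha smoothed unigram model estimated from a reply. Lower values mean
# more of the prior comment is "recoverable" from the reply.
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would do better.
    return text.lower().split()

def cross_entropy(prior_comment, reply, vocab_size=50_000, alpha=1.0):
    reply_counts = Counter(tokenize(reply))
    total = sum(reply_counts.values())
    prior_tokens = tokenize(prior_comment)
    bits = 0.0
    for tok in prior_tokens:
        p = (reply_counts[tok] + alpha) / (total + alpha * vocab_size)
        bits += -math.log2(p)
    return bits / len(prior_tokens)

prior = "the mods already pinned a thread about the new posting rules"
reply_a = "yeah the pinned thread explains the new posting rules pretty well"
reply_b = "thanks for sharing, interesting discussion all around"

print("reply_a:", round(cross_entropy(prior, reply_a), 2), "bits/token")
print("reply_b:", round(cross_entropy(prior, reply_b), 2), "bits/token")
```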
... Although LLMs show great promise in facilitating the deductive coding process for qualitative research, they should not be viewed as a replacement for humans, as recently suggested (Byun, Vasicek, and Seppi 2023). Human input is invaluable not only for the expertise needed to craft and finetune suitable prompts to generate appropriate coding schemes, but also for the validation of LLM outputs and to monitor for misleading bias and hallucinations. ...
Article
Many behavioral science studies result in large amounts of unstructured data sets that are costly to code and analyze, requiring multiple reviewers to agree on systematically chosen concepts and themes to categorize responses. Large language models (LLMs) have potential to support this work, demonstrating capabilities for categorizing, summarizing, and otherwise organizing unstructured data. In this paper, we consider that although LLMs have the potential to save time and resources performing coding on qualitative data, the implications for behavioral science research are not yet well understood. Model bias and inaccuracies, reliability, and lack of domain knowledge all necessitate continued human guidance. New methods and interfaces must be developed to enable behavioral science researchers to efficiently and systematically categorize unstructured data together with LLMs. We propose a framework for incorporating human feedback into an annotation workflow, leveraging interactive machine learning to provide oversight while improving a language model's predictions over time.
... Hence, an LLM can reason and answer questions about text data such as user reviews. Our main motivation to use LLMs in this study was inspired by the results of a recent investigation conducted by Byun et al. [8]. In their study, they used ChatGPT-3 to analyze text data (e.g. ...
... The work of Byun et al. (2023) shows that LLMs like GPT-3 have the capacity to produce text that is comparable to that written by humans, even in qualitative analysis, which traditionally relies heavily on human insight. Their work demonstrates that AI can not only generate text but also identify themes and provide detailed analysis similar to that of human researchers. ...
Article
Full-text available
The recent surge in the integration of Large Language Models (LLMs) like ChatGPT into qualitative research in software engineering, much like in other professional domains, demands a closer inspection. This vision paper seeks to explore the opportunities of using LLMs in qualitative research to address many of its legacy challenges as well as potential new concerns and pitfalls arising from the use of LLMs. We share our vision for the evolving role of the qualitative researcher in the age of LLMs and contemplate how they may utilize LLMs at various stages of their research experience.
... The work of Byun et al. (2023) shows that LLMs like GPT-3 have the capacity to produce text that is comparable to that written by humans, even in qualitative analysis, which traditionally relies heavily on human insight. Their work demonstrates that AI can not only generate text but also identify themes and provide detailed analysis similar to that of human researchers. ...
Preprint
Full-text available
The recent surge in the integration of Large Language Models (LLMs) like ChatGPT into qualitative research in software engineering, much like in other professional domains, demands a closer inspection. This vision paper seeks to explore the opportunities of using LLMs in qualitative research to address many of its legacy challenges as well as potential new concerns and pitfalls arising from the use of LLMs. We share our vision for the evolving role of the qualitative researcher in the age of LLMs and contemplate how they may utilize LLMs at various stages of their research experience.
Preprint
Full-text available
Large Language Models (LLMs) created new opportunities for generating personas, which are expected to streamline and accelerate the human-centered design process. Yet, AI-generated personas may not accurately represent actual user experiences, as they can miss contextual and emotional insights critical to understanding real users' needs and behaviors. This paper examines the differences in how users perceive personas created by LLMs compared to those crafted by humans regarding their credibility for design. We gathered ten human-crafted personas developed by HCI experts according to relevant attributes established in related work. Then, we systematically generated ten personas and compared them with human-crafted ones in a survey. The results showed that participants differentiated between human-created and AI-generated personas, with the latter being perceived as more informative and consistent. However, participants noted that the AI-generated personas tended to follow stereotypes, highlighting the need for a greater emphasis on diversity when utilizing LLMs for persona creation.
Article
Integrating Artificial Intelligence (AI) into mobile and wearables offers numerous benefits at individual, societal, and environmental levels. Yet, it also spotlights concerns over emerging risks. Traditional assessments of risks and benefits have been sporadic, and often require costly expert analysis. We developed a semi-automatic method that leverages Large Language Models (LLMs) to identify AI uses in mobile and wearables, classify their risks based on the EU AI Act, and determine their benefits that align with globally recognized long-term sustainable development goals; a manual validation of our method by two experts in mobile and wearable technologies, a legal and compliance expert, and a cohort of nine individuals with legal backgrounds who were recruited from Prolific, confirmed its accuracy to be over 85%. We uncovered that specific applications of mobile computing hold significant potential in improving well-being, safety, and social equality. However, these promising uses are linked to risks involving sensitive data, vulnerable groups, and automated decision-making. To avoid rejecting these risky yet impactful mobile and wearable uses, we propose a risk assessment checklist for the Mobile HCI community.
Chapter
In this chapter, we will explore possible future directions in qualitative research. First, we will consider an overview of the opportunities, risks, and limitations associated with the possible use of large language models (LLMs) in qualitative research. Next, we will explore capabilities of ChatGPT in analysing qualitative data and formulating theory. Finally, we will conclude by considering some further applications of the socio-technical research framework beyond grounded theory and of socio-technical grounded theory beyond the software engineering discipline.
Article
Full-text available
GPT-3 is a large-scale natural language model developed by OpenAI that can perform many different tasks, including topic classification. Although researchers claim that it requires only a small number of in-context examples to learn a task, in practice GPT-3 requires these training examples to be either of exceptional quality or a higher quantity than easily created by hand. To address this issue, this study teaches GPT-3 to classify whether a question is related to data science by augmenting a small training set with additional examples generated by GPT-3 itself. This study compares two augmented classifiers: the Classification Endpoint with an increased training set size and the Completion Endpoint with an augmented prompt optimized using a genetic algorithm. We find that data augmentation significantly increases the accuracy of both classifiers, and that the embedding-based Classification Endpoint achieves the best accuracy of about 76%, compared to human accuracy of 85%. In this way, giving large language models like GPT-3 the ability to propose their own training examples can improve short text classification performance.
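As a rough illustration of the augmentation idea (an LLM proposes extra labeled examples to grow a small hand-made training set before fitting a classifier), the sketch below stubs out the generation step and trains a TF-IDF plus logistic-regression classifier; these component choices are assumptions for illustration and do not correspond to the GPT-3 Classification or Completion Endpoints evaluated in the study.

```python
# Sketch: augment a handful of hand-written (question, label) pairs with
# model-proposed examples, then fit a lightweight text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

seed = [
    ("How do I tune hyperparameters for a random forest?", "data-science"),
    ("What is a p-value and when is it misleading?", "data-science"),
    ("What time does the campus cafeteria open?", "other"),
    ("Who won the football game last night?", "other"),
]

def generate_examples(seed_pairs):
    # In practice this would prompt an LLM with the seed pairs and parse the
    # generated (question, label) pairs; hard-coded here so the sketch runs
    # without API access.
    return [
        ("Which evaluation metric suits an imbalanced dataset?", "data-science"),
        ("How should I split data into train and test sets?", "data-science"),
        ("Where can I park near the library?", "other"),
        ("What is the weather forecast for tomorrow?", "other"),
    ]

texts, labels = zip(*(seed + generate_examples(seed)))
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["Is gradient boosting better than a neural network here?"]))
```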
Preprint
Full-text available
GPT-3 is a large-scale natural language model developed by OpenAI that can perform many different tasks, including topic classification. Although researchers claim that it requires only a small number of in-context examples to learn a task, in practice GPT-3 requires these training examples to be either of exceptional quality or a higher quantity than easily created by hand. To address this issue, this study teaches GPT-3 to classify whether a question is related to data science by augmenting a small training set with additional examples generated by GPT-3 itself. This study compares two classifiers: the GPT-3 Classification Endpoint with augmented examples, and the GPT-3 Completion Endpoint with an optimal training set chosen using a genetic algorithm. We find that while the augmented Completion Endpoint achieves upwards of 80 percent validation accuracy, using the augmented Classification Endpoint yields more consistent accuracy on unseen examples. In this way, giving large-scale machine learning models like GPT-3 the ability to propose their own additional training examples can result in improved classification performance.
Article
Full-text available
Background: Qualitative methods analyze contextualized, unstructured data. These methods are time and cost intensive, often resulting in small sample sizes and yielding findings that are complicated to replicate. Integrating natural language processing (NLP) into a qualitative project can increase efficiency through time and cost savings; increase sample sizes; and allow for validation through replication. This study compared the findings, costs, and time spent between a traditional qualitative method (Investigator only) and a method pairing a qualitative investigator with an NLP function (Investigator + NLP). Methods: Using secondary data from a previously published study, the investigators designed an NLP process in Python to yield a corpus, keywords, keyword influence, and the primary topics. A qualitative researcher reviewed and interpreted the output. These findings were compared to the previous study results. Results: Using comparative review, our results closely matched the original findings. The NLP + Investigator method reduced the project time by a minimum of 120 hours and costs by $1,500. Discussion: Qualitative research can evolve by incorporating NLP methods. These methods can increase sample size, reduce project time, and significantly reduce costs. The results of an integrated NLP process create a corpus and code which can be reviewed and verified, thus allowing a replicable, qualitative study. New data can be added over time and analyzed using the same interpretation and identification. Off-the-shelf qualitative software may be easier to use, but it can be expensive and may not offer a tailored approach or easily interpretable outcomes, which further benefits researchers.
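The study's Python pipeline is not included in this listing; the sketch below illustrates the kind of Investigator + NLP workflow it describes (a corpus, keyword influence, and primary topics for a researcher to interpret) using scikit-learn TF-IDF weights and LDA topics, which are assumed component choices rather than the authors' actual code, and a made-up toy corpus.

```python
# Sketch: corpus -> influential keywords (TF-IDF) -> candidate topics (LDA),
# with a human researcher reviewing and interpreting the output.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "The clinic wait times were long but the nurses were supportive",
    "Scheduling follow-up appointments online saved me several trips",
    "I felt rushed during the consultation and left with questions",
    "The online portal made refilling prescriptions much easier",
]

# Keyword influence: mean TF-IDF weight of each term across the corpus.
tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(corpus).mean(axis=0).A1
keywords = sorted(zip(tfidf.get_feature_names_out(), weights),
                  key=lambda kv: kv[1], reverse=True)[:10]
print("top keywords:", keywords)

# Primary topics: LDA over raw term counts; the researcher labels each topic.
counts = CountVectorizer(stop_words="english")
doc_term = counts.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)
terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"topic {i}:", top)
```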
Conference Paper
Full-text available
A months-long hike of the Appalachian Trail often involves long-term preparation and life-altering decisions. Would-be hikers leverage institutional knowledge from literature and online forums to physically and mentally prepare for such an arduous hike. Their use of social platforms provides useful insights into motivations for undertaking the thru-hike, how they deal with unexpected conditions on the trail, and the choices made in conditions of scarcity. By analyzing over 100,000 Reddit posts and comments in r/AppalachianTrail and applying Sense of Community theory, we sought to understand hikers' identity as community members, how their emotional and practical needs are met, and how they evolve. We found that the role and language of thru-hikers change as they progress through the pre-hike, on-hike, and post-hike stages, from a questioner early on to an expert post-hike. We conclude with design recommendations to support offline communities online.
Article
Full-text available
Thematic analysis is a poorly demarcated, rarely acknowledged, yet widely used qualitative analytic method within psychology. In this paper, we argue that it offers an accessible and theoretically flexible approach to analysing qualitative data. We outline what thematic analysis is, locating it in relation to other qualitative analytic methods that search for themes or patterns, and in relation to different epistemological and ontological positions. We then provide clear guidelines to those wanting to start thematic analysis, or conduct it in a more deliberate and rigorous way, and consider potential pitfalls in conducting thematic analysis. Finally, we outline the disadvantages and advantages of thematic analysis. We conclude by advocating thematic analysis as a useful and flexible method for qualitative research in and beyond psychology.
Article
This study investigates the effects of task demonstrability and replacing a human advisor with a machine advisor. Outcome measures include advice-utilization (trust), the perception of advisors, and decision-maker emotions. Participants were randomly assigned to make a series of forecasts dealing with either humanitarian planning (low demonstrability) or management (high demonstrability). Participants received advice from either a machine advisor only, a human advisor only, or their advisor was replaced with the other type of advisor (human/machine) midway through the experiment. Decision-makers rated human advisors as more expert, more useful, and more similar. Perception effects were strongest when a human advisor was replaced by a machine. Decision-makers also experienced more negative emotions, lower reciprocity, and faulted their advisor more for mistakes when a human was replaced by a machine.
Article
Social researchers often apply qualitative research methods to study groups and their communications artifacts. The use of computer-mediated communications has dramatically increased the volume of text available, but coding such text requires considerable manual effort. We discuss how systems that process text in human languages (i.e. natural language processing [NLP]) might partially automate content analysis by extracting theoretical evidence. We present a case study of the use of NLP for qualitative analysis in which the NLP rules showed good performance on a number of codes. With the current level of performance, use of an NLP system could reduce the amount of text to be examined by a human coder by an order of magnitude or more, potentially increasing the speed of coding by a comparable degree. The paper is significant as it is one of the first to demonstrate the use of high-level NLP techniques for qualitative data analysis.
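As a rough sketch of how rule-based NLP can pre-code text so a human coder reviews only flagged passages, the example below matches simple lexical patterns against messages; the codes and patterns are invented for illustration and are not the rules from the case study.

```python
# Sketch: lexical rules flag passages as candidate evidence for a code,
# reducing the volume of text a human coder must examine.
import re

CODE_RULES = {
    "coordination": re.compile(r"\b(meeting|schedule|deadline|hand-?off)\b", re.I),
    "trust":        re.compile(r"\b(trust|rely|confiden\w*|doubt)\b", re.I),
    "conflict":     re.compile(r"\b(disagree\w*|argu\w*|tension|pushback)\b", re.I),
}

def precode(messages):
    """Return (message, matched codes) pairs for messages matching any rule."""
    flagged = []
    for msg in messages:
        codes = [code for code, pattern in CODE_RULES.items() if pattern.search(msg)]
        if codes:
            flagged.append((msg, codes))
    return flagged

messages = [
    "We keep missing the deadline because nobody owns the hand-off.",
    "I trust her judgment on the design, even when we disagree.",
    "Lunch was great today.",
]
for msg, codes in precode(messages):
    print(codes, "<-", msg)
```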