Article

Hand Collecting and Coding Versus Data-Driven Methods in Technical and Professional Communication Research

Authors: Claire Lauer, Eva Brumberger, and Aaron Beveridge

Abstract

Background: Qualitative technical communication research often produces datasets that are too large to manage effectively with hand-coded approaches. Text-mining methods, used carefully, may uncover patterns and provide results for larger datasets that are more easily reproduced and scaled.
Research questions: 1. To what degree can hand collection results be replicated by automated data collection? 2. To what degree can hand-coded results be replicated by machine coding? 3. What are the affordances and limitations of each method?
Literature review: We introduce the stages of data collection and analysis that researchers typically discuss in the literature, and show how researchers in technical communication and other fields have discussed the affordances and limitations of hand collection and coding versus automated methods throughout each stage.
Research methodology: We utilize an existing dataset that was hand-collected and hand-coded. We discuss the collection and coding processes, and demonstrate how they might be replicated with web scraping and machine coding.
Results/discussion: We found that web scraping demonstrated an obvious advantage of automated data collection: speed. Machine coding was able to provide comparable outputs to hand coding for certain types of data; for more nuanced and verbally complex data, machine coding was less useful and less reliable.
Conclusions: Our findings highlight the importance of considering the context of a particular project when weighing the affordances and limitations of hand collecting and coding over automated approaches. Ultimately, a mixed-methods approach that relies on a combination of hand coding and automated coding should prove to be the most productive for current and future kinds of technical communication work, in which close attention to the nuances of language is critical, but in which processing large amounts of data would yield significant benefits as well.
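The abstract's second research question asks how closely machine coding can replicate hand coding. One common way to quantify that kind of agreement is an intercoder reliability statistic such as Cohen's kappa; the minimal Python sketch below (not code from the article, with invented labels and category names) shows the calculation with scikit-learn.

```python
# Minimal sketch (not from the article): comparing hand-coded and machine-coded
# labels for the same items with raw agreement and Cohen's kappa.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical codes assigned to the same ten items by a human coder and by an
# automated classifier; the category names are illustrative only.
hand_codes = ["tools", "tools", "writing", "design", "writing",
              "tools", "design", "writing", "tools", "design"]
machine_codes = ["tools", "writing", "writing", "design", "writing",
                 "tools", "design", "tools", "tools", "design"]

print("Raw agreement:", accuracy_score(hand_codes, machine_codes))
print("Cohen's kappa:", cohen_kappa_score(hand_codes, machine_codes))
```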


... This aspect of forecasting, or prediction, distinguishes automated methods of text analysis from other forms of quantitative methods of empirical social research familiar from one's studies, such as regression analysis. Even where the prevailing tone suggests that automated procedures can objectively detect latent structures of meaning in texts and extract, interpret, and even predict them without human involvement, automated quantitative text analysis still cannot do without interpretive work by researchers. Some researchers now even deliberately draw on methods of qualitative social research to compensate for the coarseness of the results of automated quantitative text analyses, and in doing so reflect on how automated and qualitative methods can be integrated (Andreotta et al. 2019; Dillon et al. 2020; Fawcett et al. 2019; Lauer et al. 2018; Schneiker et al. 2019). This is due above all to the recognition that the results of automated methods are often difficult for humans to interpret, that contents and topics are implausible, or that they reveal patterns that contradict the researchers' intuition and thus point to errors in the data or in the theoretical approach. ...
Book
Full-text available
This open access book offers beginners guidance on how to select a suitable qualitative or quantitative content-analysis method depending on (a) the research interest and (b) the scope of the data. Part 1 defines analysis techniques and outlines the possibilities and limits of social-scientific content analysis. Part 2 introduces digitally supported and semi-automated techniques; Part 3 presents the automated techniques of correspondence analysis, sentiment analysis, and topic modeling. All introductions include examples and software applications (AntConc, MAXQDA, Python, RStudio, or VosViewer).
... After the fitness function is determined, note that if the offspring of highly fit individuals come to make up a large proportion of the population in the early stages of evolution, fitness values will be relatively close in the later stages. At that point the advantage of those offspring is no longer pronounced, and the evolution of the whole population nearly stalls during selection [17]. The problems of premature convergence in the early stage and slow convergence in the late stage therefore require further handling. ...
Article
Full-text available
In order to make key decisions more conveniently according to the massive data information obtained, a spatial data mining technology based on a genetic algorithm is proposed, which is combined with the K-means algorithm. The immune principle and adaptive genetic algorithm are introduced to optimize the traditional genetic algorithm, and the K-means, GK, and IGK algorithms are compared and analyzed. The results show that, in two different datasets, the objective functions obtained by the K-means algorithm are 94.05822 and 4.10373 × 10^6, respectively, while the objective functions obtained by the GK and IGK algorithms are 89.8619 and 3.9088 × 10^6, respectively. The difference between the three algorithms can also be seen in the comparison of iteration counts. The number of iterations required for K-means to reach the optimal solution is 8.21 and 8.4, respectively, the most among the three algorithms, while the number of iterations required for IGK to reach the optimal solution is 5.84 and 4.9, respectively, the least. Although the time required for K-means is short, by comparison, the IGK algorithm we use can obtain the optimal solution in relatively less time.
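As a rough illustration of the general idea of pairing a genetic algorithm with K-means (not the IGK algorithm evaluated in the cited study), the Python sketch below evolves candidate seed sets for K-means and scores them by clustering inertia; the population size, mutation rule, and data are assumptions made for the example.

```python
# Rough sketch (assumptions only, not the cited IGK algorithm): a simple genetic
# algorithm that evolves K-means seed sets, using clustering inertia as fitness.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
k, pop_size, generations = 4, 20, 15

def fitness(seed_idx):
    # Lower inertia (within-cluster sum of squares) means a better seed set.
    km = KMeans(n_clusters=k, init=X[seed_idx], n_init=1).fit(X)
    return km.inertia_

population = [rng.choice(len(X), size=k, replace=False) for _ in range(pop_size)]
for _ in range(generations):
    scored = sorted(population, key=fitness)
    survivors = scored[: pop_size // 2]                 # selection: keep the better half
    children = []
    for parent in survivors:
        child = parent.copy()
        child[rng.integers(k)] = rng.integers(len(X))   # mutation: replace one seed
        children.append(child)
    population = survivors + children

best = min(population, key=fitness)
print("Best inertia:", fitness(best))
```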
... In other words, if researchers compared the time invested in reusing manual methods to collect data against the time it would take to learn automated methods while conducting a new study, they would certainly find it difficult to dismiss web scraping and programming if the overall time investment were less. This was precisely my experience when collaborating with Claire Lauer and Eva Brumberger to write "Hand Collecting and Coding Versus Data-Driven Methods in Technical and Professional Communication" (Lauer et al., 2018). ...
Article
Full-text available
This article advocates for web scraping as an effective method to augment and enhance technical and professional communication (TPC) research practices. Web scraping is used to create consistently structured and well-sampled data sets about domains, communities, demographics, and topics of interest to TPC scholars. After providing an extended description of web scraping, the authors identify technical considerations of the method and provide practitioner narratives. They then describe an overview of project-oriented web scraping. Finally, they discuss implications for the concept as a sustainable approach to developing web scraping methods for TPC research.
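For readers unfamiliar with the method, a minimal web-scraping sketch in Python might look like the following; the URL and CSS selector are hypothetical placeholders rather than anything drawn from the article, and would need to match a real site's markup (and its terms of use).

```python
# Minimal web-scraping sketch (illustration only, not the authors' code):
# fetch a page and pull out listing titles.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"          # hypothetical target page
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
titles = [node.get_text(strip=True) for node in soup.select("h2.listing-title")]

for title in titles:
    print(title)
```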
... First, the replication of the qualitative findings using STM techniques opens the door to using STM on large corpora that would be considered unwieldy for hand coding. However, we recommend that STM techniques be accompanied by "checks" of close qualitative reading, given that STM does not take into account the syntactic context of words (Lauer, Brumberger, and Beveridge 2018). Second, STM could be used as a fruitful first step for theme exploration and inductive code development in otherwise largely hand-coded work. ...
Article
This paper has two aims. First, we apply Bourdieu’s field theory to investigate media discourse on race and immigration, demonstrating how features of news organizations influence news content. Second, we compare contemporary natural language processing (NLP) techniques with qualitative hand-coding. Extending a previous study, we compare newspaper articles from the mainstream and black press in Atlanta. We find significant differences in both word-use and topical coverage in immigration articles aimed at the two audiences. With a focus on organizational resources and values, our quantitative approach to field theory facilitates a better understanding of the journalistic landscape.
... This produced frame frequencies and the richness of the frames based on the interviewees and not the interviewer. In the results below we present frame theme frequencies (Lauer et al., 2018; Smagorinsky, 2008) and descriptive results that show the patterns and richness of the frames (Gilgun, 2014). ...
Article
Research implies that emphasis frames control preferences. Little is known, however, about how stakeholders frame water. How is water framed in decision-making and shaping rooms? Therefore, we explore the Red Sea–Dead Sea Water Conveyance framing. We compile two data sets: (1) 18 in-depth interviews, archival data; and (2) 10 documents. For data analysis, we use qualitative and quantitative approaches – coding with Atlas.ti and text analysis with Voyant Tools. In the results, we juxtapose dominant and marginal frames, frequencies, and descriptive examples. Findings reveal which actors activate and spread salient frames, concealing pressing issues and sustaining old power structures.
... In either case, the use of data-mining tools (Silwattananusarn & Tuamsuk, 2012; Lauer, Brumberger & Beveridge, 2018), such as web scraping (Saurkar, Pathare & Gode, 2018), is recommended in order to obtain sufficient data for the sample. Deficiencies or errors in the data obtained through these techniques can always be resolved with a slower, more precise manual collection of the information. ...
Article
Full-text available
Television series have become an especially attractive product among young people and a source of projected values, archetypes, and lifestyles, including with regard to the professional world. The literature reviewed lacks well-developed, specific methodologies for studying professional roles in television series, despite the importance such profiles could have in young people's choice of both their studies and their professional future. This work therefore proposes a methodology, originally developed to study professional roles in the communication field in the television series best rated on the IMDb website, but extrapolable to other research on professional roles in audiovisual content. To that end, several methodological approaches are analyzed, along with the nuances needed to refine the tools and methods required for this type of research.
... Other works within professional communication pick up parts of web search engine technologies, like crawling and scraping. Scrapers are a key component of modern search engines; they extract information from web documents, allow different entities to be annotated, and are known to work well for professional communication [67]. Communication also manifests itself in other forms, as the following example shows: ...
Preprint
Full-text available
The digitization of the world has also led to a digitization of communication processes. Traditional research methods fall short in understanding communication in digital worlds as the scope has become too large in volume, variety, and velocity to be studied using traditional approaches. In this paper, we present computational methods and their use in public and mass communication research and how those could be adapted to professional communication research. The paper is a proposal for a panel in which the panelists, each an expert in their field, will present their current work using computational methods and will discuss transferability of these methods to professional communication.
Article
In this paper, we investigate two approaches to building artificial neural network models to compare their effectiveness for accurately classifying rhetorical structures across multiple (non-binary) classes in small textual datasets. We find that the most accurate type of model can be designed by using a custom rhetorical feature list coupled with general-language word vector representations, which outperforms models with more computing-intensive architectures.
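A compact sketch of the general approach of combining a hand-built rhetorical cue list with general text features appears below; it is not the authors' neural network architecture, TF-IDF stands in for pretrained word vectors to keep the example self-contained, and the cue list, sentences, and labels are invented.

```python
# Sketch only: hand-crafted rhetorical-cue counts combined with generic text
# features for multi-class classification. TF-IDF stands in for pretrained
# word vectors; cue list, sentences, and labels are invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

CUES = ["however", "therefore", "for example", "we argue", "in contrast"]

def cue_counts(texts):
    return np.array([[t.lower().count(c) for c in CUES] for t in texts])

texts = ["We argue that the data support this claim.",
         "However, prior studies disagree.",
         "For example, one participant reported delays.",
         "Therefore, the method should be revised."]
labels = ["claim", "counter", "evidence", "conclusion"]

tfidf = TfidfVectorizer()
X = np.hstack([tfidf.fit_transform(texts).toarray(), cue_counts(texts)])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

new = ["However, the results were mixed."]
X_new = np.hstack([tfidf.transform(new).toarray(), cue_counts(new)])
print(clf.predict(X_new))
```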
Article
This article interrogates recent computational work on discovering and analyzing topoi through the use of topic modeling in the discipline of the literary digital humanities against the long history of topical research and pedagogy in rhetoric and composition. While significant work has been done in the literary digital humanities to advance the study of texts through topic modeling, this article argues that the emphasis on the textuality of topoi in computational research neglects situated rhetorical actions and the dynamics of audience interaction. In response to this deemphasis, this article proposes an algorithmic alternative to the identification and explanation of the rhetorical topoi through the integrated use of a computational rhetorical move classifier called the Faciloscope (Omizo et al., 2016) and the pattern-matching program, Docuscope (Kaufer and Ishizaki, 1998).
Article
This article provides an overview of robust social justice work already done in technical and professional communication (TPC) to introduce the transformative paradigm, an action research framework articulated by Donna Mertens. Research articles in TPC offer examples of the axiological, ontological, epistemological, and methodological tenets of the transformative paradigm. Together with a measured discussion of the paradigm, this Methodologies and Approaches article responds to calls in TPC scholarship to articulate and practice methodologies resonant with the social justice turn.
Article
Introduction: This tutorial article explicates some of the 2017 updates to the 1991 Common Rule. Implemented in January 2019, these changes impact the evolving landscape of research with human participants in the US. The introduction provides context for the article as well as framing research questions. Key concepts: These include a brief definition of research for the purposes of the article; a discussion of the foundational ethical principles of autonomy (respect for persons), beneficence, and justice; and a brief overview of the recent policy revision process and its impact on scholarship in the field. After establishing this context, readers are provided five key lessons from the policy revisions. Key lessons: The lessons for researchers working with human participants include discussion of the impact of policy changes in the Common Rule associated with broad consent, single Institutional Review Board (IRB) mandate/cooperative review, updated exemptions, modified definitions of vulnerability, and international research. These are all contextualized within a specific example for professional communication researchers. Implications for practice: Although educative efforts are underway at both the federal and institutional levels, it is important for disciplines and researchers to understand how policy changes will impact the IRB review process. Attendance to the infrastructures that guide our research design and development warrants attention.
Article
Background: Consumer-to-consumer (C2C) e-commerce involves consumers re-selling products to other consumers using online platforms. Research identifies trust as a major factor in this exchange. It concludes that seller-generated product descriptions can mitigate mistrust. Further, technical and professional communication research can reveal what content sellers tend to provide and can reveal how platform design may encourage that content. Literature review: C2C e-commerce and TPC researchers agree that mistrust can be mitigated by detailed content, and they call for platform designers to help improve platform and seller reputations. Research questions: 1. What content do sellers provide about their technical products? 2. How do the platforms’ web form designs and the associated documentation about listing a product for sale encourage certain content types? Research methodology: Four platforms were chosen using specific criteria. Product descriptions were collected once per week for six weeks, generating 1900 product descriptions. These descriptions were unitized and given reliable content categories, a methodology called quantitative content analysis. Further, the documentation and processes for posting items were explored to determine how they may encourage content types. Results/discussion: Sellers mostly provide product information and sales procedures, and they rarely give benefits and goodwill to the buyer. The platform design seems to encourage this content because of the content-entry process, the content-entry options, and the required and unrequired content entry. Conclusions: This study invites technical and professional communicators to provide more guidelines for users about the kinds of content they may include, and designers to explore the content entry process using usability and user-experience research.
Article
Qualitative research offers unique opportunities to contribute to cardiovascular outcomes research. Despite the growth in qualitative research over the last decade, outcomes investigators in cardiology still have relatively little guidance on when and how best to implement these methods in their investigations, leaving the full potential of these methods unrealized. We offer a contemporary look at qualitative methods, including publication trends of qualitative studies in cardiology journals from 1998 to 2018, novel emerging data collection and analytic methods, and current use and examples of cardiovascular outcomes research that apply qualitative methods such as user-centered design, preimplementation evaluation, implementation evaluation, effectiveness evaluation, and policy analysis.
Article
Full-text available
Background: Qualitative research methods are increasingly being used across disciplines because of their ability to help investigators understand the perspectives of participants in their own words. However, qualitative analysis is a laborious and resource-intensive process. To achieve depth, researchers are limited to smaller sample sizes when analyzing text data. One potential method to address this concern is natural language processing (NLP). Qualitative text analysis involves researchers reading data, assigning code labels, and iteratively developing findings; NLP has the potential to automate part of this process. Unfortunately, little methodological research has been done to compare automatic coding using NLP techniques and qualitative coding, which is critical to establish the viability of NLP as a useful, rigorous analysis procedure. Objective: The purpose of this study was to compare the utility of a traditional qualitative text analysis, an NLP analysis, and an augmented approach that combines qualitative and NLP methods. Methods: We conducted a 2-arm cross-over experiment to compare qualitative and NLP approaches to analyze data generated through 2 text (short message service) message survey questions, one about prescription drugs and the other about police interactions, sent to youth aged 14-24 years. We randomly assigned a question to each of the 2 experienced qualitative analysis teams for independent coding and analysis before receiving NLP results. A third team separately conducted NLP analysis of the same 2 questions. We examined the results of our analyses to compare (1) the similarity of findings derived, (2) the quality of inferences generated, and (3) the time spent in analysis. Results: The qualitative-only analysis for the drug question (n=58) yielded 4 major findings, whereas the NLP analysis yielded 3 findings that missed contextual elements. The qualitative and NLP-augmented analysis was the most comprehensive. For the police question (n=68), the qualitative-only analysis yielded 4 primary findings and the NLP-only analysis yielded 4 slightly different findings. Again, the augmented qualitative and NLP analysis was the most comprehensive and produced the highest quality inferences, increasing our depth of understanding (ie, details and frequencies). In terms of time, the NLP-only approach was quicker than the qualitative-only approach for the drug (120 vs 270 minutes) and police (40 vs 270 minutes) questions. An approach beginning with qualitative analysis followed by qualitative- or NLP-augmented analysis took longer time than that beginning with NLP for both drug (450 vs 240 minutes) and police (390 vs 220 minutes) questions. Conclusions: NLP provides both a foundation to code qualitatively more quickly and a method to validate qualitative findings. NLP methods were able to identify major themes found with traditional qualitative analysis but were not useful in identifying nuances. Traditional qualitative text analysis added important details and context.
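One simple way to approximate the NLP side of such an augmented workflow is to cluster short free-text responses and let analysts name and refine the resulting groups; the sketch below (invented examples, not the study's pipeline) uses TF-IDF vectors and k-means for that purpose.

```python
# Illustrative sketch (not the cited study's pipeline): cluster short free-text
# responses to propose candidate codes that human analysts then review.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "I worry about side effects of the medication",
    "The pills made me feel dizzy",
    "Police stopped me for no reason",
    "The officer was polite and explained everything",
    "I stopped taking the prescription because of nausea",
    "I was searched during a traffic stop",
]  # invented examples standing in for survey text messages

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(responses)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for label, text in zip(clusters, responses):
    print(label, text)  # analysts inspect each cluster and name the candidate code
```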
Article
Full-text available
Background: Automated text analysis is widely used across the social sciences, yet the application of these methods has largely proceeded independently of qualitative analysis. Objective: This paper explores the advantages of applying automated text analysis to augment traditional qualitative methods in demography. Computational text analysis does not replace close reading or subjective theorizing, but it can provide a complementary set of tools that we believe will be appealing for qualitative demographers. Methods: We apply topic modeling to text data from the Malawi Journals Project as a case study. Results: We examine three common issues that demographers face in analyzing qualitative data: large samples, the challenge of comparing qualitative data across external categories, and making data analysis transparent and readily accessible to other scholars. We discuss ways that new tools from machine learning and computer science might help qualitative scholars to address these issues. Conclusions: We believe that there is great promise in mixed-method approaches to analyzing text. New methods that allow better access to data and new ways to approach qualitative data are likely to be fertile ground for research. Contribution: No research, to our knowledge, has used automated text analysis to take an explicitly mixed-method approach to the analysis of textual data. We develop a framework that allows qualitative researchers to do so.
Article
Full-text available
This article proposes novel methods for computational rhetorical analysis to analyze the use of citations in a corpus of academic texts. Guided by rhetorical genre theory, our analysis converts texts to graph-theoretic graphs in an attempt to isolate and amplify the predicted patterns of recurring moves that are associated with stable genres of academic writing. We find that our computational method shows promise for reliably detecting and classifying citation moves similar to the results achieved by qualitative researchers coding by hand as done by Karatsolis (this issue). Further, using pairwise comparisons between advisor and advisee texts, valuable applications emerge for automated computational analysis as formative feedback in a mentoring situation.
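A far simpler, non-graph-based illustration of detecting one kind of citation move is sketched below: regular expressions distinguish integral citations (author named in the running sentence) from parenthetical ones. This is not the article's method, and the sentences are invented.

```python
# Simplified sketch, not the article's graph-based method: tag each sentence
# containing a citation as "integral" (author named in the sentence) or
# "parenthetical" (author only inside the parentheses).
import re

sentences = [
    "Smith (2019) argues that revision practices vary by discipline.",
    "Revision practices vary by discipline (Smith, 2019).",
    "These findings extend earlier work (Jones & Lee, 2021).",
]  # invented examples

integral = re.compile(r"\b[A-Z][a-z]+ \(\d{4}\)")             # e.g., Smith (2019)
parenthetical = re.compile(r"\([A-Z][a-z]+[^)]*,\s*\d{4}\)")  # e.g., (Smith, 2019)

for s in sentences:
    if integral.search(s):
        print("integral     ", s)
    elif parenthetical.search(s):
        print("parenthetical", s)
```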
Article
Full-text available
Coding is the core process in classic grounded theory methodology. It is through coding that the conceptual abstraction of data and its reintegration as theory takes place. There are two types of coding in a classic grounded theory study: substantive coding, which includes both open and selective coding procedures, and theoretical coding. In substantive coding, the researcher works with the data directly, fracturing and analysing it, initially through open coding for the emergence of a core category and related concepts and then subsequently through theoretical sampling and selective coding of data to theoretically saturate the core and related concepts. Theoretical saturation is achieved through constant comparison of incidents (indicators) in the data to elicit the properties and dimensions of each category (code). This constant comparing of incidents continues until the process yields the interchangeability of indicators, meaning that no new properties or dimensions are emerging from continued coding and comparison. At this point, the concepts have achieved theoretical saturation and the theorist shifts attention to exploring the emergent fit of potential theoretical codes that enable the conceptual integration of the core and related concepts to produce hypotheses that account for relationships between the concepts thereby explaining the latent pattern of social behaviour that forms the basis of the emergent theory. The coding of data in grounded theory occurs in conjunction with analysis through a process of conceptual memoing, capturing the theorist’s ideation of the emerging theory. Memoing occurs initially at the substantive coding level and proceeds to higher levels of conceptual abstraction as coding proceeds to theoretical saturation and the theorist begins to explore conceptual reintegration through theoretical coding.
Article
Full-text available
Qualitative sampling methods have been largely ignored in technical communication texts, making this concept difficult to teach in graduate courses on research methods. Using concepts from qualitative health research, this article provides a primer on qualitative methods as an initial effort to fill this gap in the technical communication literature. Specifically, the authors attempt to clarify some of the current confusion over qualitative sampling terminology, explain what qualitative sampling methods are and why they need to be implemented, and offer examples of how to apply commonly used qualitative sampling methods.
Article
Full-text available
This study is part of an ongoing effort to determine the core competencies of technical communicators by looking at a variety of data. This project is the final component of an attempt to determine which competencies managers seek in new employees. This report summarizes the results of both a survey and interviews of technical communication managers to determine which competencies managers seek. The survey was based on an analysis of the curricula of the 10 largest (by student enrollment) undergraduate technical communication programs. The article begins with a summary of findings and recommendations; it then describes the methodology and presents the results in more detail; finally, it discusses some implications of the data and concludes by looking at emerging trends and managerial expectations in the profession, the future roles that technical communicators may be assuming, and knowledge and skills required to fulfill these roles.
Article
Full-text available
Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics. The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.
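The package described above is for R; a roughly analogous workflow in Python, using scikit-learn's LatentDirichletAllocation on a tiny invented corpus, might look like the following sketch.

```python
# Roughly analogous Python workflow (not the R topicmodels package): fit LDA
# on a tiny invented corpus and print the top words per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "users tested the interface prototype and reported navigation problems",
    "the survey measured salary experience and education for each posting",
    "participants completed usability testing of the interface",
    "job postings listed salary requirements and years of experience",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```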
Article
Full-text available
Text mining has become an important research area. Text mining is the discovery by computer of new, previously unknown information, accomplished by automatically extracting information from different written resources. In this paper, a survey of text mining techniques and applications is presented.
Article
Full-text available
Data analysis is the most difficult and most crucial aspect of qualitative research. Coding is one of the significant steps taken during analysis to organize and make sense of textual data. This paper examines the use of manual and electronic methods to code data in two rather different projects in which the data were collected mainly by in-depth interviewing. The author looks at both methods in light of her own experience and concludes that the choice will depend on the size of the project, the funds and time available, and the inclination and expertise of the researcher.
Article
Big data is one of the most hyped buzzwords in both academia and industry. This article makes an early contribution to research on big data by situating data theoretically as a historical object and arguing that much of the discourse about the supposed transparency and objectivity of big data ignores the crucial roles of interpretation and communication. To set forth that analysis, this article engages with recent discussion of big data and “smart” cities to show the communicative practices operating behind the scenes of large data projects and relate those practices to the profession of technical communication.
Article
Purpose: This article extends earlier studies examining the core competencies of technical communicators. Our project updates these previous perspectives by analyzing the broad range of information products, technologies, professional competencies, and personal traits requested by industry job advertisements. The analysis seeks to answer three main questions: • What genre/information product knowledge is important for success in the technical communication job market? • What technology skills are essential for success in the technical communication job market? • What professional competencies and personal characteristics are essential for success in the technical communication job market? Method: We analyzed almost 1,000 U.S. technical communication job postings from Monster.com. We mined the postings for position title, job type, education level, experience level, location, salary, and industry sector. We subsequently conducted a content analysis of the job descriptions, using open coding to identify information products, technologies, professional competencies, and personal characteristics. Results: The job postings exhibited enormous variety in position titles but fell into five main categories: Content Developer/Manager, Grant/Proposal Writer, Medical Writer, Social Media Writer, and Technical Writer/Editor. Information products and technology skills varied substantially with job type. The job postings showed some differentiation in professional competencies across job categories, but they also revealed competencies that were common to all categories. Conclusion: Technical communication positions now encompass a wide range of audiences, content, contexts, and media. The jobs data illustrate the breadth of products and competencies that drive the field. © 2015, Society for Technical Communication. All rights reserved.
Article
Full text can be found here: http://www.jmir.org/2015/8/e212/ The prevalence and value of patient-generated health text are increasing, but processing such text remains problematic. Although existing biomedical natural language processing (NLP) tools are appealing, most were developed to process clinician- or researcher-generated text, such as clinical notes or journal articles. In addition to being constructed for different types of text, other challenges of using existing NLP include constantly changing technologies, source vocabularies, and characteristics of text. These continuously evolving challenges warrant the need for applying low-cost systematic assessment. However, the primarily accepted evaluation method in NLP, manual annotation, requires tremendous effort and time. The primary objective of this study is to explore an alternative approach-using low-cost, automated methods to detect failures (eg, incorrect boundaries, missed terms, mismapped concepts) when processing patient-generated text with existing biomedical NLP tools. We first characterize common failures that NLP tools can make in processing online community text. We then demonstrate the feasibility of our automated approach in detecting these common failures using one of the most popular biomedical NLP tools, MetaMap. Using 9657 posts from an online cancer community, we explored our automated failure detection approach in two steps: (1) to characterize the failure types, we first manually reviewed MetaMap's commonly occurring failures, grouped the inaccurate mappings into failure types, and then identified causes of the failures through iterative rounds of manual review using open coding, and (2) to automatically detect these failure types, we then explored combinations of existing NLP techniques and dictionary-based matching for each failure cause. Finally, we manually evaluated the automatically detected failures. From our manual review, we characterized three types of failure: (1) boundary failures, (2) missed term failures, and (3) word ambiguity failures. Within these three failure types, we discovered 12 causes of inaccurate mappings of concepts. We used automated methods to detect almost half of 383,572 MetaMap's mappings as problematic. Word sense ambiguity failure was the most widely occurring, comprising 82.22% of failures. Boundary failure was the second most frequent, amounting to 15.90% of failures, while missed term failures were the least common, making up 1.88% of failures. The automated failure detection achieved precision, recall, accuracy, and F1 score of 83.00%, 92.57%, 88.17%, and 87.52%, respectively. We illustrate the challenges of processing patient-generated online health community text and characterize failures of NLP tools on this patient-generated health text, demonstrating the feasibility of our low-cost approach to automatically detect those failures. Our approach shows the potential for scalable and effective solutions to automatically assess the constantly evolving NLP tools and source vocabularies to process patient-generated text.
Article
The escalating obesity rate in the USA has made obesity prevention a top public health priority. Recent interventions have tapped into the social media (SM) landscape. To leverage SM in obesity prevention, we must understand user-generated discourse surrounding the topic. This study was conducted to describe SM interactions about weight through a mixed methods analysis. Data were collected across 60 days through SM monitoring services, yielding 2.2 million posts. Data were cleaned and coded through Natural Language Processing (NLP) techniques, yielding popular themes and the most retweeted content. Qualitative analyses of selected posts add insight into the nature of the public dialogue and motivations for participation. Twitter represented the most common channel. Twitter and Facebook were dominated by derogatory and misogynist sentiment, pointing to weight stigmatization, whereas blogs and forums contained more nuanced comments. Other themes included humor, education, and positive sentiment countering weight-based stereotypes. This study documented weight-related attitudes and perceptions. This knowledge will inform public health/obesity prevention practice.
Article
In his 2005 book Ambient Findability, Peter Morville argued that what we find changes who we become. In 2012 and beyond---in an information environment of filter bubbles, contextual advertising, and friend-of-friend chains that push ordinary folks well beyond the Dunbar number---perhaps Morville is in need of some updating: what finds us changes who we become.
Article
One significant concern I have for the future of technical communication, a concern I often share with my students, involves the impact of "big data." Though the term is frequently used with a sneer, or at least a slightly unsettled laugh, the methods for retrieving information from large data sets are improving as I write this. One significant question the field faces is: "what new relationships will develop and what new work will technical communicators be responsible for in emergent big data projects, in coming years?"
Conference Paper
Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems. In this tutorial, I will review the state-of-the-art in probabilistic topic models. I will describe the three components of topic modeling: (1) Topic modeling assumptions (2) Algorithms for computing with topic models (3) Applications of topic models In (1), I will describe latent Dirichlet allocation (LDA), which is one of the simplest topic models, and then describe a variety of ways that we can build on it. These include dynamic topic models, correlated topic models, supervised topic models, author-topic models, bursty topic models, Bayesian nonparametric topic models, and others. I will also discuss some of the fundamental statistical ideas that are used in building topic models, such as distributions on the simplex, hierarchical Bayesian modeling, and models of mixed-membership. In (2), I will review how we compute with topic models. I will describe approximate posterior inference for directed graphical models using both sampling and variational inference, and I will discuss the practical issues and pitfalls in developing these algorithms for topic models. Finally, I will describe some of our most recent work on building algorithms that can scale to millions of documents and documents arriving in a stream. In (3), I will discuss applications of topic models. These include applications to images, music, social networks, and other data in which we hope to uncover hidden patterns. I will describe some of our recent work on adapting topic modeling algorithms to collaborative filtering, legislative modeling, and bibliometrics without citations. Finally, I will discuss some future directions and open research problems in topic models.
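One of the applications mentioned above, document search, can be sketched by representing documents as topic distributions and ranking them by similarity to a query; the short Python example below (invented corpus, not the tutorial's code) illustrates the idea with scikit-learn.

```python
# Sketch of one application mentioned above, document search in topic space:
# represent each document by its topic distribution and rank by similarity
# to a query. Corpus and query are invented; not the tutorial's code.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "genome sequencing identified new genetic variants",
    "the senate debated the new immigration bill",
    "researchers mapped gene expression across tissues",
    "voters responded to the proposed legislation",
]
vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
doc_topics = lda.transform(dtm)                      # documents in topic space

query = vec.transform(["new findings on gene variants"])
query_topics = lda.transform(query)
scores = cosine_similarity(query_topics, doc_topics)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(round(score, 3), doc)
```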
Article
Agreement about research questions can strengthen disciplinary identity and give direction to a field that is still maturing. The central research question this article poses foregrounds texts, broadly defined as verbal, visual, and multimedia, and the power of texts to mediate knowledge, values, and action in a variety of contexts. Related questions concern disciplinarity, pedagogy, practice, and social change. These questions overlap and inform each other. Any single study does not necessarily fall exclusively into one area. A mapping of a field's research questions is a political act, emphasizing some questions and marginalizing or excluding others. The emphases may change over time. This mapping illustrates reasons for the tensions between the academic and practitioner areas of the field. It also points out their shared research interests and opportunities for future research.
Article
Summary
• Provides a framework of experiences and skills employers call for in job postings
• Shows that potential employers are seeking very technical or domain-specific knowledge from technical writers
• Shows that specific technology tool skills are less important to employers than more basic technical writing skills
Article
Social researchers often apply qualitative research methods to study groups and their communications artifacts. The use of computer-mediated communications has dramatically increased the volume of text available, but coding such text requires considerable manual effort. We discuss how systems that process text in human languages (i.e. natural language processing [NLP]) might partially automate content analysis by extracting theoretical evidence. We present a case study of the use of NLP for qualitative analysis in which the NLP rules showed good performance on a number of codes. With the current level of performance, use of an NLP system could reduce the amount of text to be examined by a human coder by an order of magnitude or more, potentially increasing the speed of coding by a comparable degree. The paper is significant as it is one of the first to demonstrate the use of high-level NLP techniques for qualitative data analysis.
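A toy version of such rule-based extraction is sketched below: simple keyword rules assign candidate codes to messages so that human coders can concentrate on what the rules miss. The rules, codes, and messages are invented and are not the NLP system evaluated in the paper.

```python
# Toy sketch of rule-based coding (not the cited NLP system): assign a code to
# each message when a simple keyword rule fires, leaving the rest for human coders.
RULES = {
    "coordination": ["schedule", "meeting", "deadline"],
    "conflict": ["disagree", "blocked", "frustrated"],
}

messages = [
    "Can we move the meeting to Friday before the deadline?",
    "I strongly disagree with this design decision.",
    "Here are the release notes for version 2.1.",
]  # invented examples

for msg in messages:
    text = msg.lower()
    codes = [code for code, kws in RULES.items() if any(k in text for k in kws)]
    print(codes or ["uncoded"], msg)
```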