Article

Recent Advances in Technologies for Resource Creation and Mobilization in Language Documentation

Abstract

Language documentation as a subfield of linguistics has arisen over the past roughly two and a half decades more or less simultaneously with the widespread availability of inexpensive hardware and software for creating, storing, and sharing digital objects. Thus, in some ways the history of advancements within the discipline is also a history of how technological tools have been developed, tested, adopted, and eventually abandoned as newer technologies appear. In this article we examine some recent technologies used both for creating documentary resources, usually considered to include recorded language events in a variety of genres and settings and enough annotation to make them decipherable, and for then mobilizing those resources so that they can be used and shared in language learning, reclamation, revitalization, and analysis. Expected final online publication date for the Annual Review of Linguistics, Volume 9 is January 2023. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.


... Voicebanks and files are fully cross-compatible between the different applications, and the user interfaces are also largely the same. The Cherokee UTAUloid and examples shown here were created in OpenUTAU, which is recommended for several reasons: it is in active development by an engaged, helpful, and welcoming community; its interface is fully localized into 17 languages; and its open-source license and design allow for all the benefits of using open tools in Indigenous language revitalization, including accessibility, longevity, and community customization (Berez-Kroeker et al., 2023; Brinklow et al., 2019; Salazar et al., 2021). ...
Article
Full-text available
Music plays many important roles in language revitalization, from attracting learners and fostering speech communities to supporting language learning. These effects, however, are largely independent from the skills which linguists bring to language revitalization. This study introduces one concrete way in which applied linguistics can directly support musical language revitalization with UTAUloids – speech-and-music software synthesizers – illustrated through the creation of a Cherokee UTAUloid as part of ancestral language reclamation by a learner-linguist Cherokee Nation citizen. Through their focus on “massive collaboration,” low-resource music production, and youth involvement, UTAUloids are uniquely situated to serve as instruments for language revitalization. Even the act of creating an UTAUloid itself allows speakers and learners who may not consider themselves “musical” to contribute to musical language revitalization, and this study provides a step-by-step methodology to make creating an UTAUloid as accessible as possible for anyone interested in incorporating music into their own language revitalization practice.
Chapter
Full-text available
A guide to principles and methods for the management, archiving, sharing, and citing of linguistic research data, especially digital data. “Doing language science” depends on collecting, transcribing, annotating, analyzing, storing, and sharing linguistic research data. This volume offers a guide to linguistic data management, engaging with current trends toward the transformation of linguistics into a more data-driven and reproducible scientific endeavor. It offers both principles and methods, presenting the conceptual foundations of linguistic data management and a series of case studies, each of which demonstrates a concrete application of abstract principles in a current practice. In part 1, contributors bring together knowledge from information science, archiving, and data stewardship relevant to linguistic data management. Topics covered include implementation principles, archiving data, finding and using datasets, and the valuation of time and effort involved in data management. Part 2 presents snapshots of practices across various subfields, with each chapter presenting a unique data management project with generalizable guidance for researchers. The Open Handbook of Linguistic Data Management is an essential addition to the toolkit of every linguist, guiding researchers toward making their data FAIR: Findable, Accessible, Interoperable, and Reusable.
Chapter
Full-text available
This chapter describes how data are conventionally used in conversation analysis (CA; for overviews, see Sidnell & Stivers 2013; Clift 2016). We describe where the data come from, how they are collected and organized for analysis, and how they are distributed. Over the course of this description, we make some recommendations regarding best practices and potential improvements.
Article
Full-text available
Studies of human spatial behavior increasingly rely on a combination of audiovisual and geospatial recordings. So far, however, few analytical environments have offered opportunities for integrated and expedient annotation and analysis of the two. Here we report the first study aimed at integrating geospatial data in an environment developed for time-aligned annotation of audiovisual media. By calibrating the audiovisual and geospatial signals on the timeline and inserting the geo data as a tier in the annotation tool ELAN, we innovatively generate an environment in which time-aligned annotations of audiovisually observed behavior can be linked and explored in relation to the corresponding geographical coordinates. We illustrate the technique with cultural and linguistic behavior recorded on the move among indigenous communities in Southeast Asia. Our methodological principle is of potential interest to any study or discipline concerned with linking the location and properties of observable behavior.
Article
Full-text available
Corpus phonetics is enabling the comprehensive analysis of large digital speech collections. In this paper, we develop a corpus phonetics workflow that is flexible enough to be easily applied to under-documented languages. To test the capabilities of this workflow, we choose a challenging vowel reduction and vowel harmony problem. In Kera (Chadic), it has been shown (Pearce, 2012) that not only is phonetic reduction linked to the phonetic duration of the vowel, but also that reduction is blocked in vowel harmony domains. We are able to replicate previously published experiments by Pearce that were originally completed using manual measurements. We expect that our corpus phonetics workflow will be of value to phonologists working on other languages.
Book
Full-text available
Understanding Linguistic Fieldwork offers a diverse and practical introduction to research methods used in field linguistics. Designed to teach students how to collect quality linguistic data in an ethical and responsible manner, the key features include:
* A focus on fieldwork in countries and continents which have undergone colonial expansion, including Australia, the United States of America, Canada, South America and Africa;
* A description of specialist methods used to conduct research on phonological, grammatical and lexical description, but also including methods for research on gesture and sign, language acquisition, language contact and the verbal arts;
* Examples of resources that have resulted from collaborations with language communities which both advance linguistic understanding and support language revitalisation work;
* Annotated guidance on sources for further reading.
This book is essential reading for students studying modules relating to linguistic fieldwork or those looking to embark upon field research.
Article
Full-text available
This paper is a position statement on reproducible research in linguistics, including data citation and attribution, that represents the collective views of some 41 colleagues. Reproducibility can play a key role in increasing verification and accountability in linguistic research, and is a hallmark of social science research that is currently under-represented in our field. We believe that we need to take time as a discipline to clearly articulate our expectations for how linguistic data are managed, cited, and maintained for long-term access.
Article
Full-text available
There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.
Article
Full-text available
While efforts to document endangered languages have steadily increased, the phonetic analysis of endangered language data remains a challenge. The transcription of large documentation corpora is, by itself, a tremendous feat. Yet, the process of segmentation remains a bottleneck for research with data of this kind. This paper examines whether a speech processing tool, forced alignment, can facilitate the segmentation task for small data sets, even when the target language differs from the training language. The authors also examined whether a phone set with contextualization outperforms a more general one. The accuracy of two forced aligners trained on English (hmalign and p2fa) was assessed using corpus data from Yoloxóchitl Mixtec. Overall, agreement performance was relatively good, with accuracy at 70.9% within 30 ms for hmalign and 65.7% within 30 ms for p2fa. Segmental and tonal categories influenced accuracy as well. For instance, additional stop allophones in hmalign's phone set aided alignment accuracy. Agreement differences between aligners also corresponded closely with the types of data on which the aligners were trained. Overall, using existing alignment systems was found to have potential for making phonetic analysis of small corpora more efficient, with more allophonic phone sets providing better agreement than general ones.
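The agreement figures in the abstract above (for example, 70.9% within 30 ms) are tolerance-based boundary scores. A minimal sketch of how such a score could be computed follows, assuming paired lists of manual and automatic boundary times in seconds. The function name `agreement_within` and the sample timestamps are hypothetical illustrations, not taken from the paper or from either aligner.

```python
def agreement_within(manual, automatic, tolerance=0.030):
    """Return the fraction of boundary pairs whose absolute time
    difference is at most `tolerance` seconds (default 30 ms)."""
    if len(manual) != len(automatic):
        raise ValueError("boundary lists must be paired one-to-one")
    if not manual:
        raise ValueError("at least one boundary pair is required")
    # Count pairs where the aligner lands within the tolerance window.
    hits = sum(1 for m, a in zip(manual, automatic) if abs(m - a) <= tolerance)
    return hits / len(manual)


# Hypothetical boundary times (seconds) for four segment boundaries.
manual = [0.100, 0.250, 0.400, 0.620]
automatic = [0.110, 0.300, 0.395, 0.640]

score = agreement_within(manual, automatic)
print(f"{score:.1%} of boundaries agree within 30 ms")
```

In this invented sample, three of the four boundaries fall inside the 30 ms window. A real evaluation would read the boundaries from the annotation files produced by the human annotator and the aligner.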
Article
Full-text available
New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World-Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Article
Full-text available
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
Chapter
This volume brings together novel, original studies on prosody and prosodic interfaces. It consists of fifteen chapters, of which some look at word prosody and phrase prosody in individual languages, some examine the interactions between lexical tones and intonation, and others analyze the syntax-prosody interface. Despite much recent attention paid to prosody, there is yet a significant number of languages and dialects that remain largely undocumented or understudied. Many chapters in this volume help to fill this empirical gap in prosodic research by presenting new data, based on original fieldwork and experiments. Moreover, many chapters address important questions pertaining to the interactions between lexical and postlexical tones with in-depth investigations of both lexical prosody and postlexical phonology. Furthermore, other chapters tackle the question of how prosodic structure—either lexical or postlexical—interacts with syntactic structure, thereby contributing to our understanding of the interaction between multiple components of the grammar, embedded in a thorough understanding of current linguistic theories. The volume as a whole addresses many difficult issues and illuminates the question of how prosody is structured in language and functions in human communication.
Article
This paper reports on progress integrating the speech recognition toolkit ESPnet into Elpis, a web front-end originally designed to provide access to the Kaldi automatic speech recognition toolkit. The goal of this work is to make end-to-end speech recognition models available to language workers via a user-friendly graphical interface. Encouraging results are reported on (i) development of an ESPnet recipe for use in Elpis, with preliminary results on data sets previously used for training acoustic models with the Persephone toolkit along with a new data set that had not previously been used in speech recognition, and (ii) incorporating ESPnet into Elpis along with UI enhancements and a CUDA-supported Dockerfile.
Article
Purpose: The purpose of this paper is to demonstrate the possibility for the galleries, libraries, archives and museums sector to employ playful, immersive discovery interfaces for their collections and raise awareness of some of the considerations that go into the decision to use such technology and the creation of the interfaces.
Design/methodology/approach: This is a case study approach using the methodology of research through design. The paper introduces two examples of immersive interfaces to archival data created by the authors, using these as a springboard for discussing the different kinds of embodied experiences that users have with different kinds of immersion, for example, the exploration of the archive on a flat screen, a data “cave” or arena, or virtual reality.
Findings: The role of such interfaces in communicating with the audience of an archive is considered, for example, in allowing users to detect structure in data, particularly in understanding the role of geographic or other spatial elements in a collection, and in shifting the locus of knowledge production from individual to community. It is argued that these different experiences draw on different metaphors in terms of users’ prior experience with more well-known technologies, for example, “a performance” vs “a tool” vs “a background to a conversation”.
Originality/value: The two example interfaces discussed here are original creations by the authors of this paper. They are the first uses of mixed reality for interfacing with the archives in question. One is the first mixed reality interface to an audio archive. The discussion has implications for the future of interfaces to galleries, archives, libraries and museums more generally.
Article
This discussion note reviews responses of the linguistics profession to the grave issues of language endangerment identified a quarter of a century ago in the journal Language by Krauss, Hale, England, Craig, and others (Hale et al. 1992). Two and a half decades of worldwide research not only have given us a much more accurate picture of the number, phylogeny, and typological variety of the world’s languages, but they have also seen the development of a wide range of new approaches, conceptual and technological, to the problem of documenting them. We review these approaches and the manifold discoveries they have unearthed about the enormous variety of linguistic structures. The reach of our knowledge has increased by about 15% of the world’s languages, especially in terms of digitally archived material, with about 500 languages now reasonably documented thanks to such major programs as DoBeS, ELDP, and DEL. But linguists are still falling behind in the race to document the planet’s rapidly dwindling linguistic diversity, with around 35–42% of the world’s languages still substantially undocumented, and in certain countries (such as the US) the call by Krauss (1992) for a significant professional realignment toward language documentation has only been heeded in a few institutions. Apart from the need for an intensified documentarist push in the face of accelerating language loss, we argue that existing language documentation efforts need to do much more to focus on crosslinguistically comparable data sets, sociolinguistic context, semantics, and interpretation of text material, and on methods for bridging the ‘transcription bottleneck’, which is creating a huge gap between the amount we can record and the amount in our transcribed corpora. © 2018, Frank Seifart, Nicholas Evans, Harald Hammarström, & Stephen C. Levinson.
Article
The Oxford Handbook of Linguistic Fieldwork offers a guide to linguistic fieldwork reflecting the collaborative nature of the field across the subfields of linguistics and disciplines such as astronomy, anthropology, biology, musicology, and ethnography. Experienced scholars and fieldworkers explain the methods and approaches needed to understand a language in its full cultural context and to document it accessibly and enduringly. Articles consider the application of new technological approaches to recording and documentation, but never lose sight of the crucial relationship between subject and researcher. The book is timely: an increased awareness of dying languages and vanishing dialects has stimulated the impetus for recording them as well as the funds required to do so. © editorial matter and organization Nicholas Thieberger 2012.
Article
Researchers often wish to make knowledge publicly available, especially through publication, but may not wish to share their raw data. Human Subjects offices (IRBs) sometimes wish to keep all information private or even to have raw data destroyed. Funding agencies may hope that the data whose collection they finance will be made publicly available to increase its impact. These motivations often conflict. This paper discusses how current Human Subjects regulations impact sharing of corpus data among researchers, how funding agencies influence this, and how researchers react to these forces. The paper also discusses potential future changes to the current outcomes, through changes to Human Subjects Protection regulations and changes to funding agency requirements on data sharing.
Article
Much of the work that is labeled 'descriptive' within linguistics comprises two activities, i.e. the collection of primary data and a (low-level) analysis of these data. These are indeed two separate activities as shown by the fact that the methods employed in each activity differ substantially. To date, the field concerned with the first activity — called 'documentary linguistics' here — has received very little attention from linguists. It is proposed that documentary linguistics be conceived of as a fairly independent field of linguistic inquiry and practice which is no longer linked exclusively to the descriptive framework. A format for language documentations (in contrast to language descriptions) is presented and various practical and theoretical issues connected with this format are discussed. These include the rights of the individuals and communities contributing to a language documentation, the parameters for the selection of the data to be included in a documentation, and the assessment of the quality of such data.
Article
Following Hill’s (2002) examination of the dominant rhetorical strategies used to discuss language revitalization projects, this paper continues this investigation, utilizing examples from sustained linguistic fieldwork in an indigenous Pueblo community in New Mexico. I detail the context surrounding the Pueblo’s decision to employ written indigenous language materials as part of a community language program including the new ways of limiting access to cultural information that have been developed in response to the controversial status of writing in this community. I show that the application of the concept that Hill identifies as “universal ownership” has the potential to lead to serious ethical problems, detailing the creative approaches to textual circulation within one community and offering alternatives for scholars facing ethical issues involving publication.
Article
This paper addresses the concept of informed consent when working with remote, non-literate groups. By examining both the legal and moral obligations of informed consent, it will be argued that “erring on the side of caution”, for instance by not publishing on the Internet because the consultants/community do not have exposure to such things, is just as paternalistic as assuming that they would consent if they understood. It is further argued that the researcher has an obligation to explain the research to the consultants/community as fully as possible and to engage in an ongoing negotiation of consent, but that the researcher must respect the autonomy of the consultant/community decision, even if the consent was not fully “informed”.
Article
Recently, there has been extensive work in linguistics to develop recommendations for digitizing legacy language materials. However, relatively little work has been done on the social and legal concerns regarding rights and access to these materials, most of which were created when concerns surrounding intellectual property were less sensitive than they are today. We discuss four issues related to establishing rights and access to legacy language materials: (i) determining what “community” they should be associated with, (ii) establishing rights retroactively, (iii) establishing rights and access to “orphan” works, and (iv) assessing the sensitivities associated with different genres.
Article
An increasingly common theme in publications on ethical review in the social sciences is the burden that regulation places on researchers. But empirical findings of the extent of the problem are difficult to find, and much of the criticism of ethical review boards rests on anecdotal and individual reports. Within linguistics there has also been a greater focus on ethics, but discussion has focused on field research, and ethical regulation has not been systematically surveyed. In this report I present and discuss the results of an anonymous survey of linguistic fieldworkers and their responses to human subjects review. These results provide a snapshot of fieldwork regulation and its effect on field practices.
Article
During the last decade, many linguists and linguistic anthropologists have participated in a campaign of advocacy on behalf of endangered languages. Robins and Uhlenbeck (1992), Hale et al. (1992), Nettle and Romaine (2000), and Crystal (2000) are among many examples of a literature aimed at a wide audience that includes scholars, students, and community members. The goal of this campaign is to recruit scholars to efforts at documentation and development; to increase general public understanding of language endangerment; to attract funds in support of efforts by communities to reclaim, maintain, and develop their heritage languages; and to assist communities in refining these efforts. In some ways, the campaign has been successful. The most important media discuss the issue from time to time, small grant funds to support community efforts have been developed, and scholarly interest in documenting endangered languages has certainly increased. The present article, however, is not about these successes. Instead, it critiques ways in which linguists and anthropologists may unwittingly undermine their own vigorous advocacy of endangered languages by a failure to think carefully about the multiple audiences who may hear and read advocacy rhetoric. Community language workers, speakers, and other members of local groups are both participants and overhearers in a global conversation about language endangerment in which the voices of academics and policymakers are especially prominent. How might this global conversation resonate for members of communities that are custodians of endangered languages—communities that are themselves a diverse audience? Do they find it empowering and encouraging, unintelligible and alienating, or something in between? Can they borrow from it to conduct their own advocacy, or do they prefer to use quite different discourses? What is needed is fieldwork that explores these questions specifically.
In this article, I develop some questions that such work might take up by examining some of the discursive practices of the global
Journal of Linguistic Anthropology 12(2):119-133. Copyright ©2002, American Anthropological Association.