Steven Bird

Charles Darwin University | CDU · College of Indigenous Futures, Arts and Society

PhD (Edin)

About

176
Publications
183,651
Reads
11,290
Citations
Introduction
Steven Bird is conducting social and technological experiments in the future evolution of the world's languages. Together with his students and colleagues, he is developing scalable methods for preserving disappearing words and worldviews for future generations of speakers and scholars. He is collaborating with speech communities in diasporas and ancestral homelands to design new approaches to language maintenance and revitalisation. Steven studied computer science at the University of Melbourne before completing a PhD in computational linguistics at the University of Edinburgh. He has conducted fieldwork on endangered languages in West Africa, South America, Central Asia, Melanesia, and Australia.
Additional affiliations
July 2017 - present
Charles Darwin University
Position
  • Professor
October 2002 - June 2017
University of Melbourne
Position
  • Professor (Associate)
July 1998 - September 2002
University of Pennsylvania
Position
  • Managing Director

Publications

Conference Paper
The Open Language Archives Community (OLAC) provides a comprehensive infrastructure that has allowed our community to index and discover language resources over the past 20 years. However, OLAC infrastructure has fallen behind as the digital libraries community has continued to evolve. New investment is required in order to move OLAC into the digit...
Preprint
Full-text available
Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches...
Conference Paper
Full-text available
Kunwinjku is a polysynthetic language spoken in northern Australia. Members of the community have expressed interest in co-developing language applications which could assist in the production of written language resources for education and language learning. Modelling Kunwinjku morphology is a step towards accomplishing these goals. We discuss som...
Chapter
Full-text available
Speakers of the world's endangered languages are rapidly gaining access to broadband internet on mobile devices. Meanwhile social mobile technologies continue to transform the way people work together. I believe that conditions are ripe for the development of a new generation of software for endangered languages. This software will enable new ways...
Conference Paper
Full-text available
We describe a reusable Web component for capturing talk about images. A speaker is prompted with a series of images and talks about each one while adding gestures. Others can watch the audiovisual slideshow, and navigate forwards and backwards by swiping on the images. The component supports phrase-aligned respeaking, translation, and commentary. T...
Conference Paper
Full-text available
A new computer science curriculum has been developed for the Victorian Certificate of Education. It gives students direct entry into second year University computer science. The curriculum focuses on data structures and algorithms, with an emphasis on the graph abstract data type and graph algorithms. We taught a pilot course during 2014 involving...
Article
Full-text available
Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling transfer of NLP tools. However, previous attempts had expensive resource requirements, difficulty incorporating monolingual data or were unable to handle polysemy. We address these drawbacks in our method which takes advantage of a high...
Conference Paper
Full-text available
Cross-lingual transfer has been shown to produce good results for dependency parsing of resource-poor languages. Although this avoids the need for a target language treebank, most approaches have still used large parallel corpora. However, parallel data is scarce for low-resource languages, and we report a new method that does not need parallel dat...
Conference Paper
Full-text available
Proliferating smartphones and mobile software offer linguists a scalable, networked recording device. This paper describes Aikuma, a mobile app that is designed to put the key language documentation tasks of recording, respeaking, and translating in the hands of a speech community. After motivating the approach we describe the system and briefly re...
Conference Paper
Full-text available
We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying "good" training sentences from the parallel corpus and applying...
Conference Paper
Full-text available
Existing methods for collecting texts from endangered languages are not creating the quantity of data that is needed for corpus studies and natural language processing tasks. This is because the process of transcribing and translating from audio recordings is too onerous. A more effective method, we argue, is to involve local speakers in the field...
Article
Full-text available
With hundreds of endangered and under-documented languages, Papua New Guinea presents an enormous challenge to the documentary linguistics community. This article reports on a workshop held at the University of Goroka in May and June of 2012. The workshop aimed to collect written texts and their translations for several languages, while building lo...
Article
Full-text available
Databases of hierarchically annotated text occupy a central place in linguistic research and language technology development. We describe a new approach to tree query which we call "Query by Annotation". Users express a query by annotating a tree, and the annotation is compiled into an expression in a path language. The result trees are overl...
Article
This article presents the fundamentals of descriptive phonology and gives an overview of computational phonology. Phonology is the systematic study of sounds used in language, and their composition into syllables, words, and phrases. It introduces some of the key concepts of phonology by simple examples involving real data and gives a brief discuss...
Conference Paper
Full-text available
We describe the design of a comparable corpus that spans all of the world's languages and facilitates large-scale cross-linguistic processing. This Universal Corpus consists of text collections aligned at the document and sentence level, multilingual wordlists, and a small set of morphological, lexical, and syntactic annotations. The design encompa...
Conference Paper
Full-text available
This paper explores approaches to sentiment classification of U.S. Congressional floor-debate transcripts. Collective classification techniques are used to take advantage of the informal citation structure present in the debates. We use a range of methods based on local and global formulations and introduce novel approaches for incorporating the o...
Conference Paper
Full-text available
We present a grand challenge to build a corpus that will include all of the world's languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that t...
Conference Paper
Full-text available
Can the speakers of small languages, which may be remote, unwritten, and endangered, be trained to create an archival record of their oral literature, with only limited external support? This paper describes the model of “Basic Oral Language Documentation”, as adapted for use in remote village locations, far from digital archives but close to endan...
Article
Full-text available
Large databases of linguistic annotations are used for testing linguistic hypotheses and for training language processing models. These linguistic annotations are often syntactic or prosodic in nature, and have a hierarchical structure. Query languages are used to select particular structures of interest, or to project out large slices of a corpus...
Conference Paper
Full-text available
A variety of query systems have been developed for interrogating parsed corpora, or treebanks. With the arrival of efficient, wide-coverage parsers, it is feasible to create very large databases of trees. However, existing approaches that use in-memory search, or relational or XML database technologies, do not scale up. We describe a metho...
Conference Paper
Full-text available
Under an ARC Linkage Infrastructure, Equipment and Facilities (LIEF) grant, speech science and technology experts from across Australia have joined forces to organise the recording of audio-visual (AV) speech data from representative speakers of Australian English in all capital cities and some regional centres. The Big Australian Speech Corpus (th...
Article
Full-text available
This article describes a framework for incorporating referential semantic information from a world model or ontology directly into a probabilistic language model of the sort commonly used in speech recognition, where it can be probabilistically weighted ...
Conference Paper
Full-text available
Contemporary speech science is driven by the availability of large, diverse speech corpora. Such infrastructure underpins research and technological advances in various practical, socially beneficial and economically fruitful endeavours, from ASR to hearing prostheses. Unfortunately, speech corpora are not easy to come by because they are both expe...
Article
Full-text available
Large auditory-visual (AV) speech corpora are the grist of modern research in speech science, but no such corpus exists for Australian English. This is unfortunate, for speech science is the brains behind speech technology and applications such as text-to-speech (TTS) synthesis, automatic speech recognition (ASR), speaker recognition and forensic i...
Article
Full-text available
Linguistic forms are inherently multi-dimensional. They exhibit a variety of phonological, orthographic, morphosyntactic, semantic and pragmatic properties. Accordingly, linguistic analysis involves multi-dimensional exploration, a process in which the same collection of forms is laid out in many ways until clear patterns emerge. Equally, language...
Article
The CSIRO Division of Building Research (DBR) has built a number of text-based and graphics-based design code expert systems in PROLOG. Each of these systems involves several thousand lines of PROLOG and stretches to the limit the capacity of the IBM ATs on which they run. We have found PROLOG to be a very powerful language for expressing the...
Article
Full-text available
The Natural Language Toolkit (NLTK) is widely used for teaching natural language processing to students majoring in linguistics or computer science. This paper describes the design of NLTK, and reports on how it has been used effectively in classes that involve different mixes of linguistics and computer science students. We focus on three key issu...
Conference Paper
Full-text available
Over the past decade, a variety of expressive linguistic query languages have been developed. The most scalable of these have been implemented on top of an existing database engine. However, with the arrival of efficient, wide-coverage parsers, it is feasible to parse text on a scale that is several orders of magnitude larger. We show that the...
Conference Paper
Full-text available
The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standar...
Article
Full-text available
This paper describes work the Open Language Archives Community (OLAC) is doing to contribute to a global infrastructure for the sustainability of language resources. After offering a definition of language resource, it addresses the issue of what makes language resources sustainable by defining six necessary and sufficient conditions for their sust...
Article
Full-text available
Discourse in and about computational linguistics depends on a shared body of knowledge. However, little content is shared across the introductory courses in this field. Instead, they typically cover a diverse assortment of topics tailored to the capabilities of the students and the interests of the instructor. If the core body of knowledge coul...
Article
Full-text available
This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its...
Article
Full-text available
This thesis investigates the application of structured sequence classification models to multilingual natural language processing (NLP). Many tasks tackled by NLP can be framed as classification, where we seek to assign a label to a particular piece of text, be it a word, sentence or document. Yet often the labels which we’d like to assign exhibit...
Conference Paper
Full-text available
Search engines pervade the digital world, mediating most access to information instantaneously. We have found that students can build search engine components, and even entire search engines, in the context of problem-based learning in introductory and intermediate computer science courses. The courses cover a broad range of topics in algorithms,...
Conference Paper
Full-text available
Linguistic research and natural language processing employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for linguistic data and queries. However, several important expressive features required for linguistic queries are missing or hard to express in XPath. In this p...
Conference Paper
Full-text available
The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures...
Article
Full-text available
The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite t...
Article
Most web content exists in a few dozen languages. Hundreds of other languages - the 'low-density languages' - are only represented in scarce quantities on the web. How can we locate, store and describe these low-density resources? In particular, how can we identify linguistically interesting resources, such as translation sets and multilingual docu...
Conference Paper
Full-text available
Annotated linguistic databases are widely used in linguistic research and in language technology development. These annotations are typically hierarchical, and represent the nested structure of syntactic and prosodic constituents. Recently, the LPath language has been proposed as a convenient path-based language for querying linguistic trees. We es...
Article
Full-text available
Spoken word audio collections cover many domains, including radio and television broadcasts, oral narratives, governmental proceedings, lectures, and telephone conversations. The collection, access and preservation of such data is stimulated by political, economic, cultural and educational needs. This paper outlines the major issues in the...
Article
Full-text available
Linguistic research and language technology development employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for linguistic data storage and queries. However, several important expressive features required for linguistic queries are missing in XPath. In this paper, w...
Conference Paper
Full-text available
Linguistic research and language technology development employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for storing and querying linguistic data. However, several important expressive features required for linguistic queries are missing in XPath. In this paper,...
Article
Full-text available
Linguistic forms are inherently multi-dimensional. They exhibit a variety of phonological, orthographic, morphosyntactic, semantic and pragmatic properties. Accordingly, linguistic analysis involves multi-dimensional exploration, a process in which the same collection of forms is laid out in many ways until clear patterns emerge. Equally, language...
Conference Paper
Full-text available
Many linguistic research projects collect large amounts of multimodal data in digital formats. Despite the plethora of data collection applications available, it is often difficult for researchers to identify and integrate applications which enable the management of collections of multimodal data in addition to facilitating the actual collection pr...
Conference Paper
Full-text available
Interlinear text has long been considered a valuable format in the presentation of multilingual data, and a variety of software tools have facilitated the creation and processing of such texts by researchers. Despite the diversity of tools, a common core of editorial functionality is provided. Identifying these core functions has important implicat...
Conference Paper
Full-text available
The prime consideration in designing sustainable language resources is to ensure that they remain interpretable for coming generations of users. In this paper we adopt a new perspective on resource creation - securing the interpretability of data, using a case study of Ega, an endangered African language for which a small amount of legacy data is a...
Article
Many linguistic research projects collect large amounts of multimodal data in digital formats. Despite the plethora of data collection applications available, it is often difficult for researchers to identify and integrate applications which enable the management of collections of multimodal data in addition to facilitating the actual collection pr...
Conference Paper
Full-text available
Interlinear text is a common presentational format for linguistic information, and its creation and management have been greatly facilitated by the development of specialised software. In earlier work we developed a four-level model and corresponding formal specification for interlinear text. Here we describe a suitable XML representation for the m...
Article
Full-text available
The goal of the TalkBank project (http://talkbank.org) is to support data-sharing and direct, community-wide access to naturalistic recordings and transcripts of human and animal communication. Toward this end, we have constructed a web accessible database of transcripts linked to audio and video media within fields such as conversation analysis, c...