Steven Bird

  • PhD (Edin)
  • Professor at Charles Darwin University

About

Publications: 183
Reads: 250,721
Citations: 16,314
Introduction
Steven Bird is conducting social and technological experiments in the future evolution of the world's languages. Together with his students and colleagues, he is developing scalable methods for preserving disappearing words and worldviews for future generations of speakers and scholars. He is collaborating with speech communities in diasporas and ancestral homelands to design new approaches to language maintenance and revitalisation. Steven studied computer science at the University of Melbourne before completing a PhD in computational linguistics at the University of Edinburgh. He has conducted fieldwork on endangered languages in West Africa, South America, Central Asia, Melanesia, and Australia.
Current institution
Charles Darwin University
Current position
  • Professor
Additional affiliations
July 2017 - present
Charles Darwin University
Position
  • Professor
October 2002 - June 2017
University of Melbourne
Position
  • Professor (Associate)
July 1998 - September 2002
University of Pennsylvania
Position
  • Managing Director

Publications

Conference Paper
The Open Language Archives Community (OLAC) provides a comprehensive infrastructure that has allowed our community to index and discover language resources over the past 20 years. However, OLAC infrastructure has fallen behind as the digital libraries community has continued to evolve. New investment is required in order to move OLAC into the digit...
Preprint
Full-text available
Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches...
Conference Paper
Full-text available
Kunwinjku is a polysynthetic language spoken in northern Australia. Members of the community have expressed interest in co-developing language applications which could assist in the production of written language resources for education and language learning. Modelling Kunwinjku morphology is a step towards accomplishing these goals. We discuss som...
Chapter
Full-text available
Speakers of the world's endangered languages are rapidly gaining access to broadband internet on mobile devices. Meanwhile social mobile technologies continue to transform the way people work together. I believe that conditions are ripe for the development of a new generation of software for endangered languages. This software will enable new ways...
Conference Paper
Full-text available
We describe a reusable Web component for capturing talk about images. A speaker is prompted with a series of images and talks about each one while adding gestures. Others can watch the audiovisual slideshow, and navigate forwards and backwards by swiping on the images. The component supports phrase-aligned respeaking, translation, and commentary. T...
Conference Paper
Full-text available
A new computer science curriculum has been developed for the Victorian Certificate of Education. It gives students direct entry into second-year university computer science. The curriculum focuses on data structures and algorithms, with an emphasis on the graph abstract data type and graph algorithms. We taught a pilot course during 2014 involving...
Article
Full-text available
Crosslingual word embeddings represent lexical items from different languages in the same vector space, enabling transfer of NLP tools. However, previous attempts had expensive resource requirements, difficulty incorporating monolingual data or were unable to handle polysemy. We address these drawbacks in our method which takes advantage of a high...
Conference Paper
Full-text available
Cross-lingual transfer has been shown to produce good results for dependency parsing of resource-poor languages. Although this avoids the need for a target language treebank, most approaches have still used large parallel corpora. However, parallel data is scarce for low-resource languages, and we report a new method that does not need parallel dat...
Conference Paper
Full-text available
Proliferating smartphones and mobile software offer linguists a scalable, networked recording device. This paper describes Aikuma, a mobile app that is designed to put the key language documentation tasks of recording, respeaking, and translating in the hands of a speech community. After motivating the approach we describe the system and briefly re...
Conference Paper
Full-text available
We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying "good" training sentences from the parallel corpus and applying...
Conference Paper
Full-text available
Existing methods for collecting texts from endangered languages are not creating the quantity of data that is needed for corpus studies and natural language processing tasks. This is because the process of transcribing and translating from audio recordings is too onerous. A more effective method, we argue, is to involve local speakers in the field...
Article
Full-text available
With hundreds of endangered and under-documented languages, Papua New Guinea presents an enormous challenge to the documentary linguistics community. This article reports on a workshop held at the University of Goroka in May and June of 2012. The workshop aimed to collect written texts and their translations for several languages, while building lo...
Article
Full-text available
Databases of hierarchically annotated text occupy a central place in linguistic research and language technology development. We describe a new approach to tree query which we call "Query by Annotation". Users express a query by annotating a tree, and the annotation is compiled into an expression in a path language. The result trees are overl...
Article
This article presents the fundamentals of descriptive phonology and gives an overview of computational phonology. Phonology is the systematic study of sounds used in language, and their composition into syllables, words, and phrases. It introduces some of the key concepts of phonology by simple examples involving real data and gives a brief discuss...
Conference Paper
Full-text available
We describe the design of a comparable corpus that spans all of the world's languages and facilitates large-scale cross-linguistic processing. This Universal Corpus consists of text collections aligned at the document and sentence level, multilingual wordlists, and a small set of morphological, lexical, and syntactic annotations. The design encompa...
Conference Paper
Full-text available
This paper explores approaches to sentiment classification of U.S. Congressional floor-debate transcripts. Collective classification techniques are used to take advantage of the informal citation structure present in the debates. We use a range of methods based on local and global formulations and introduce novel approaches for incorporating the o...
Conference Paper
Full-text available
We present a grand challenge to build a corpus that will include all of the world's languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that t...
Conference Paper
Full-text available
Can the speakers of small languages, which may be remote, unwritten, and endangered, be trained to create an archival record of their oral literature, with only limited external support? This paper describes the model of “Basic Oral Language Documentation”, as adapted for use in remote village locations, far from digital archives but close to endan...
Article
Full-text available
Large databases of linguistic annotations are used for testing linguistic hypotheses and for training language processing models. These linguistic annotations are often syntactic or prosodic in nature, and have a hierarchical structure. Query languages are used to select particular structures of interest, or to project out large slices of a corpus...
Conference Paper
Full-text available
A variety of query systems have been developed for interrogating parsed corpora, or treebanks. With the arrival of efficient, wide-coverage parsers, it is feasible to create very large databases of trees. However, existing approaches that use in-memory search, or relational or XML database technologies, do not scale up. We describe a metho...
Conference Paper
Full-text available
Under an ARC Linkage Infrastructure, Equipment and Facilities (LIEF) grant, speech science and technology experts from across Australia have joined forces to organise the recording of audio-visual (AV) speech data from representative speakers of Australian English in all capital cities and some regional centres. The Big Australian Speech Corpus (th...
Article
Full-text available
This article describes a framework for incorporating referential semantic information from a world model or ontology directly into a probabilistic language model of the sort commonly used in speech recognition, where it can be probabilistically weighted ...
Conference Paper
Full-text available
Contemporary speech science is driven by the availability of large, diverse speech corpora. Such infrastructure underpins research and technological advances in various practical, socially beneficial and economically fruitful endeavours, from ASR to hearing prostheses. Unfortunately, speech corpora are not easy to come by because they are both expe...
Article
Full-text available
Large auditory-visual (AV) speech corpora are the grist of modern research in speech science, but no such corpus exists for Australian English. This is unfortunate, for speech science is the brains behind speech technology and applications such as text-to-speech (TTS) synthesis, automatic speech recognition (ASR), speaker recognition and forensic i...
Article
Full-text available
Linguistic forms are inherently multi-dimensional. They exhibit a variety of phonological, orthographic, morphosyntactic, semantic and pragmatic properties. Accordingly, linguistic analysis involves multi-dimensional exploration, a process in which the same collection of forms is laid out in many ways until clear patterns emerge. Equally, language...
Article
The CSIRO Division of Building Research (DBR) has built a number of text-based and graphics-based design code expert systems in PROLOG. Each of these systems involves several thousand lines of PROLOG and stretches to the limit the capacity of the IBM ATs on which they run. We have found PROLOG to be a very powerful language for expressing the...
Article
Full-text available
The Natural Language Toolkit (NLTK) is widely used for teaching natural language processing to students majoring in linguistics or computer science. This paper describes the design of NLTK, and reports on how it has been used effectively in classes that involve different mixes of linguistics and computer science students. We focus on three key issu...
Conference Paper
Full-text available
Over the past decade, a variety of expressive linguistic query languages have been developed. The most scalable of these have been implemented on top of an existing database engine. However, with the arrival of efficient, wide-coverage parsers, it is feasible to parse text on a scale that is several orders of magnitude larger. We show that the...
Conference Paper
Full-text available
The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standar...
Article
Full-text available
This paper describes work the Open Language Archives Community (OLAC) is doing to contribute to a global infrastructure for the sustainability of language resources. After offering a definition of language resource, it addresses the issue of what makes language resources sustainable by defining six necessary and sufficient conditions for their sust...
Article
Full-text available
Discourse in and about computational linguistics depends on a shared body of knowledge. However, little content is shared across the introductory courses in this field. Instead, they typically cover a diverse assortment of topics tailored to the capabilities of the students and the interests of the instructor. If the core body of knowledge coul...
Article
Full-text available
This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its...
Article
Full-text available
This thesis investigates the application of structured sequence classification models to multilingual natural language processing (NLP). Many tasks tackled by NLP can be framed as classification, where we seek to assign a label to a particular piece of text, be it a word, sentence or document. Yet often the labels which we’d like to assign exhibit...
Conference Paper
Full-text available
Search engines pervade the digital world, mediating most access to information instantaneously. We have found that students can build search engine components, and even entire search engines, in the context of problem-based learning in introductory and intermediate computer science courses. The courses cover a broad range of topics in algorithms,...
Article
Search engines pervade the digital world, mediating most access to information instantaneously. We have found that students can build search engine components, and even entire search engines, in the context of problem-based learning in introductory and intermediate computer science courses. The courses cover a broad range of topics in algorithms, d...
Conference Paper
Full-text available
Linguistic research and natural language processing employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for linguistic data and queries. However, several important expressive features required for linguistic queries are missing or hard to express in XPath. In this p...
Conference Paper
Full-text available
The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures...
Article
Full-text available
The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite t...
Article
Most web content exists in a few dozen languages. Hundreds of other languages - the 'low-density languages' - are only represented in scarce quantities on the web. How can we locate, store and describe these low-density resources? In particular, how can we identify linguistically interesting resources, such as translation sets and multilingual docu...
Conference Paper
Full-text available
Annotated linguistic databases are widely used in linguistic research and in language technology development. These annotations are typically hierarchical, and represent the nested structure of syntactic and prosodic constituents. Recently, the LPath language has been proposed as a convenient path-based language for querying linguistic trees. We es...
Article
Full-text available
Spoken word audio collections cover many domains, including radio and television broadcasts, oral narratives, governmental proceedings, lectures, and telephone conversations. The collection, access and preservation of such data is stimulated by political, economic, cultural and educational needs. This paper outlines the major issues in the...
Article
Full-text available
Linguistic research and language technology development employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for linguistic data storage and queries. However, several important expressive features required for linguistic queries are missing in XPath. In this paper, w...
Conference Paper
Full-text available
Linguistic research and language technology development employ large repositories of ordered trees. XML, a standard ordered tree model, and XPath, its associated language, are natural choices for storing and querying linguistic data. However, several important expressive features required for linguistic queries are missing in XPath. In this paper,...
Article
Full-text available
Linguistic forms are inherently multi-dimensional. They exhibit a variety of phonological, orthographic, morphosyntactic, semantic and pragmatic properties. Accordingly, linguistic analysis involves multi-dimensional exploration, a process in which the same collection of forms is laid out in many ways until clear patterns emerge. Equally, language...
Conference Paper
Full-text available
Many linguistic research projects collect large amounts of multimodal data in digital formats. Despite the plethora of data collection applications available, it is often difficult for researchers to identify and integrate applications which enable the management of collections of multimodal data in addition to facilitating the actual collection pr...
Conference Paper
Full-text available
Interlinear text has long been considered a valuable format in the presentation of multilingual data, and a variety of software tools have facilitated the creation and processing of such texts by researchers. Despite the diversity of tools, a common core of editorial functionality is provided. Identifying these core functions has important implicat...
Conference Paper
Full-text available
The prime consideration in designing sustainable language resources is to ensure that they remain interpretable for coming generations of users. In this paper we adopt a new perspective on resource creation - securing the interpretability of data, using a case study of Ega, an endangered African language for which a small amount of legacy data is a...
Article
Many linguistic research projects collect large amounts of multimodal data in digital formats. Despite the plethora of data collection applications available, it is often difficult for researchers to identify and integrate applications which enable the management of collections of multimodal data in addition to facilitating the actual collection pr...
Conference Paper
Full-text available
Interlinear text is a common presentational format for linguistic information, and its creation and management have been greatly facilitated by the development of specialised software. In earlier work we developed a four-level model and corresponding formal specification for interlinear text. Here we describe a suitable XML representation for the m...
Article
Full-text available
The goal of the TalkBank project (http://talkbank.org) is to support data-sharing and direct, community-wide access to naturalistic recordings and transcripts of human and animal communication. Toward this end, we have constructed a web accessible database of transcripts linked to audio and video media within fields such as conversation analysis, c...
Article
Full-text available
Spoken word audio collections cover many domains, including radio and television broadcasts, oral narratives, governmental proceedings, lectures, and telephone conversations. The collection, access and preservation of such data is stimulated by political, economic, cultural and educational needs. This paper outlines the major issues in the field, r...
Conference Paper
Full-text available
Language technology makes extensive use of hierarchically annotated text and speech data. These databases are stored in flat files and manipulated using corpus-specific query tools or special-purpose scripts. While the size of these databases and the range of applications has grown rapidly in recent years, neither method for managing the data has...
Conference Paper
Full-text available
The goal of the TalkBank project (http://talkbank.org) is to support data-sharing and direct, community-wide access to naturalistic recordings and transcripts of human and animal communication. Toward this end, we have constructed a web accessible database of transcripts linked to audio and video media within fields such as conversation analysis, c...
Article
Full-text available
Large databases of annotated text and speech are widely used for developing and testing language technologies. However, the size of these corpora and associated language models are outpacing the growth of processing power and network bandwidth available to most researchers. The solution, we believe, is to exploit four characteristics of languag...
Article
The Open Language Archives Community is an international partnership of institutions and individuals that is creating a worldwide virtual library of language resources. We report on the development of OLAC metadata as a specialization of Dublin Core metadata and then describe the interoperability framework in which the metadata is validated, dissem...
Article
Annotation graphs provide an efficient and expressive data model for linguistic annotations of time-series data. This paper reports progress on a complete software infrastructure supporting the rapid development of tools for transcribing and annotating time-series data. This general-purpose infrastructure uses annotation graphs as the underlying mo...
Article
Full-text available
In this paper we give a brief and broad characterisation of Declarative Phonology in terms of certain key aspects, both theoretical and methodological. In Section 2 we present our identification of constraints on well-formedness with partial descriptions of phonological representations. In Section 3 we discuss the declarative model of constraint inter...
Article
Full-text available
We describe the design and early implementation of an extensible, component-based software architecture for natural language engineering applications which interfaces with high performance distributed computing services. The architecture leverages existing linguistic resource description and discovery mechanisms based on metadata descriptions, comb...
Preprint
As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic...
Article
Full-text available
Annotation graphs and annotation servers offer infrastructure to support the analysis of human language resources in the form of time-series data such as text, audio and video. This paper outlines areas of common need among empirical linguists and computational linguists. After reviewing examples of data and tools used or under development for each...
Article
Full-text available
Annotation graphs provide an efficient and expressive data model for linguistic annotations of time-series data. This paper reports progress on a complete open-source software infrastructure supporting the rapid development of tools for transcribing and annotating time-series data. This general-purpose infrastructure uses annotation graphs as the un...
Article
Full-text available
The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The Dublin Core (DC) Element Set and the OAI Protocol have provided a solid foundation for the OLAC framework. However, we need more precision in community-specific aspects o...
Article
Full-text available
New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World-Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resou...
Article
Full-text available
We describe a proposal for an extensible, component-based software architecture for natural language engineering applications. Our model leverages existing linguistic resource description and discovery mechanisms based on extended Dublin Core metadata. In addition, the application design is flexible, allowing disparate components to be combined to...
Article
As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic...
Preprint
The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The Dublin Core (DC) Element Set and the OAI Protocol have provided a solid foundation for the OLAC framework. However, we need more precision in community-specific aspects o...
Article
Full-text available
As language data and associated technologies proliferate and as the language resources community expands, it is becoming increasingly difficult to locate and reuse existing resources. Are there any lexical resources for such-and-such a language? What tool works with transcripts in this particular format? What is a good format to use for linguistic data of...
Article
Full-text available
This paper investigates the incorporation of a non-procedural theory of phonology into HPSG, based on the 'one-level' model of Bird & Ellison (1992). The standard rule-representation distinction is replaced by the description-object distinction which is more germane in the context of constraint-based grammar. Prosodic domains, which limit the appli...
Article
Full-text available
In this paper I report on an investigation into the problem of assigning tones to pitch contours. The proposed model is intended to serve as a tool for phonologists working on instrumentally obtained pitch data from tone languages. Motivation and exemplification for the model is provided by data taken from my fieldwork on Bamileke Dschang (Cameroon...
Article
Full-text available
A lexical database tool tailored for phonological research is described. Database fields include transcriptions, glosses and hyperlinks to speech files. Database queries are expressed using HTML forms, and these permit regular expression search on any combination of fields. Regular expressions are passed directly to a Perl CGI program, enabling the...
Article
The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. The Dublin Core (DC) Element Set and the OAI Protocol have provided a solid foundation for the OLAC framework. However, we need more precision in community-specific aspects o...
Article
Should an alphabetic orthography for a tone language include tone marks? Opinion and practice are divided along three lines: zero marking, phonemic marking and various reduced marking schemes. This paper examines the success of phonemic tone marking for Dschang, a Grassfields Bantu language which uses tone to distinguish lexical items and some gram...
Article
Full-text available
this article will be to explain some of the alternations and distributional asymmetries in terms of syllable structure. The only consonant clusters which occur have the form (N)C(G)(h) where N is a homorganic nasal and G is a glide, or have the form (N)OL, where O is an obstruent and L is a liquid, as is characteristic of Niger-Congo languages in g...
Article
Tone languages provide some interesting challenges for the designers of new orthographies.
Article
Full-text available
The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data struct...
