
Robert Dale
- Doctor of Philosophy
- Consultant at Language Technology Group
About
249 Publications
95,984 Reads
9,293 Citations
Introduction
I work as an independent consultant, providing expert and unbiased advice in the selection, development and deployment of natural language processing technologies. I write 'Industry Watch', a semi-regular column in the Journal of Natural Language Engineering that explores what's happening in the commercial NLP world. I also produce a 'This Week in NLP' newsletter, which you can sign up for at https://www.language-technology.com/blog.
Current institution
Language Technology Group
Current position
- Consultant
Publications (249)
A lot has happened since OpenAI released ChatGPT to the public in November 2022. We review how things unfolded over the course of the year, tracking significant events and announcements from the tech giants leading the generative AI race and from other players of note; along the way we note the wider impacts of the technology’s progress.
Since the release of ChatGPT at the end of November 2022, generative AI has been talked about endlessly in both the technical press and the mainstream media. Large language model technology has been heralded as many things: the disruption of the search engine, the end of the student essay, the bringer of disinformation … but what does it mean for c...
It’s no secret that the commercial application of NLP technologies has exploded in recent years. From chatbots and virtual assistants to machine translation and sentiment analysis, NLP technologies are now being used in a wide variety of applications across a range of industries. With the increasing demand for technologies that can process human la...
In the past few years, high-quality automated text-to-speech synthesis has effectively become a commodity, with easy access to cloud-based APIs provided by a number of major players. At the same time, developments in deep learning have broadened the scope of voice synthesis functionalities that can be delivered, leading to a growth in the range of...
Funding for AI start-ups in general is booming, and natural language processing as a subfield has not missed out. We take a closer look at early-stage funding over the last year—just over US$1B in total—for companies that offer solutions that are based on or make significant use of NLP, providing a picture of what funders think is innovative and ba...
Automated writing assistance – a category that encompasses a variety of computer-based tools that help with writing – has been around in one form or another for 60 years, although it’s always been a relatively minor part of the NLP landscape. But the category has been given a substantial boost from recent advances in deep learning. We review some h...
GPT-3 made the mainstream media headlines this year, generating far more interest than we’d normally expect of a technical advance in NLP. People are fascinated by its ability to produce apparently novel text that reads as if it was written by a human. But what kind of practical applications can we expect to see, and can they be trusted?
It took a while, but natural language generation is now an established commercial software category. It’s commented upon frequently in both industry media and the mainstream press, and businesses are willing to pay hard cash to take advantage of the technology. We look at who’s active in the space, the nature of the technology that’s available toda...
The end of the calendar year always seems like a good time to pause for breath and reflect on what’s been happening over the last 12 months, and that’s as true in the world of commercial NLP as it is in any other domain. In particular, 2019 has been a busy year for voice assistance, thanks to the focus placed on this area by all the major technolog...
It’s now remarkably easy to release to the world a cloud-based application programming interface (API) that provides some software function as a service. As a consequence, the cloud API space has become very densely populated, so that even if a particular API offers a service whose potential value is considerable, there are many other factors that...
The Journal of Natural Language Engineering is now in its 25th year. The editorial preface to the first issue emphasised that the focus of the journal was to be on the practical application of natural language processing (NLP) technologies: the time was ripe for a serious publication that helped encourage research ideas to find their way into real...
The law has language at its heart, so it’s not surprising that software that operates on natural language has played a role in some areas of the legal profession for a long time. But the last few years have seen an increased interest in applying modern techniques to a wider range of problems, so I look here at how natural language processing is bei...
It seems like there’s yet another cloud-based text analytics Application Programming Interface (API) on the market every few weeks. If you’re interested in building an application using these kinds of services, how do you decide which API to go for? In the previous Industry Watch post, we looked at the text analytics APIs from the behemoths in the...
If you’re in the market for an off-the-shelf text analytics API, you have a lot of options. You can choose to go with a major player in the software world, for whom each AI-related service is just another entry in their vast catalogues of tools, or you can go for a smaller provider that focusses on text analytics as their core business. In this fir...
Vastly improved speech recognition, backed by a more slowly improving ability to make sense of the recognized speech, has brought state-of-the-art NLP into our homes in the form of smart speakers and other devices that listen. There’s no doubt these devices can be incredibly useful, but they may also support incursions into our privacy. We loo...
The commercialisation of natural language processing began over 35 years ago, but it’s only in the last year or two that it’s become substantially more visible, largely because of the intense popular interest in artificial intelligence. So what’s the state of commercial NLP today? We survey the main industry categories of relevance, and offer comme...
We live in a post-truth world. It now matters more whether people think something is true than whether something really is true. This is dangerous, and technology is at least partly to blame. So, as technologists, how can we help to fix this?
By all accounts, 2016 is the year of the chatbot. Some commentators take the view that chatbot technology will be so disruptive that it will eliminate the need for websites and apps. But chatbots have a long history. So what's new, and what's different this time? And is there an opportunity here to improve how our industry does technology transfer?
Ten years ago, Microsoft Word's grammar checker was really the only game in town. The software world, and the world of natural language processing, have changed a lot in that time, so what does the grammar checker marketplace have to offer today?
Machine Translation research suffered a major blow in the 1960s, but it came back with a vengeance. From a commercial point of view, it’s now a mature technology that many Internet users take for granted. We look at where we are now, and consider the scope for new entrants into the market.
With NLP services now widely available via cloud APIs, tasks like named entity recognition and sentiment analysis are virtually commodities. We look at what's on offer, and make some suggestions for how to get rich.
In almost every science fiction movie you’ll see people conversing with machines. Of course, the rise of intelligent personal assistants means you probably do this yourself already. This posting asks: what’s the difference? Also, recent news on Facebook acquisitions, spoken language translation, and sentiment analysis.
In this paper we present a previously unexplored approach to recognizing the textual extent of temporal expressions. Based on the observation that temporal expressions are syntactic constituents, we use functional dependency relations between tokens in a sentence to determine which words in addition to a trigger word belong to the extent of the exp...
This second edition of The Oxford Handbook of Computational Linguistics has been substantially revised, updated, and expanded. Alongside updated accounts of the topics covered in the first edition, it includes 17 new chapters on subjects such as deep learning, word representation, semantic role labelling, translation technology, opinion mining and...
Human speakers generally find it easy to refer to entities in such a way that their hearers can determine who or what is being talked about. In an attempt to model this behaviour, researchers in computational linguistics have explored the development of algorithms that operate in a deliberate manner, choosing attributes of an intended referent on t...
As one of the most well-defined subtasks in Natural Language Generation (NLG), the generation of referring expressions looks like a strong candidate for piloting shared evaluation tasks. Different to other areas of Natural Language Processing, it is still unclear what benefit the introduction of such tasks might have for the field of NLG. Based on...
Incorrect usage of prepositions and determiners constitute the most common types of errors made by non-native speakers of English. It is not surprising, then, that there has been a significant amount of work directed towards the automated detection and correction of such errors. However, to date, the use of different data sets and different task de...
Using the example of Murrinh-Patha, Seiss (2011) illustrates how Australian Aboriginal languages can shed light on the morphology-syntax interface: one aspect of their polysynthetic nature is that information often encoded in phrases and clauses in other languages is instead found in a single morphological word. In this paper, we look at another...
The dissemination of knowledge derived from research and scholarship has a fundamental impact on the ways in which society develops and progresses, and at the same time it feeds back to improve subsequent research and scholarship. Here, as in so many other areas of human activity, the internet is changing the way things work; two decades of emergen...
Generation Challenges 2011 (GenChal'11) was the fifth round of shared-task evaluation competitions (STECs) involving the generation of natural language. It followed four previous events: the Pilot Attribute Selection for Generating Referring Expressions (ASGRE) Challenge in 2007 which had its results meeting at UCNLG+MT in Copenhagen, Denmark; Refe...
Hand-crafted approaches to content determination are expensive to port to new domains. Machine-learned approaches, on the other hand, tend to be limited to relatively simple selection of items from data sets. We observe that in time series domains, textual descriptions often aggregate a series of events into a compact description. We present a simp...
The Big Australian Speech Corpus project incorporates the strategic goals of 30 Chief Investigators from various speech science areas. Speech from 1000 geographically and socially diverse speakers is being recorded using a uniform and automated protocol plus standardized hardware and software to produce a widely applicable and extensible database -...
Recent years have seen a trend towards empirically motivated and more data-driven approaches in the field of referring expression generation (REG). Much of this work has focussed on initial reference to objects in visual scenes. While this scenario of use is one of the strongest contenders for real-world applications of referring expression gen...
Semantic information retrieval requires that we have a means of capturing the semantics of documents; and a potentially useful feature of the semantics of many documents is the temporal information they contain. In particular, the temporal expressions contained in documents provide important information about the time course of the events those doc...
Traditional computational approaches to referring expression generation operate in a deliberate manner, choosing the attributes to be included on the basis of their ability to distinguish the intended referent from its distractors. However, work in psycholinguistics suggests that speakers align their referring expressions with those used previously...
In a collocation, the choice of one lexical item depends on the choice made for another. This poses a problem for simple approaches to lexicalisation in natural language generation systems. In the Meaning-Text framework, recurrent patterns of collocations have been characterised by lexical functions, which offer an elegant way of describing the...
The aim of the Helping Our Own (HOO) Shared Task is to promote the development of automated tools and techniques that can assist authors in the writing task, with a specific focus on writing within the natural language processing community. This paper reports on the results of a pilot run of the shared task, in which six teams participated. We de-s...
Traditional approaches to referring expression generation (REG) have taken as a fundamental requirement the need to distinguish the intended referent from other entities in the context. It seems obvious that this should be a necessary condition for successful reference; but we suggest that a number of recent investigations cast doubt on the signifi...
We describe the second installment of the Challenge on Generating Instructions in Virtual Environments (GIVE-2), a shared task for the NLG community which took place in 2009--10. We evaluated seven NLG systems by connecting them to 1825 users over the Internet, and report the results of this evaluation in terms of objective and subjective measures.
A central purpose of referring expressions is to distinguish intended referents from other entities that are in the context; but how is this context determined? This paper draws a distinction between discourse context (other entities that have been mentioned in the dialogue) and visual context (visually available objects near the intended referent)....
Automatically finding email messages that contain requests for action can provide valuable assistance to users who otherwise struggle to give appropriate attention to the actionable tasks in their inbox. As a speech act classification task, however, automatically recognising requests in free text is particularly challenging. The problem is compound...
In this paper, we propose a new shared task called HOO: Helping Our Own. The aim is to use tools and techniques developed in computational linguistics to help people writing about computational linguistics. We describe a text-to-text generation scenario that poses challenging research questions, and delivers practical outcomes that are useful in th...
The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of...
Unrehearsed spoken language often contains disfluencies. In order to correctly interpret a spoken utterance, any such disfluencies must be identified and removed or otherwise dealt with. Operating on transcripts of speech which contain disfluencies, our particular focus here is the identification and correction of speech repairs using a noisy chann...
This paper describes the First Challenge on Generating Instructions in Virtual Environments (GIVE-1). GIVE is a shared task for generation systems which give real-time natural-language instructions to users in a virtual 3D world. These systems are evaluated by connecting users and NLG systems over the Internet. We describe the design and results of...
Different representational systems permit differing degrees and forms of ambiguity and underspecification in the content they represent. Independently of this observation, a notable feature of natural language as a representational system is that it allows the same content to be expressed in different ways. In this paper, we examine the interaction...
Practitioners and researchers need to stay up-to-date with the latest advances in their fields, but the continual growth in the amount of literature available makes this task increasingly difficult. In this article, we describe the Citation-Sensitive In-Browser Summariser (CSIBS), a new research tool to help manage the literature browsing task. The...
In this chapter, we take the view that much of the existing work on the generation of referring expressions has focused on aspects of the problem that appear to be somewhat artificial when we look more closely at human-produced referring expressions. In particular, we argue that an over-emphasis on the extent to which each property in a description...
In abstractive summarisation, summaries can include novel sentences that are generated automatically. In order to improve the grammaticality of the generated sentences, we model a global (sentence) level syntactic structure. We couch statistical sentence generation as a spanning tree problem in order to search for the best dependency tree spanning...
The GIVE Challenge is a new Internet-based evaluation effort for natural language generation systems. In this paper, we motivate and describe the software infrastructure that we developed to support this challenge.
Unrehearsed spoken language often contains many disfluencies. If we want to correctly interpret the content of spoken language, we need to be able to detect these disfluencies and deal with them appropriately. In the work described here, we use a statistical noisy channel model to detect disfluencies in transcripts of spoken language. Like all sta...
Under an ARC Linkage Infrastructure, Equipment and Facilities (LIEF) grant, speech science and technology experts from across Australia have joined forces to organise the recording of audio-visual (AV) speech data from representative speakers of Australian English in all capital cities and some regional centres. The Big Australian Speech Corpus (th...
As the complexity and sophistication of document processing tools increases, we can expect to see techniques that go beyond the syntactic and semantic features of documents to consider the more nuanced, context-sensitive aspects of language use that generally fall within the realm of pragmatics. The development of such techniques requires data that...
In this paper we present the DANTE system, a tagger for temporal expressions in English documents. DANTE performs both recognition and normalization of these expressions in accordance with the TIMEX2 annotation standard. The system is built on modular principles, with a clear separation between the recognition and normalisation components. The inte...
The amount of scientific material available electronically is forever increasing. This makes reading the published literature, whether to stay up-to-date on a topic or to get up to speed on a new topic, a difficult task. Yet, this is an activity in which all researchers must be engaged on a regular basis. Based on a user requirements analysis, we deve...
In this paper, we explore a corpus of human-produced referring expressions to see to what extent we can learn the referential behaviour the corpus represents. Despite a wide variation in the way subjects refer across a set of ten stimuli, we demonstrate that component elements of the referring expression generation process appear to generalise...
We describe the first installment of the Challenge on Generating Instructions in Virtual Environments (GIVE), a new shared task for the NLG community. We motivate the design of the challenge, describe how we carried it out, and discuss the results of the system evaluation.
Modern digital libraries offer all the hyperlinking possibilities of the World Wide Web: when a reader finds a citation of interest, in many cases she can now click on a link to be taken to the cited work. This paper presents work aimed at providing the same ease of navigation for legacy PDF document collections that were created before the possibility...
The GIVE Challenge is a recent shared task in which NLG systems are evaluated over the Internet. In this paper, we validate this novel NLG evaluation methodology by comparing the Internet-based results with results we collected in a lab experiment. We find that the results delivered by both methods are consistent, but the Internet-based approach o...
The GIVE Challenge is a new Internet-based evaluation effort for natural language generation systems. In this paper, we motivate and describe the software infrastructure that we developed to support this challenge.
Abstract-like text summarisation requires a means of producing novel summary sentences. In order to improve the grammaticality of the generated sentence, we model a global (sentence) level syntactic structure. We couch statistical sentence generation as a spanning tree problem in order to search for the best dependency tree spanning a set...
In the early days of email, widely-used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simp...
In this paper, we present a study of a large corpus of student logic exercises in which we explore the relationship between two distinct measures of difficulty: the proportion of students whose initial attempt at a given natural language to first-order logic translation is incorrect, and the average number of attempts that are required in order to...
In this paper we describe a six-ways parallel public-domain corpus consisting of 2100 United Nations General Assembly Resolutions with translations in the six official languages of the United Nations, with an average of around 3 million tokens per language. The corpus is available in a pre-processed, formatting-normalized TMX format with para...
This paper describes the DSTO/Macquarie University System for Entity Linking (DAMSEL), which competed in the 2009 Text Analysis Conference Knowledge Base Population task. The system achieves 73.5% accuracy. For a given named entity mention, the system selects a set of candidate entities from the knowledge base and selects the most likely candida...
In this paper, we reflect on what we can learn about the processes involved in the generation of referring expressions by looking at a corpus of human-produced data. We find that the data vastly underspecifies what might be involved algorithmically, but it does rule out a number of popular algorithms for referring expression generation as candidate...
Practitioners and researchers need to stay up-to-date with the latest advances in their fields, but the constant growth in the amount of literature available makes this task increasingly difficult. We investigated the literature browsing task via a user requirements analysis, and identified the information needs that biomedical researchers com...
Abstract-like text summarisation requires a means of producing novel summary sentences. In order to improve the grammaticality of the generated sentence, we model a global (sentence) level syntactic structure. We couch statistical sentence generation as a spanning tree problem in order to search for the best dependency tree spanning a set of chosen...
Practitioners and researchers need to stay up-to-date with the latest advances in their fields, but the constant growth in the amount of literature available makes this task increasingly difficult. We investigated the literature browsing task via a user requirements analysis, and identified the information needs that biomedical researchers commonly...
Contemporary speech science is driven by the availability of large, diverse speech corpora. Such infrastructure underpins research and technological advances in various practical, socially beneficial and economically fruitful endeavours, from ASR to hearing prostheses. Unfortunately, speech corpora are not easy to come by because they are both expe...
Large auditory-visual (AV) speech corpora are the grist of modern research in speech science, but no such corpus exists for Australian English. This is unfortunate, for speech science is the brains behind speech technology and applications such as text-to-speech (TTS) synthesis, automatic speech recognition (ASR), speaker recognition and forensic i...
We propose a method for learning dialogue management policies from a fixed data set. The method addresses the challenges posed by Information State Update (ISU)-based dialogue systems, which represent the state of a dialogue as a large set of features, ...
We are interested in developing a better understanding of what it is that students find difficult in learning logic. We use both natural language and diagram-based methods for teaching students the formal language of first-order logic. In this paper, we present some initial results that demonstrate that, when we look at how students construct diagr...
In this paper we present a study on the interpretation of weekday names in texts. Our algorithm for assigning a date to a weekday name achieves 95.91% accuracy on a test data set based on the ACE 2005 Training Corpus, outperforming previously reported techniques run against this same data. We also provide the first detailed comparison of various...
There is a prevailing assumption in the literature on referring expression generation that relations are used in descriptions only 'as a last resort', typically on the basis that including the second entity in the relation introduces an additional cognitive load for either speaker or hearer. In this paper, we describe an experiment that attempts t...
Here's a round-up of notable events in the commercial language technology space in the last quarter of 2007, organized by broad application category. A common thread that pops up throughout many of these is the integration of language technology into social networking applications and other related Web 2.0 themes. I'd put my money on this being a h...
The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standar...
We examine the problem of content selection in statistical novel sentence generation. Our approach models the processes performed by professional editors when incorporating material from additional sentences to support some initially chosen key summary sentence, a process we refer to as Sentence Augmentation. We propose and evaluate a method ca...
It has long been established that many workplace tasks are managed through email communication, and that these tasks involve the exchange of requests and commitments. Users would be better able to manage and monitor tasks in their email if systems could identify the utterances which place responsibility for action on themselves or others. Such syst...
Although the literature contains reports of very high accuracy figures for the recognition of named entities in text, there are still some named entity phenomena that remain problematic for existing text processing systems. One of these is the ambiguity of conjunctions in candidate named entity strings, an all-too-prevalent problem in corporate an...
When we describe an object in order to enable a listener to identify it, we often do so by indicating the location of that object with respect to other objects in a scene. This requires the use of a relational referring expression; while these are very common, they are relatively unexplored in work on referring expression generation. In this pa...
In this paper we present the DANTE system, a tagger for temporal expressions in English documents. DANTE performs both recognition and normalization of the expressions in accordance with the TIMEX2 annotation standard. The system is built on modular principles, with a clear separation between the recognition and normalisation components. The inte...
We examine the problem of content selection in statistical novel sentence generation. Our approach models the processes performed by professional editors when incorporating material from additional sentences to support some initially chosen key summary sentence, a process we refer to as Sentence Augmentation. We propose and evaluate a method called...
In this paper we present a study on the interpretation of weekday names in texts. Our algorithm for assigning a date to a weekday name achieves 95.91% accuracy on a test data set based on the ACE 2005 Training Corpus, outperforming previously reported techniques run against this same data. We also provide the first detailed comparison of various ap...
Temporal expressions (references to points in time or periods of time) are widespread in text, and their proper interpretation is essential for any natural language processing task that requires the extraction of temporal information. Work on the interpretation of temporal expressions in text has generally been pursued in one of two paradigms: the fo...
I've just come back from the 45th Annual Meeting of the Association for Computational Linguistics (ACL) in Prague; this was the biggest ever ACL conference, with more than 1,000 people attending for the first time. Attendance at ACL conferences has been growing year on year, and that is a sign of a healthy field. Another sign of health is industry...
Although the literature contains reports of very high accuracy figures for the recognition of named entities in text, there are still some named entity phenomena that remain problematic for existing text processing systems. One of these is the ambiguity of conjunctions in candidate named entity strings, an all-too-prevalent problem in corporate and...
Summary form only given. The task of referring expression generation is concerned with determining what semantic content should be used in a reference to an intended referent so that the hearer will be able to identify that referent. The task has been a focus of interest within natural language generation at least since the early 1980s, in part bec...