LINTest, A development tool for testing dialogue systems
Lars Degerstedt, Arne Jönsson
Department of Computer and Information Science
Linköpings universitet, Linköping, Sweden
larde@ida.liu.se, arnjo@ida.liu.se
Abstract
In this paper we present a development tool for testing dia-
logue systems. Testing software through the specification is im-
portant for software development in general and should be as au-
tomated as possible. For dialogue systems, the corpus can be seen
as one part of the specification and the dialogue system should be
tested on available corpora on each new build. The testing tool
is inspired by work on agile software development methods, test-driven development and unit testing, and can be used in two modes
and during various phases of development.
Index Terms: dialogue systems, development tools, corpus, test-
ing.
1. Introduction
Natural Language Processing (NLP) and Language Engineering (LE) are two complementary branches of the field of language technology aiming at similar goals. Following Cunningham [1], "NLP is part of the science of computation whose subject matter is computer systems that process human language" [1, p. 4], whereas "Language Engineering is the discipline or act of engineering software systems that perform tasks involving processing human language" [1, p. 5]. Thus, although both NLP and LE study the processing of human language, there is a difference in focus. For LE the primary goal is developing software products that are stable, robust and effective, whereas NLP focuses on investigations of computational means of a more theoretical nature for processing human language.
From a language engineering perspective the goal is not only
to construct language technology but to develop it following the
principles of engineering—measurable and predictable systems,
methodologies, and resource bounds [1]. Factors of reuse, effective development, and guaranteed quality then become important issues. For the language engineer there is thus a need to re-use previously developed language components and to make them gradually more generic.
This is important not only for industry projects but also for re-
search projects.
The work presented in this paper is based on iterative devel-
opment [2, 3, 4] and an evolutionary view on project development,
cf. [5], much resembling the ideas advocated by the agile manifesto¹.
The agile methodology facilitates robust handling of surprises
and changes during the course of a project. It emphasises the need
for collaboration between team players with different backgrounds and purposes [6]. Moreover, the agile methodology strongly influences current and future software development tools, for instance through the strong emphasis on unit testing and code refactoring. Mainstream software theory is thereby also becoming more agile, and so agile technology will by necessity influence a related field such as language technology.
¹http://agilemanifesto.org/
When building an advanced interactive system, we are work-
ing with sparse knowledge models in new territory due to the in-
herent complexity in the underlying communicative processes. For
dialogue systems, we typically combine different theories in a new blend for each system, in order to deal with the complexity and uniqueness of natural language. We believe that
the agile methodology has much to offer such unforeseeable de-
velopment processes, and also that natural language processing is
an interesting border-case to study for the agile development com-
munity.
The agile pieces and project competences must all fit together
for the team to work well on all aspects. Both the individual and
the team should strive to become a Jack of All Trades [7] to be
able to handle the inherently unforeseeable situation during devel-
opment. This includes handling surprises on both linguistic and
software level well, in an unforeseeable and rapidly changing sys-
tem construction process. The team must embrace change on all
levels, just as natural language itself does: it evolves.
Language engineering must consider both adaptiveness of the
agile methodology for the software development (cf.[8]) and cu-
mulative language modelling on the linguistic level.
Agile development emphasises unit testing, and for many agile methods test-driven development is central: the tests are written before the code. JUnit² is a tool developed to facilitate unit testing that is easy to use, that can be integrated in the normal build phase, and that automates testing of software components. With JUnit, users write test cases that are executed on each new build, to make sure that a certain component behaves as before.
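The corpus-as-specification idea behind this can be made concrete with a small sketch. The function names and the toy system below are our own illustration, not part of JUnit or LINTest:

```python
def run_dialogue(system, user_turns):
    """Feed each user turn to the dialogue system and collect its replies."""
    return [system(turn) for turn in user_turns]

def dialogue_passes(system, dialogue):
    """A dialogue is a list of (user_utterance, expected_reply) pairs.

    The test passes only if the system reproduces every expected reply,
    mirroring how JUnit re-runs its test cases on each new build.
    """
    replies = run_dialogue(system, [user for user, _ in dialogue])
    return replies == [expected for _, expected in dialogue]

# A toy echo-like system, used only for illustration.
def toy_system(utterance):
    return "You said: " + utterance

dialogue = [("hello", "You said: hello"),
            ("which birds sing?", "You said: which birds sing?")]
print(dialogue_passes(toy_system, dialogue))  # → True
```

Any change to the system's replies makes the corresponding dialogue fail, which is exactly the regression signal the build should surface.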
2. Dialogue systems development
When developing dialogue systems it is crucial to have knowledge
on how users will interact with the system. Such knowledge is
often elicited from corpora, collected in various ways, natural, ar-
tificial or a mixture. The corpus is then used in the development of
the dialogue system. From a testing perspective we identify three
development activities:
Specification, when new behaviour is implemented
Refactoring, when the code is re-written without changing the external behaviour
Refinement, when the external behaviour is refined
²http://www.junit.org/
The activities are not performed in any specific order and serve mainly to illustrate our testing tool.
2.1. Specification
Specification includes all activities where new functionality is de-
veloped for the dialogue system. From a corpus perspective, initial
development can be based on a subset of the corpus handling basic
utterances, for instance, simple information requests. Already at
this stage we see a need for a tool allowing for easy testing of the
new features.
We also see a need for administration of the corpus dividing
it into those utterances that can be handled by the current system,
those that should have been handled but for some reason did not
work and those that have not yet been considered. When more
advanced features are to be incorporated into the dialogue system,
this means that new corpora, or new subsets of existing corpora,
are to be included in the test set.
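This three-way administration of the corpus can be sketched as follows. The corpus layout (dicts with user, expected and considered fields) is a hypothetical illustration, not LINTest's actual format:

```python
def partition_corpus(corpus, system):
    """Split the corpus into the three sets described above: utterances the
    current system handles, utterances that should work but do not, and
    utterances that have not yet been considered."""
    handled, broken, pending = [], [], []
    for entry in corpus:
        if not entry["considered"]:
            pending.append(entry)   # not yet considered
        elif system(entry["user"]) == entry["expected"]:
            handled.append(entry)   # handled by the current system
        else:
            broken.append(entry)    # should work, but does not
    return handled, broken, pending

# Toy system and corpus, for illustration only.
def toy_system(utterance):
    return {"hello": "Hi!"}.get(utterance, "?")

corpus = [
    {"user": "hello", "expected": "Hi!", "considered": True},
    {"user": "which larks sing?", "expected": "All of them.", "considered": True},
    {"user": "tell me a joke", "expected": None, "considered": False},
]
handled, broken, pending = partition_corpus(corpus, toy_system)
print(len(handled), len(broken), len(pending))  # → 1 1 1
```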
2.2. Refactoring
Refactoring is the technique of changing the internal structure of the dialogue system without changing the external behaviour [9].
Refactoring is a common activity, especially in language engineer-
ing, where robust and generic code is in focus. Testing plays an
important role in refactoring to ensure that everything that worked
before works also after refactoring. Testing the re-factored system
on all dialogues in the corpus that the system handled correctly
before is therefore crucial. Such testing is preferably done auto-
matically and in the background at each new build.
2.3. Refinement
With refinement we understand the process of fine-tuning the ex-
ternal behaviour of the system. We separate refinement from spec-
ification as the requirements on a testing tool differ. Specification
means adding new behaviour and often includes extending the corpus with dialogues covering the new phenomenon to be developed.
Refinement also means changing the external behaviour, but only
slightly. The new behaviour of the dialogue system is to be so close
to the previous that the corpus in principle should work. Changing
the prompt is a typical example of a refinement: the old corpus will no longer pass the test because the prompt has changed. Once the
prompt in the corpus is changed to the new prompt the test will
pass again. Thus, a minor update of the corpus is needed. The
testing tool should allow for such update of the corpora reflecting
the new refined behaviour.
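Such a corpus update after a prompt change can be sketched as follows, assuming (for illustration only, this is not LINTest's actual format) that a corpus is a list of (speaker, utterance) pairs:

```python
def update_prompt(corpus, old_prompt, new_prompt):
    """Rewrite every system turn so that the refined prompt replaces the
    old one; user turns are left untouched."""
    return [(speaker,
             utterance.replace(old_prompt, new_prompt)
             if speaker == "system" else utterance)
            for speaker, utterance in corpus]

# A two-turn toy corpus (utterances in Swedish, as in our logs).
corpus = [("user", "larks"),
          ("system", "Välj någon av följande lärkor: trädlärka")]
updated = update_prompt(corpus, "Välj någon", "Välj en")
print(updated[1][1])  # → Välj en av följande lärkor: trädlärka
```

After this one-shot update the refined system and the corpus agree again, so the tests pass.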
3. LINTest
LINTest is a tool for testing dialogue systems against a corpus³. LINTest is inspired by the simplicity of JUnit, but it is not aimed at unit testing. It facilitates test-driven development of dialogue systems, but is intended for testing dialogue behaviour. LINTest is intended to be easy to use for any dialogue systems developer and is less ambitious than interactive development environments such as GATE⁴.
³LINTest is available for download from http://nlpfarm.sourceforge.net/
⁴http://gate.ac.uk/
Figure 1: LINTest usage. (The figure shows LINTest connected to a dialogue system, with results going either to a log file or to an interactive corpus editor.)
LINTest runs a corpus collection on a dialogue system and
compares the output from the dialogue system with the corpus. If
the corpus thus produced is the same as the original corpus the dia-
logue system is assumed to behave according to current specifica-
tion. LINTest only runs on typed interactions, but any transcribed spoken dialogue can in theory be used as long as the transcribed utterances can be accepted by the dialogue system.
LINTest can be used in two different modes: batch mode and interactive mode, see Figure 1. In batch mode nothing really happens as long as the tests pass; the results are presented in a log file. In interactive mode LINTest allows for more verbose testing and behaves more like a debugging tool.
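A rough sketch of what batch mode amounts to: a quiet regression run that only writes log lines. The log-line format imitates the LINTest output in Figure 3; all function and variable names are our own:

```python
def run_batch(system, corpus):
    """Run every dialogue silently and collect log lines; as long as the
    tests pass, only the final summary line is of interest."""
    log, runs, failures = [], 0, 0
    for dialogue_id, turns in corpus.items():
        runs += 1
        for i, (user, expected) in enumerate(turns, start=1):
            actual = system(user)
            if actual != expected:
                failures += 1
                log.append("Dialog failed: %s(turn %d) utterance was %s expected %s"
                           % (dialogue_id, i, actual, expected))
    log.append("Tests run: %d Failures: %d Errors: %d" % (runs, failures, 0))
    return log

# Toy system and corpus, for illustration only.
def toy_system(utterance):
    return "ok"

corpus = {"d1": [("hello", "ok")], "d2": [("bye", "bye bye")]}
for line in run_batch(toy_system, corpus):
    print(line)
```

In interactive mode the same mismatches would instead be presented to the developer for inspection and corpus update.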
Although both modes can be used in every phase of dialogue
systems development, we believe that batch mode is more useful
during refactoring and the interactive mode is more useful during
refinement.
Testing during refactoring resembles unit testing in the sense
that we want to ensure that previous external behaviour is pre-
served no matter how we rewrite the code. Testing should then
interfere as little as possible when all is fine.
Refinement, on the other hand, is about changing the external
behaviour, and consequently, very few tests will pass. In interac-
tive mode the software developer is presented with the result from
running the modified system together with the corpus and can in-
spect effects of the refinement and also update the corpus to avoid
having LINTest fail on further occasions of running the refined di-
alogue system.
Testing during specification adheres to the principle of writing the test code first [2]. Again, this is by no means unit testing; LINTest is used for system testing, i.e. testing the functionality of the whole system. In this phase of the development a mixture of interactive
and batch mode is used. Initially, when new dialogues are added,
interactive mode helps updating the corpus whereas batch mode
allows for a less intrusive testing behaviour when developing a
certain behaviour.
4. Illustration
To illustrate the use of LINTest we will briefly present how we have used LINTest in the development of BIRDQUEST [10].
run-lintest:
[lintest] Dialog failed: 1141046899757(turn 4) utterance was Välj någon av följande lärkor: trädlärka, tofslärka, sånglärka och berglärka
[lintest] expected Välj en av följande lärkor: trädlärka, tofslärka, sånglärka och berglärka
[lintest] Dialog failed: 1141128937123(turn 2) utterance was Välj någon av följande svanar: sångsvan, mindre sångsvan och knölsvan
[lintest] expected Välj en av följande svanar: sångsvan, mindre sångsvan och knölsvan
[lintest] Tests run: 20 Failures: 2 Errors: 0
Figure 3: Typical output from LINTest. (The numbers after Dialog failed: are automatically generated log-file names.)
<target name="run-lintest">
<lintest agentclass="birdquest.TestAgent"
showoutput="true"
agentpath="${build.path}">
<fileset dir="${source.path}/test/corpus"/>
</lintest>
</target>
Figure 2: A typical target for testing BIRDQUEST using the dialogue system class birdquest.TestAgent with the corpus
specified in the fileset parameter, presenting output on the
screen. LINTest allows more parameters, e.g. to specify a file for
output and to halt on failure.
The BIRDQUEST system is a dialogue system aimed at full support for basic dialogue capabilities, such as clarification subdialogues and focus management, developed for Swedish National Television (SVT), with information about Nordic birds. BIRDQUEST was developed from a corpus of 264 questions about birds collected by SVT on a web site for one of their nature programs, where the public could send in questions. Unfortunately, we did not have LINTest for this initial development.
BIRDQUEST is available on-line⁵ and we have logged interactions for a number of years. The corpus currently (April 1, 2006) comprises 284 dialogues with a total of 2046 user utterances (ranging from 1 to 56 user turns); all interactions are in Swedish. This corpus has been used for refinement and refactoring of BIRDQUEST as briefly presented below. The presentation is aimed at demonstrating the use of LINTest.
4.1. Refactoring
As stated above, with refactoring we mean coding activities that
do not change the external behaviour of the system. This ranges
from minor changes, such as changing variable names, to re-
implementing the whole system, for instance, in a new program-
ming language.
Using LINTest during refactoring means that the system's external behaviour should be as before, and we need means to run the test automatically on every build. Thus, LINTest is preferably used with a build tool. Currently we use ant, and in the ant file we specify the LINTest search path and additional parameters in the target specification. An example entry is seen in Figure 2.
⁵http://maj4.ida.liu.se:8080/birdquest
When building a new dialogue system with the target as speci-
fied in Figure 2 we get the usual build information and information
on the number of test runs, normally that there were no errors or
failures. A more interesting case is seen in Figure 3, where there are two failures, due to refinement.
4.2. Refinement
With refinement we understand changes to the system such that the corpus needs to be updated. Again consider Figure 3, which illustrates the LINTest output when we change Choose some of the following birds: (in Swedish: Välj någon av följande fåglar:) to Choose one of the following birds: (in Swedish: Välj en av följande fåglar:). This refinement of the system behaviour means that many dialogues in the corpus will fail, and we therefore need to update the corpus to reflect the new system behaviour.
This is done using the interactive editor in the upper right of Figure 1. Figure 4 depicts a screen shot of the interactive editor.
The interactive editor presents the result of the build and also displays utterances that differ. The user has four buttons. Replace means replacing the current utterance with the new one. Replace all means replacing all occurrences of the error. By choosing Replace all the whole corpus is updated, and when the developer continues testing, this particular refinement will not cause a failure. This option also allows for specifying which part(s) of the utterance to replace. The Ignore button in LINTest is used to mark errors as temporarily accepted.
4.3. Specification
When we analysed the corpus we found certain behaviours that BIRDQUEST could not handle [11]. Some of these errors are easy to handle, e.g. by adding new words. Others involve more coding and also change the system's behaviour more. Both are examples of specification, meaning that we cannot, as in refactoring, expect that the system will have the same external behaviour, nor, as in refinement, assume that we just want to change the corpus to reflect the new behaviour.
Instead, during specification we alternate between batch and interactive testing. Often we start with a fairly small corpus subset and incrementally develop the new behaviour using LINTest in batch mode. Once the behaviour is near completion, or we for some other reason need to make sure that the system still handles previous dialogues in the corpus, we use LINTest on a larger portion of the corpus (or the whole corpus) in interactive mode to update the corpus with the new feature.
Figure 4: Interactive editing in LINTest
LINTest includes features to make a large corpus more tractable. It is possible to partition the corpus, using only those parts of the corpora that are relevant for the current modification to the dialogue system. Partitions are especially useful when the corpus is large and we are adding new behaviour to the dialogue system, as we then need a faster code-build-test loop. It is also possible to see a failure in context: pressing the View button in the interactive editor brings up a dialogue viewer, see Figure 4.
5. Summary
In this paper we presented LINTest, a tool for testing dialogue systems from corpus data. LINTest originates from a need to support language engineering activities and is available for download as open source. LINTest facilitates incremental development of dialogue systems and can be used either in batch mode or in a GUI allowing for efficient corpus update.
6. Acknowledgements
This research was supported by Santa Anna IT Research Institute
AB.
7. References
[1] Hamish Cunningham, “A Definition and Short History of
Language Engineering,” Natural Language Engineering,
vol. 5, no. 1, pp. 1–16, 1999.
[2] Kent Beck, Extreme Programming Explained, Addison-
Wesley, 2000.
[3] Lars Degerstedt and Arne Jönsson, "A Method for Systematic Implementation of Dialogue Management," in Workshop notes from the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, WA, 2001.
[4] Joris Hulstijn, Dialogue Models for Inquiry and Transaction,
Ph.D. thesis, Universiteit Twente, 2000.
[5] M. M. Lehman and J. F. Ramil, "An Approach to a Theory of Software Evolution," in Proc. of the 4th Int. Workshop on Principles of Software Evolution, Vienna, Austria, September 2001.
[6] Alistair Cockburn and Jim Highsmith, "Agile software development: The people factor," IEEE Computer, pp. 131–133, November 2001.
[7] Andrew Hunt and David Thomas, The Pragmatic Program-
mer, Addison-Wesley, 2000.
[8] Jim Highsmith, Adaptive Software Development, Addison-
Wesley, 1998.
[9] Martin Fowler, Refactoring: Improving the Design of Exist-
ing Code, Addison-Wesley Object Technology Series, 2000.
[10] Arne Jönsson, Frida Andén, Lars Degerstedt, Annika Flycht-Eriksson, Magnus Merkel, and Sara Norberg, "Experiences from Combining Dialogue System Development with Information Access Techniques," in New Directions in Question Answering, Mark T. Maybury, Ed. AAAI/MIT Press, 2004.
[11] Annika Flycht-Eriksson and Arne Jönsson, "Some Empirical Findings on Dialogue Management and Domain Ontologies in Dialogue Systems – Implications from an Evaluation of BirdQuest," in Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, Sapporo, Japan, 2003.