Proceedings of the 1st Workshop on NLP for Positive Impact, pages 135–142
Bangkok, Thailand (online), August 5, 2021. ©2021 Association for Computational Linguistics
A Speech-enabled Fixed-phrase Translator for Healthcare Accessibility
Pierrette Bouillon1, Johanna Gerlach1, Jonathan Mutal1, Nikos Tsourakis1, and
Hervé Spechbach2
1FTI/TIM, University of Geneva, Switzerland
2Hôpitaux Universitaires de Genève (HUG), Switzerland
{Pierrette.Bouillon, Johanna.Gerlach, Jonathan.Mutal,
In this overview article we describe an appli-
cation designed to enable communication be-
tween health practitioners and patients who do
not share a common language, in situations
where professional interpreters are not avail-
able. Built on the principle of a fixed phrase
translator, the application implements differ-
ent natural language processing (NLP) tech-
nologies, such as speech recognition, neural
machine translation and text-to-speech to im-
prove usability. Its design allows easy porta-
bility to new domains and integration of dif-
ferent types of output for multiple target au-
diences. Even though BabelDr is far from
solving the problem of miscommunication be-
tween patients and doctors, it is a clear ex-
ample of NLP in a real world application de-
signed to help minority groups to communi-
cate in a medical context. It also gives some
insights into the relevant criteria for the devel-
opment of such an application.
1 Motivation
Access to healthcare is an important component
of quality of life, but it is often compromised by
the language barrier which prevents effective com-
munication. In hospitals, medical staff are increas-
ingly confronted with patients with whom they do
not share a common language. Lack of clear com-
munication can lead to increased risk for patients
(Flores et al.,2003) but also discourages vulnera-
ble groups from seeking medical assistance. When
professional interpreters are not easily available,
for example in emergency situations, there is a cru-
cial need for tools to overcome the language bar-
rier in order to provide medical care. While many
generic translation solutions are available on the
web, they present numerous disadvantages, includ-
ing the unreliability of machine translation (Bouil-
lon et al.,2017), the insufficient data confidential-
ity of cloud services or the absence of resources
for minority languages. To overcome these issues,
specifically designed tools based on a limited set
of pre-translated sentences have been developed.
These phraselators (Seligman and Dillinger,2013)
have the advantage of portability, accuracy and
reliability. Although these tools have limited cov-
erage, and do not solve all communication issues,
recent studies have shown that they are generally
preferred to machine translation as they are per-
ceived as more reliable and trustworthy in these
safety critical contexts (Panayiotou et al.,2019;
Turner et al.,2019).
This paper aims to provide an overview of the
NLP components included in the speech-enabled
phraselator called BabelDr. In Section 2 we will
give an overview of BabelDr usage. We then ex-
plain the artificial training data derived from the
grammar to specialise the different components in
Section 3. In Sections 4, 5, 6, 7 and 8 we explain
BabelDr’s components in detail, as well as the pos-
sible outputs available to users. We then present
several usage studies with target groups in Section
9.1, report on the performance of the whole system
in Section 9.2 and conclude in Section 10.
2 BabelDr
BabelDr is a joint project between the Faculty of
Translation and Interpreting of the University of
Geneva and Geneva University Hospitals (HUG)
(Bouillon et al., 2017). The aim of the project is to
develop a speech to speech translation system for
emergency settings which meets three criteria: reli-
ability, data security and portability to low-resource
target languages relevant for HUG. It is designed
to allow French-speaking medical practitioners to
carry out triage and diagnostic interviews with pa-
tients speaking Albanian, Arabic, Dari, Farsi, Span-
ish, Swiss French sign language and Tigrinya.
1More information available at
Figure 1: Overview of BabelDr usage
BabelDr is a web application designed to func-
tion on desktops and mobiles. Built on the prin-
ciple of a phraselator, it relies on a limited set
of pre-translated sentences, hereafter called core-
sentences, collected with doctors. For improved
usability and more natural interaction with the pa-
tient, it includes a speech recognition component:
instead of searching for utterances in menus, medi-
cal staff can speak freely and the system will map
the spoken utterances to the closest pre-translated
core-sentence. This sentence is then presented for
validation, in a backtranslation step, ensuring that
the doctor knows exactly what is being translated
for the patient. The patient can then respond by
means of a pictogram-based interface. All com-
ponents can be deployed on a local server with
no dependency on cloud services, thus ensuring
the data confidentiality that is essential for medical
applications. Figure 1 illustrates the usage of BabelDr.
3 Training data and grammars
Due to confidentiality issues, training data for spo-
ken French medical dialogues is scarce. For this
reason, the first version of the system was built
around a manually defined Synchronous Context
Free Grammar (SCFG, Aho and Ullman,1969),
used for grammar-based speech recognition and
parsing (Rayner et al.,2017). This grammar is
now leveraged to generate artificial data used both
for backtranslation (Section 5) and for specialising
speech recognition (Section 4).
The grammar maps source variation patterns,
described in a formalism similar to regular ex-
pressions, to core-sentences. Due to the repetitive
nature of the content, the grammars make use of
compositional sentences to make resources more
compact. These sentences contain one or more
variables, which are replaced by different values at
system compile time. Figure 2 gives an example of
a compositional utterance rule.
The current version of the grammar includes
2629 utterance rules, organised by medical domain,
which expand to 10’991 core-sentences once vari-
ables are replaced by values. These core-sentences
are mapped to hundreds of millions of surface sentences.
Figure 3 shows an example of the aligned
core-sentence - variation corpus that can be gener-
ated from the grammar.
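As a rough illustration of how a compositional rule expands, the sketch below substitutes variable values into a core-sentence template and into its source variation patterns, producing an aligned variation / core-sentence corpus. The rule format, the `$body_part` variable name and the example sentences are assumptions made for illustration (English stands in for French); this is not the formalism actually used in the BabelDr grammar.

```python
from itertools import product

# Hypothetical compositional rule: one core-sentence template plus a few
# source variation templates, all sharing the variable $body_part.
RULE = {
    "core": "do you have pain in $body_part?",
    "variations": [
        "do you have pain in $body_part",
        "is there any pain in $body_part",
        "are you feeling pain in $body_part",
    ],
}

# Hypothetical variable values, substituted when the rule is compiled.
VALUES = {"$body_part": ["your stomach", "your head", "your back"]}


def expand(rule, values):
    """Return (variation, core_sentence) pairs for every variable binding."""
    names = sorted(name for name in values if name in rule["core"])
    pairs = []
    for binding in product(*(values[name] for name in names)):
        def fill(template):
            # Replace each variable with the value chosen in this binding.
            for name, value in zip(names, binding):
                template = template.replace(name, value)
            return template
        core = fill(rule["core"])
        pairs.extend((fill(variation), core) for variation in rule["variations"])
    return pairs


aligned = expand(RULE, VALUES)
core_sentences = sorted({core for _, core in aligned})
print(len(core_sentences), "core-sentences,", len(aligned), "aligned pairs")
```

With one rule, three values and three variation patterns, the toy grammar yields three core-sentences and nine aligned pairs; the same multiplication is what lets a few thousand rules cover a very large number of surface sentences.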
4 Speech-to-Text
To ensure both accuracy and usability, the system
uses a hybrid approach for speech recognition, com-
bining two recognisers. The first is a grammar
based speech recogniser using GRXMLs generated
from the original SCFG (see Section 3). While
this is fast and accurate, since it directly yields
a core-sentence, it is unable to handle utterances
that are out of grammar coverage. It is therefore
complemented by a large-vocabulary recogniser
specialised with the monolingual artificial corpus
described in Section 3. The results of the two ap-
proaches are combined based on the confidence
score provided by the grammar based recogniser:
if the score is over a pre-defined threshold, this
result is kept, else the system falls back on the
large-vocabulary result. Performance is evaluated
in terms of word error rate (WER), which is 38.9%
for the GRXML grammar-based recogniser and 14.4%
for the large-vocabulary recogniser (see Table 2).
Both figures were measured on the data set from the
user study described in Bouillon et al. (2017).
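The combination rule described above can be sketched as follows. This is a minimal illustration, not the system's code: the `GrammarResult` type, the function name and the threshold value of 0.6 are all assumptions, since the text does not give the actual threshold or interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GrammarResult:
    core_sentence: Optional[str]  # None when the utterance is out of coverage
    confidence: float             # assumed to lie in [0, 1]


# Hypothetical value; the threshold used in BabelDr is not given in the text.
CONFIDENCE_THRESHOLD = 0.6


def choose_result(grammar: GrammarResult, large_vocab_text: str,
                  threshold: float = CONFIDENCE_THRESHOLD):
    """Keep the grammar-based result if it is confident enough, else fall back."""
    if grammar.core_sentence is not None and grammar.confidence > threshold:
        return ("grammar", grammar.core_sentence)
    # The large-vocabulary transcription still has to be backtranslated.
    return ("large_vocabulary", large_vocab_text)


confident = choose_result(GrammarResult("do you have a fever?", 0.91),
                          "do you have fever")
fallback = choose_result(GrammarResult("do you have a fever?", 0.32),
                         "have you been coughing since yesterday")
print(confident)
print(fallback)
```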
Figure 2: Example of a grammar rule
Figure 3: Example of the aligned corpus generated from the grammar: core-sentences with corresponding source
For the GRXML recogniser we use the Nuance
ASR v10 and the Nuance Transcription Engine
4 for the large-vocabulary one. Both can be ac-
cessed over the network through our custom API
using HTTP POST requests. Recognition is file-based
and has proved to work well for real-time
interaction. The distributed nature of our back-end
platform permits easy scaling and load balancing,
so that multiple users can interact simultaneously
with the recognisers. For the GRXML recogniser in
particular, we can load and compile grammars on the fly
or change the recogniser parameters dynamically.
We can also parse any text against a specific
grammar using an HTTP request.
5 Backtranslation
The backtranslation (introduced in Section 2) is an
essential step in BabelDr since it maps the speech
recognition result to a core-sentence that is pre-
sented to the doctor for validation. For the GRXML
recogniser, backtranslation is performed directly
by the grammar. For the large vocabulary recog-
niser, as the set of core-sentences is limited (see
Section 3), the backtranslation task can be seen
as a sentence classification task where the
core-sentences are the categories, or as a translation task
into a controlled language. As training data, we
use the bilingual corpus generated from the grammar.
Rayner et al. (2017) introduced an approach
based on tf-idf indexing and dynamic programming (DP),
which achieved 91.8% accuracy (assuming perfect
speech recognition and considering the 1-best result).
Mutal et al. (2019) then applied deep learning
methods, namely neural machine translation (NMT)
and sentence classification, achieving 93.2% accuracy
(see Table 2) on core-sentence matching for
transcriptions (again assuming perfect speech
recognition) and improving on the previous approach.
The NMT approach is currently used in BabelDr.
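To make the "classification over core-sentences" view concrete, the sketch below ranks core-sentences by tf-idf cosine similarity between a transcription and the source variations of a toy aligned corpus. It is a deliberately simplified stand-in: it is neither the tf-idf and DP method of Rayner et al. (2017) nor the NMT model used in BabelDr, and the corpus, function names and weighting scheme are assumptions.

```python
import math
from collections import Counter

# Toy aligned corpus: (source variation, core-sentence). Invented examples.
ALIGNED = [
    ("do you have a fever", "do you have a fever?"),
    ("have you got a temperature", "do you have a fever?"),
    ("does your head hurt", "do you have pain in your head?"),
    ("do you have a headache", "do you have pain in your head?"),
    ("are you coughing", "do you have a cough?"),
    ("have you been coughing", "do you have a cough?"),
]


def tokenize(text):
    return text.lower().split()


# Inverse document frequency computed over the source variations.
DOCS = [tokenize(variation) for variation, _ in ALIGNED]
DF = Counter(token for doc in DOCS for token in set(doc))
IDF = {token: math.log(len(DOCS) / count) + 1.0 for token, count in DF.items()}


def vectorize(tokens):
    """Sparse tf-idf vector; tokens unseen in the corpus are ignored."""
    counts = Counter(token for token in tokens if token in IDF)
    return {token: count * IDF[token] for token, count in counts.items()}


def cosine(a, b):
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


VECTORS = [vectorize(doc) for doc in DOCS]


def backtranslate(utterance, n_best=1):
    """Return up to n_best distinct core-sentences, ranked by similarity."""
    query = vectorize(tokenize(utterance))
    scored = sorted(
        ((cosine(query, vec), core) for vec, (_, core) in zip(VECTORS, ALIGNED)),
        key=lambda item: item[0], reverse=True)
    ranked = []
    for _, core in scored:
        if core not in ranked:
            ranked.append(core)
    return ranked[:n_best]


print(backtranslate("have you been coughing a lot"))
```

Because the output space is the closed set of core-sentences, any ranking model of this kind can also return an n-best list for the doctor to choose from.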
6 Elliptical Sentences
In dialogues, elliptical utterances are very common,
since they ensure the principle of economy and usu-
ally avoid duplication (Hamza,2019). In BabelDr,
they allow doctors to question patients in a more ef-
ficient way (Tanguy et al.,2011). However, literal
translation of these utterances could affect com-
munication as illustrated in Table 1. In BabelDr,
elliptical utterances are not translated literally, but
are instead mapped to the closest non-elliptical
core-sentence, based on the context.
To avoid wrong backtranslations of elliptical
sentences, context-level information (the previously
accepted utterance) is added to the model: when
an utterance is identified as an ellipsis, it is
concatenated with the previously translated
utterance before backtranslation. In BabelDr,
elliptical utterances are detected using a binary
classifier. The model was trained using handcrafted
features, such as sentence length, absence of verbs
or nouns, part of speech of the first word, and
identification of pronouns that refer to entities
in the context (using morphological features).
On an artificial ellipsis data set, the model achieves
98% accuracy on detecting elliptical sentences and
88% on backtranslating them to a core-sentence
(see Mutal et al., 2020, for details).
Utterance                            Translation
do you have pain in your stomach?    ¿le duele el estómago?
in your head?                        *¿en tu cabeza?
Good translation: ¿Le duele la cabeza?
Table 1: Example of a bad translation of ellipsis. The * marks a bad translation.
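The detect-then-concatenate step can be sketched as below. The hand-written rule is only a stand-in for the trained binary classifier, the word lists replace a real part-of-speech tagger, and the `<ctx>` separator is a hypothetical marker; none of these details come from the system itself.

```python
# Toy lexicons standing in for a part-of-speech tagger (assumption).
VERBS = {"do", "does", "have", "has", "is", "are", "hurt", "hurts", "feel"}
PREPOSITIONS = {"in", "on", "at", "since", "for", "with"}
SEPARATOR = " <ctx> "  # hypothetical marker between context and utterance


def features(utterance):
    """A few of the hand-crafted features mentioned in the text."""
    tokens = utterance.lower().rstrip("?").split()
    return {
        "length": len(tokens),
        "has_verb": any(token in VERBS for token in tokens),
        "starts_with_preposition": bool(tokens) and tokens[0] in PREPOSITIONS,
    }


def is_elliptical(utterance):
    """Rule-based stand-in for the trained binary classifier."""
    f = features(utterance)
    return (not f["has_verb"]) and (f["length"] <= 4 or f["starts_with_preposition"])


def backtranslation_input(utterance, previous_accepted):
    """Prepend the previously accepted utterance when an ellipsis is detected."""
    if previous_accepted and is_elliptical(utterance):
        return previous_accepted + SEPARATOR + utterance
    return utterance


context = "do you have pain in your stomach?"
print(backtranslation_input("in your head?", context))
print(backtranslation_input("do you have a fever?", context))
```

The first call returns the context joined to the elliptical question, so that a backtranslation model can resolve it to a full core-sentence; the second returns the complete question unchanged.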
7 Output
After validation of the backtranslation, BabelDr
presents the target language output to the patient in
written and spoken form, which are both based on
the same human translations of the core-sentences.
In the following sections we first outline the trans-
lation approach and then describe how the trans-
lations are rendered for the patient, in audio (for
spoken languages) or video format (for sign languages).
7.1 Translation
High translation quality is essential for a medi-
cal phraselator, therefore the translations are pro-
duced by professional translators. Translating for
BabelDr presents technical challenges, since lan-
guage resources must be in a specific structured
data format not easily accessible to translators. An
online translation platform which includes a trans-
lation memory and allows translators to efficiently
handle the compositional items was developed to
facilitate the translators’ task and ensure the quality
and coherence of the translations (Gerlach et al.,
The translations are aimed at patients with no
medical knowledge and designed to be understand-
able by patients with a low level of literacy. Sen-
tences were also adapted to account for cultural
aspects, such as sensitive or intimate topics that are
not commonly discussed, related for example to
sexual habits (Halimi et al.,2020). Since the sys-
tem provides translations both in written and spo-
ken form, the translators had to choose phrasings
that would function in both. A recent evaluation of
the translations for two of the system’s target lan-
guages (Albanian and Arabic) has shown that these
translations are easy to understand, and thereby
make the system more trustworthy in comparison
to MT (in publication, Gerlach et al.,2021).
Ongoing developments include the extension of
the system to new target languages and modalities
to make the system accessible to further popula-
tion groups. One addition involves translation to
pictographs targeted at people with intellectual dis-
abilities, another is translation into easy language,
beginning with Simple English.
7.2 Text-to-Speech
Audio has been an important output modality for
the BabelDr system, as it offers several advantages
for patients. It relieves them of the need to look
at the screen, which can be difficult in a medical
setting, e.g. because of the positioning of
the physician and patient. For illiterate users in
particular, it is an essential component, and having a
system that speaks their own language can improve
user experience. While it would be possible to have
a human record all the pre-translated sentences, due
to the number and repetitive nature of the sentences,
the time and cost involved in recording were con-
sidered too high. The option of a Text-to-Speech
(TTS) system was therefore adopted from the be-
ginning of the project in order to announce the
translated questions of the physician. State-of-the-
art systems like Nuance Vocalizer are now part
of our content creation pipeline for crafting the
Systems of this kind, however, lack support for
low-resource languages that the BabelDr system
also targets. For this reason, we have investigated
the option of building our own TTS for those
languages from scratch. In a previous study, positive
feedback in terms of comprehensibility was
received (Tsourakis et al., 2020) after building a
synthetic female voice for the Albanian language
based on Tacotron 2, a neural network architecture
for speech synthesis directly from text (Shen et al.,
2017). Among the target languages supported by
BabelDr, Tigrinya is one for which no public TTS
is available.
Figure 4: Doctor and patient interfaces
Task               Model              Metric     Result
Speech to Text     GRXML              WER        38.9%
Speech to Text     Large Vocabulary   WER        14.4%
Back Translation   NMT                Accuracy   93.2%
Overall (3-best)                      SER        5%
Table 2: Performance by component and overall
For this reason, a female voice talent was re-
cruited to record all the prompts that were subse-
quently used in the online system. This allowed us
to create a corpus with 18 hours of speech that we
exploit in order to create the Tigrinya synthesized
voice. The training process is similar to the one
found in (Tsourakis et al.,2020). As new content
is constantly added to the system, new recordings
of the translations are requested. For these, we first
generate the output with the TTS and ask the voice
talent to listen to the prompts. If the result is
acceptable, the TTS version is kept; otherwise, a human
recording is necessary. In a set of 2150 prompts,
the voice talent had to record 573 files (26.7%).
7.3 French Sign Language
Establishing effective and reliable communication
between a doctor and a deaf patient is a compli-
cated task. The scarcity of professional interpreters
and the lack of awareness of medical staff for
deaf culture severely impedes communication. To
create sign language output for our fixed-phrase
translator, we have investigated two different ap-
proaches: recorded human signers and an avatar
(using JASigning, Glauert and Elliott,2011). An
evaluation carried out with the deaf community
showed that the recorded human signers are supe-
rior in terms of understandability and acceptability,
but it was found that the avatar could be useful in
this context (in print, Bouillon et al., 2021). The
videos were recorded by a sign language
interpreter in collaboration with a deaf nurse, and
are freely accessible in the online system, providing
a human translation reference in sign language for
medical questions. These resources present oppor-
tunities to evaluate what affects the communication
task with deaf people in this specialised context.
8 Patient response interface
The original BabelDr system was limited to yes-no
questions or questions where the patient could re-
spond non-verbally, for example by pointing at a
body part. This restrictive approach was problem-
atic both for doctors, who are used to asking open
questions, and for patients who had little means
to actively contribute to the direction of the dia-
logue. To build a bidirectional version that would
allow more complex responses from the patient,
we considered different options. Building a system
that would allow patients to respond with speech
presents numerous difficulties. No speech recog-
nisers exist for many of the minority languages
targeted by our system, and few or no resources
such as speech corpora are available to build such
systems. A text interface, as found in traditional
phraselators, while easier to implement, would not
be accessible to patients with low literacy.
Additionally, in the context of a fixed-phrase translator,
some user training is necessary to become familiar
with system coverage, which is not possible for
patients who arrive at an emergency service. For
these reasons, we chose to add a simple pictograph
based response interface, shown in Figure 4. Each
core-sentence is linked to a set of corresponding re-
sponse pictographs among which the patient can se-
lect their response. Evaluation of these pictographs
in terms of understandability and acceptability by
patients of different educational and cultural
backgrounds is ongoing (Norré et al., 2020). A
task-based evaluation showed that all patients preferred
the bidirectional version since they could explain
their symptoms more efficiently.
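The link between core-sentences and response pictographs can be pictured as a simple lookup, sketched below. The questions, pictograph identifiers and default set are invented for illustration and do not reflect the actual BabelDr resources.

```python
# Hypothetical data: core-sentences linked to response pictograph identifiers.
RESPONSES = {
    "do you have a fever?": ["yes", "no", "dont_know"],
    "where does it hurt?": ["head", "chest", "stomach", "back", "dont_know"],
    "how long have you had the pain?": ["hours", "days", "weeks", "dont_know"],
}
# Assumed fallback for core-sentences without a dedicated pictograph set.
DEFAULT_RESPONSES = ["yes", "no", "dont_know"]


def pictographs_for(core_sentence):
    """Pictographs offered to the patient for a validated core-sentence."""
    return RESPONSES.get(core_sentence, DEFAULT_RESPONSES)


def record_answer(core_sentence, selected):
    """Accept only a pictograph that was actually offered for this question."""
    if selected not in pictographs_for(core_sentence):
        raise ValueError(f"{selected!r} is not a valid response here")
    return {"question": core_sentence, "answer": selected}


print(pictographs_for("where does it hurt?"))
print(record_answer("where does it hurt?", "stomach"))
```

Keeping the response set closed per question means the patient's answer needs no speech recognition or free-text input.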
9 Evaluation
9.1 Task based
A translation system for the healthcare domain
should be evaluated on the task it is designed to as-
sist, which in the case of BabelDr is the diagnostic
interview. To this end, we carried out several usage
studies. In a preliminary study, we asked four med-
ical students and five doctors to diagnose two stan-
dardised Arabic speaking patients, using BabelDr
and Google Translate (GT). Results showed that in
comparison to the generic machine translation tool,
BabelDr provides higher-quality translations and
led to a higher number of correct diagnoses (8/9
for BabelDr against 5/9 for GT), in particular with
medical students (Bouillon et al., 2017). A subsequent
crossover study, in which 12 French-speaking
doctors were asked to diagnose two Arabic-speaking
standardised patients using BabelDr, confirmed
that the application allows doctors to reach accu-
rate and reliable diagnoses (24/24 correct). It was
agreed among participating medical professionals
that BabelDr could be used in their everyday medi-
cal practice (Spechbach et al.,2019).
The system is currently in use at the HUG outpa-
tient emergency unit and a user satisfaction study
is ongoing to collect patients’ and doctors’ feed-
back on system usage in real emergency settings by
means of questionnaires (Janakiram et al.,2020).
The study includes only patients with no under-
standing of French and no common language with
the doctor. Overall, 90% of the 30 patients included
so far reported a positive level of satisfaction,
as did 87% of the doctors.
9.2 System performance
To evaluate the performance of the current version
of the complete system, we have used the spoken
data set collected during the usage study described
above (Spechbach et al.,2019). Since the system
relies on human pre-translation, it is sufficient to
evaluate the output in terms of backtranslation, as a
correct core-sentence will result in a correct transla-
tion for the patient. We measured the performance
using sentence error rate (SER), which is defined
as the percentage of core-sentences that are not
identical to the annotated correct core-sentences.
Since the system interface presents a selection of
core-sentences to the doctor, for this evaluation we
considered 3-best backtranslation results, including
the GRXML result when it was above the confi-
dence threshold and two or three backtranslations
of large vocabulary recogniser results. With this
configuration, the system achieved 5% SER on this
data set.
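The n-best SER computation can be sketched as follows, under the assumption that an utterance counts as an error when the annotated core-sentence does not appear among the top n candidates presented to the doctor. The function name and the toy data are invented for illustration.

```python
def sentence_error_rate(n_best_lists, references, n=3):
    """Percentage of utterances whose reference is absent from the top n."""
    if len(n_best_lists) != len(references):
        raise ValueError("one n-best list is needed per reference")
    errors = sum(
        reference not in candidates[:n]
        for candidates, reference in zip(n_best_lists, references))
    return 100.0 * errors / len(references)


# Invented toy data: four utterances, each with ranked core-sentence candidates.
hypotheses = [
    ["do you have a fever?", "do you have a cough?"],
    ["do you have a cough?", "do you have a fever?", "are you dizzy?"],
    ["are you dizzy?", "do you have a fever?", "do you have a cough?",
     "do you have pain in your head?"],
    ["do you have a fever?"],
]
gold = [
    "do you have a fever?",
    "do you have a fever?",
    "do you have pain in your head?",
    "are you dizzy?",
]

print(sentence_error_rate(hypotheses, gold, n=1))
print(sentence_error_rate(hypotheses, gold, n=3))
```

On this toy data the error rate drops from 75.0 with the single best candidate to 50.0 with three, which shows why presenting several candidates to the doctor lowers the effective error rate.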
10 Conclusion
Healthcare translation is required to facilitate the
engagement with people with diverse language, cul-
tural, and literacy backgrounds. The development
of culturally effective and patient-oriented trans-
lation tools has become increasingly urgent. Al-
though BabelDr is far from solving the problem of
miscommunication, it is an example of a concrete
application of natural language processing to help
minority groups communicate in a medical context.
The developed tool, resources and evaluations
are a first step toward accessible healthcare apps.
This research is essential to define criteria which
can be used in the development and evaluation of
new medical interpreting technologies with a view
to enhancing the usability among patients from
refugee, migrant, or other socioeconomically dis-
advantaged populations.
This project was supported by the "Fondation
Privée des Hôpitaux Universitaires de Genève".
We would also like to thank Nuance Inc for gen-
erously making their software available to us for
research purposes.
