Building a Phonics Engine for
Automated Text Guidance
Dominik Lukeš
Research, Development and Policy,
Dyslexia Action, United Kingdom
dlukes@dyslexiaaction.org.uk
Chris Litsas
National Technical University of Athens,
Greece
chlitsas@central.ntua.gr
Abstract— This paper describes the results of work done to
develop an automated guidance system for readers with dyslexia
and related literacy difficulties. The system was developed to
address the specific needs of dyslexic readers and teachers
providing support to them in the analysis of words into their
phonic components. This includes division into syllables both in
spoken and written form, mapping of graphemes to phonemes and
highlighting of common spelling patterns. Despite the large
volume of NLP resources available, we could identify no system
that provides this level of analysis. Such a system is necessary to
advance the field of computer-assisted literacy support, for
instance by developing open-world games not limited to pre-set
lists of words or a reader application that can highlight the
phonic features of any word. In order to provide such guidance we
designed an annotated dictionary, a list of phonic features with
associated look-up routines, and dictionary annotation algorithms.
The resulting phonics engine was then used to power a reader
application and a phonic games system developed as part of
iLearnRW. We make recommendations for further directions in
research and development.
The research and development described in this paper was
conducted as part of the EU FP7 ICT project iLearnRW -
Integrated Intelligent Learning Environment for Reading and
Writing (318803).
Keywords—natural language processing, computer-aided
instruction, phonics, mobile technology, dyslexia, readability
INTRODUCTORY NOTE
The iLearnRW project, which sponsored the work
reported in this paper, developed all tools for two languages:
English and Greek. However, the vast majority of the work
done was on English due to its peculiar level of complexity
in the area of phonics and orthography. Therefore, unless
specified, the issues discussed pertain to English. Greek is
mentioned only as relevant.
INTRODUCTION
Literacy interventions that are offered to readers with
dyslexia and related linguistic or cognitive impairments
require highly specialised knowledge of English phonics on
the part of the teacher. Thus parents, community volunteers
or even teaching assistants employed by schools may not be
relied on to provide the appropriate structured guidance to
the struggling reader (child or adult). Furthermore, due to
the complexities of English phonology and orthography,
even specialist teachers are not always able to apply the
principles to any given word that may present itself in the
course of reading. Yet, there is no automated tool to make
this easier for literacy professionals. Established tools such
as pronunciation dictionaries or online corpora are of
limited utility because they do not combine pronunciation
with sufficient orthographic information. Even machine-readable
data sets developed for Natural Language Processing do not provide
information about word properties like syllable division in
both pronunciation and orthography nor do they map the
concept of vowel or consonant phonemes onto that of vowel
or consonant letters. For instance, this makes it difficult to
automate tasks such as:
Finding all examples of ‘a’ spelled to rhyme with ‘hay’
in a text or a corpus.
Sorting words by their phoneme/grapheme ratio.
Identifying appropriate syllable boundaries in the
written form of a multi-syllable word based on
knowledge of the syllable boundaries in pronunciation.
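The second of these tasks illustrates what the engine needs: once a dictionary supplies per-word phoneme and letter counts, sorting by their ratio is straightforward. A minimal sketch, taking the ratio as phonemes to letters (the Entry record and the counts below are illustrative, not part of the system described here):

```java
import java.util.Comparator;
import java.util.List;

public class RatioSort {
    // Hypothetical record pairing a word with counts taken from an
    // annotated dictionary entry (names are illustrative).
    record Entry(String word, int phonemes, int letters) {
        double ratio() { return (double) phonemes / letters; }
    }

    // Sort entries by phoneme/letter ratio, ascending: words where many
    // letters collapse into few sounds sort before one-to-one words.
    static List<String> byRatio(List<Entry> entries) {
        return entries.stream()
                .sorted(Comparator.comparingDouble(Entry::ratio))
                .map(Entry::word)
                .toList();
    }

    public static void main(String[] args) {
        List<Entry> sample = List.of(
                new Entry("cat", 3, 3),      // c-k, a-æ, t-t
                new Entry("through", 3, 7),  // th-θ, r-r, ough-uː
                new Entry("ship", 3, 4));    // sh-ʃ, i-ɪ, p-p
        System.out.println(byRatio(sample)); // lowest ratio first
    }
}
```

A teacher could use such a sort to grade word lists, since words like 'through', where seven letters spell three sounds, surface first.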
As a result, there are no automated tools on the market
to help teachers identify potentially difficult words in a text
or classify texts by their phonic difficulties. All current text
classification measures are also reduced to finding proxies
for word difficulty such as length or the number of syllables.
While this may have produced largely accurate predictions
at the population level, we know little about the accuracy
of text complexity measures on readers with phonological
deficits such as dyslexia. (See [1] for an overview of the
issues.)
Another consequence is that most phonics instructional
software is limited to a rather small set of example words
limiting the opportunities for practice. Furthermore, these
words are generally selected to be appropriate to readers at a
very young age and therefore such software is less appealing
to older learners who are discouraged by the presence of
words they perceive to be childlike [2].
BACKGROUND ON NEEDS OF STRUGGLING READERS
Dyslexia is described as a “learning difficulty that
affects accurate and fluent word reading and spelling” [3].
The reading difficulty appears regardless of general
intelligence and even with high quality typical instruction.
However, dyslexic readers do respond to a targeted
intensive intervention that is structured and cumulative.
They also respond well to consistent feedback and
overlearning. While many aspects of dyslexia intervention
work best when performed by a specialist teacher, there are
many others that could be aided by automated tools (in
particular overlearning and consistent feedback). Even
computer-aided instruction has been shown to work best
when accompanied by an instructor [4]. However, even
instructors should be able to benefit from automated
guidance about the features of a specific text or about a
wider range of words to practise specific phonic features
with.
In particular, an automated system could assist dyslexic
readers with:
Identifying the syllables in a word
Recognising the structure of words (stem, prefix,
suffix)
Highlighting typical or repeated patterns of English
orthography
Identifying phoneme/grapheme correspondence
Learning the pronunciation of a word
Learning the meaning of a word
In a focus group session with 8 specialist teachers
working for Dyslexia Action we asked each teacher to write
down the top issues they felt they had to address the most
during their literacy intervention sessions with dyslexic
students. We then asked them to prioritise these as a group.
The collated list that resulted from this exercise revealed
that the teachers’ top priorities when working with students
were the following:
1. Syllable division
2. Vowels - Length
3. Understanding words from context
4. Affixing
5. Grapheme / phoneme correspondence
6. Letter / word patterns (Identify subpatterns of
words)
7. Irregular/sight words
8. Letter similarity
9. Blending/word attack
Of these, 1, 2, 4, 5, 6, 7, and 8 are obvious candidates for
computer-aided instruction. Yet, there are no tools that can
automate these tasks for students (or novice teachers) when
faced with a new word or a new text.
This list is consonant with the typical issues faced by
readers with dyslexia as described in the research literature.
It also mirrors difficulties of English orthography outlined
below.
BACKGROUND ON LINGUISTIC ANALYSIS
English orthography is peculiarly complex when
compared to that of any other European language. While as
much as 80% of spellings are subject to regular patterns [5],
the large number of these patterns makes it often easier for
learners to learn regularities as exceptions when they apply
to only to a few words or words that only appear
infrequently. Furthermore, even regular patterns often have
frequent alternatives which have to be learned for each
word. As a result, even potentially straightforward patterns
may present difficulties for learners. Things are even more
difficult for the construction of an automated guidance tool
because there is no definitive list of patterns or rules for
their application to individual words. To produce accurate
results, a dictionary linking orthography to pronunciation is
necessary.
However, the problems do not stop there. During the
construction of such a dictionary, decisions have to be made
as to its annotation between equally good options that may
each be appropriate in different contexts. For instance, let’s
take syllable division in English. While teachers identified
syllable division as their top concern during their
intervention (see above), there are three legitimate ways of
determining syllable boundaries: two in pronunciation and
one in orthography. Thus the word ‘hospital’ can be divided
as ‘hos.pit.al’ or ‘hosp.i.tal’. To make matters worse, the two
leading pronunciation dictionaries of English ([6] and [7])
each take a different route. And what’s more, in phonics
teaching, syllables are divided in the orthographic form of
the word leading to a division like ‘kit.ten’ although only
one ‘t’ sound is pronounced. Item 2 on the teachers’ list, viz.
vowel length, is equally controversial. While a phonetician
would consider ‘a’ in ‘mate’ a diphthong, a phonics teacher
will call it a long vowel counterpart of the short vowel ‘a’ in
‘mat’. How is an automated system to respond to a
command ‘highlight all long vowels in the text’ without
knowing the background of who is asking? Linguists, who
are the primary driving forces behind Natural Language
Processing systems, are generally unaware of these
difficulties and therefore do not develop tools that would
address the needs of phonics teaching. Phonics teachers, in
turn, have limited training in linguistics and are not aware of
tools such as corpora that could spur them into asking for
assistance.
Greek, on the other hand, has a transparent orthography
for reading. Discounting exceptions, the pronunciation of
letters or combinations of letters can always be inferred
from context. Greek difficulties come with spelling where,
for instance, the phoneme /i/ can be spelled in five
different ways which appear arbitrary to the writer (although
they can often be inferred when one is aware of the
etymology). This makes the automated processing of Greek
orthography a relatively straightforward affair.
PHONICS ENGINE SPECIFICATION
The iLearnRW project set out to integrate serious games
and a simple reader app to engage young readers with
dyslexia and provide them with sufficient guidance to be
able to use the system on a tablet with minimal teacher
intervention. From the start, it was faced with the problem
of dealing with the complexities and idiosyncrasies of
English and Greek orthographies. The project was presented
with the issue of providing guidance on an open-ended list
words as well as pre-processing these words for the game
engine to present to the players in various game contexts.
This made it necessary to develop a system for automated
processing of phonics information of any word that can be
found in a text encountered by a reader in the demographic
targeted by the project (age 9-11).
The main aims of this phonics engine coming from the
requirements of the iLearnRW project were:
provide automated guidance to students and teachers
reading texts (using highlighting as well as explicit
information)
generate more extensive word lists for practice
activities within the serious games
provide information about word structure to the game
engine
DETAILS OF PHONICS ENGINE IMPLEMENTATION
Building an English phonic profile
In order to identify phonic features that may be the
object of automated support, we extracted 413 items of
instruction from several interventions used by the teachers
we surveyed and supplemented them by other features.
These were divided into categories (the number next to each
category indicates the number of items):
1. Consonants (49)
2. Vowels (71)
3. Blends and letter patterns (131)
4. Syllables (13)
5. Suffixes (92)
6. Prefixes (42)
7. Confusing letters (15)
The record for each item took the following form
expressed in JSON:
{
  "descriptions": ["a-æ"],
  "problemType": "LETTER_EQUALS_PHONEME",
  "humanReadableDescription": "a=æ (at) <> Pronounce a as æ. For example: at, as, and",
  "cluster": 3,
  "character": "Short vowel"
}
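An entry of this form can be mirrored by a plain Java class whose field names follow the JSON keys, so a JSON library could populate it directly. A minimal sketch (this class is illustrative, not the project's actual code):

```java
import java.util.List;

public class ProfileEntry {
    // Field names follow the JSON keys of the profile record above.
    final List<String> descriptions;
    final String problemType;
    final String humanReadableDescription;
    final int cluster;
    final String character;

    ProfileEntry(List<String> descriptions, String problemType,
                 String humanReadableDescription, int cluster,
                 String character) {
        this.descriptions = descriptions;
        this.problemType = problemType;
        this.humanReadableDescription = humanReadableDescription;
        this.cluster = cluster;
        this.character = character;
    }

    public static void main(String[] args) {
        // The "short vowel a" entry shown above, built in code.
        ProfileEntry shortA = new ProfileEntry(
                List.of("a-æ"), "LETTER_EQUALS_PHONEME",
                "a=æ (at) <> Pronounce a as æ. For example: at, as, and",
                3, "Short vowel");
        System.out.println(shortA.problemType);
    }
}
```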
This method provided us with a fairly complete profile
of English phonics although several omissions were
identified during implementation. See [8].
A similar (much shorter) profile was constructed for
Greek which, due to the nature of Greek orthography, only
contained typical problem areas identified by experts rather
than a complete profile of Greek phonics.
This profile was also used to construct a user
model, with individual severities of difficulty attached to
each entry, representing the phonic ability of individual
users. This could be leveraged by the system as a whole
but was not necessary for the fundamental working of the
phonics engine.
Building a phonic dictionary of English
After investigating different options, including using a
phonological component from an English text-to-speech
engine, it was decided to build an English wordlist. We
started with a list of the 5,000 most frequent words in English
from word frequency data provided by the Corpus of
Contemporary American English and used the open-source
spell-checking tool hunspell to generate their possible
inflectional and derivational forms. Then we used an online
tool to generate pronunciation for each item on our list. This
resulted in over 15,000 lexical forms for most of which we
had pronunciations and syllabification information.
From then on we had to develop algorithms for:
1. Identifying phoneme-grapheme mappings
2. Identifying orthographic syllabification
3. Labelling suffix and prefix types
4. Adding number of letters, phonemes and syllables
The individual entries had the following annotations:

Word form: feelings
Related stem: feeling
Pronunciation: ˈfiː.lɪŋz
Phoneme/Grapheme mapping: f-f,ee-iː,l-l,i-ɪ,ng-ŋ,s-z
Orthographic syllabification: fee.lings
Number of letters: 8
Number of phonemes: 6
Number of syllables: 2
Frequency band: 4
Suffix type: SUFFIX_ADD
Suffix form: s
Prefix type: PREFIX_NONE
Prefix form: NULL

Table 1 Dictionary structure
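Several of the counts in Table 1 can be derived mechanically from the mapping and pronunciation fields. A minimal sketch, assuming the formats shown in Table 1 (comma-separated grapheme-phoneme pairs and dot-separated syllables):

```java
public class PhonicsCounts {
    // In the dictionary's mapping format ("f-f,ee-iː,l-l,i-ɪ,ng-ŋ,s-z"),
    // each comma-separated pair maps one grapheme to one phoneme, so the
    // number of pairs is the phoneme count.
    static int phonemeCount(String mapping) {
        return mapping.split(",").length;
    }

    // The letter count is the sum of grapheme lengths (the part of each
    // pair before the '-').
    static int letterCount(String mapping) {
        int n = 0;
        for (String pair : mapping.split(","))
            n += pair.substring(0, pair.indexOf('-')).length();
        return n;
    }

    // Syllables in the pronunciation field are separated by '.', so the
    // syllable count is the number of dot-separated parts.
    static int syllableCount(String pronunciation) {
        return pronunciation.split("\\.").length;
    }

    public static void main(String[] args) {
        String mapping = "f-f,ee-iː,l-l,i-ɪ,ng-ŋ,s-z"; // 'feelings'
        System.out.println(phonemeCount(mapping));      // 6
        System.out.println(letterCount(mapping));       // 8
        System.out.println(syllableCount("ˈfiː.lɪŋz")); // 2
    }
}
```

The results match the counts recorded for 'feelings' in Table 1, which is why these three annotations could be added algorithmically once the mapping was in place.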
The final form of the dictionary required significant
work in manual checking, editing and refining the
annotation algorithms.
Building a dictionary of Greek was much easier. Since
the pronunciation of Greek is mostly regular, it was not
necessary to come up with grapheme-phoneme
correspondences. Also, algorithms exist for the syllabification
of Greek. As a result, the Greek dictionary could contain
many more lexemes (500,000) than the English one. In addition,
Greek contained labels of parts of speech which had been
identified as important by Greek literacy intervention
experts. Parts of speech were omitted in the English
dictionary due to the level of difficulty of identifying them
automatically in free text. Also, since the same form can
belong to multiple parts of speech, this would have complicated
the structure of the dictionary unduly. Solving this issue will
be the task of future work.
Linking the profile and the dictionary with look-up and
annotation routines
In order to leverage the annotations in the dictionary, it
was necessary to link it to the phonic profile of each
language. To achieve that, profile entries were classified into
problem types (such as "problemType":
"LETTER_EQUALS_PHONEME"). A look-up routine was
developed for each problem type and was fed a profile entry
definition such as "descriptions":["a-æ"]. This
would then return all words containing a-æ in the
dictionary annotation for Grapheme/Phoneme
correspondence.
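A minimal sketch of such a look-up routine for the LETTER_EQUALS_PHONEME problem type, with a plain in-memory map standing in for the annotated dictionary (all names and mappings below are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class LookupRoutine {
    // Return every word whose grapheme/phoneme mapping contains one of
    // the profile entry's descriptions (e.g. "a-æ"). The Map stands in
    // for the annotated dictionary: word -> mapping annotation.
    static List<String> lookUp(Map<String, String> dictionary,
                               List<String> descriptions) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : dictionary.entrySet())
            for (String d : descriptions)
                if (Arrays.asList(e.getValue().split(",")).contains(d))
                    hits.add(e.getKey());
        Collections.sort(hits); // deterministic order for presentation
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> dict = Map.of(
                "at", "a-æ,t-t",
                "mate", "m-m,a-eɪ,te-t",
                "and", "a-æ,n-n,d-d");
        System.out.println(lookUp(dict, List.of("a-æ"))); // [and, at]
    }
}
```

Note that 'mate' is correctly excluded: its mapping contains a-eɪ, a different profile entry, which is exactly the distinction traditional machine-readable dictionaries cannot make.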
For Greek, the system was much simpler on the
orthographic side but had to deal with the issue of
identifying parts of speech and other morphological features
in text.
The look-up routines were accompanied by annotation
routines used to highlight the phonic feature within the
word.
The look-up and annotation routines were used in
multiple applications (see below):
Identify and highlight phonic features in text
Generate word lists for phonics games
Provide word structure to the game engine
Java Implementation
The annotation rules were developed as Java code to
support the rest of the iLearnRW system. We used special Java
classes for Text, Sentence and Word to provide three
different layers of text analysis. The Word class is the most
important one since it contains all the word information that
the dictionary holds, and it can be passed to the annotation
rule routines in order to be annotated. To support the
operation of the annotation we also use a Java class named
AnnotatedWord. Objects of this class contain both a Word
object and a list of "start", "end" index pairs pointing to
word parts that match specific language structure
properties.
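A simplified sketch of this mechanism (the project's real Word class carries full dictionary information, which is omitted here; the bracket rendering below is only an illustration of how start/end spans can drive highlighting):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AnnotatedWordSketch {
    // A [start, end) span within the word's orthographic form.
    record Span(int start, int end) {}

    static class AnnotatedWord {
        final String word;
        final List<Span> spans = new ArrayList<>();

        AnnotatedWord(String word) { this.word = word; }

        // An annotation rule records the indices of the part of the
        // word that matches a phonic feature.
        void annotate(int start, int end) { spans.add(new Span(start, end)); }

        // Render the annotation, e.g. for highlighting in a reader app.
        String highlighted() {
            StringBuilder sb = new StringBuilder(word);
            // Insert markers from the end so earlier indices stay valid.
            List<Span> ordered = new ArrayList<>(spans);
            ordered.sort(Comparator.comparingInt(Span::start).reversed());
            for (Span s : ordered) {
                sb.insert(s.end, ']');
                sb.insert(s.start, '[');
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        AnnotatedWord w = new AnnotatedWord("feelings");
        w.annotate(1, 3); // 'ee' matches a long-vowel feature
        System.out.println(w.highlighted()); // f[ee]lings
    }
}
```

In the real system the spans would feed colour highlighting in the reader rather than bracket insertion.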
Finally, the interface of the annotation rules module is
available as a web-service and can be called from
authenticated users of the iLearnRW system.
All services were developed using the Spring Framework [9].
APPLICATIONS OF THE ILEARNRW PHONICS ENGINE
Using the phonics engine, we were able to build three
applications for the iLearnRW project that provide
automated guidance to dyslexic readers and/or their
teachers. Additionally, using the Phonics Profile, we were
able to create a model of phonic ability (user model) for
each individual user of the system by assigning levels of
competence (called problem severities) to each entry in the
profile. This enabled us to provide personalised text
guidance to all readers about whom we have skill-level
information. This made it possible to go beyond traditional
readability measures in classifying texts for users. These
are the systems that draw upon the phonics engine as a
service.
Phonics Aware Reader
A prototype reader app was developed (see [10]) to
provide direct help to the student reading text. Two features
of the reader made use of the phonics engine.
1. Provide information about the phonic and
orthographic structure of a word to the reader on request.
See Figure 1.
Figure 1 Word structure information
This interface popped up on a tap-and-hold gesture in the
tablet reader. During the evaluation, many users commented
positively on the usefulness of the feature. For teachers, it
provides information about the word not available in
standard dictionaries.
2. Annotation rules to highlight chosen phonic features
in all words in the text. See Figure 2 and Figure 3 for an
example of implementation.
Figure 2 Phonics highlight settings
Figure 3 Example of highlighted text (red: long vowels, blue: short
vowels)
In order to provide a sensible interface for teachers to
choose from 413 phonic features in the profile, pre-sets of
typical contrasts were developed to cover areas such as
long/short vowels, different pronunciations of vowel letters
(e.g. ‘a’ as in ‘made’ and ‘mad’), different pronunciations of
consonant letters (e.g. soft/hard ‘c’), different prefix/suffix
types (e.g. –ing DOUBLE as in ‘running’ vs ADD as in
‘keeping’).
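Such pre-sets might be represented as a simple mapping from a contrast name to the profile descriptions it pairs. A sketch with illustrative keys and descriptions (not the project's actual list):

```java
import java.util.List;
import java.util.Map;

public class HighlightPresets {
    // Hypothetical pre-sets pairing contrasting profile descriptions so
    // a teacher can pick one named contrast instead of browsing all 413
    // features. Keys and descriptions are illustrative only.
    static final Map<String, List<String>> PRESETS = Map.of(
            "long/short a", List.of("a-eɪ", "a-æ"),            // 'made' vs 'mad'
            "soft/hard c", List.of("c-s", "c-k"),              // 'city' vs 'cat'
            "-ing suffix", List.of("ing-DOUBLE", "ing-ADD"));  // 'running' vs 'keeping'

    public static void main(String[] args) {
        // Each description in the chosen pre-set would be assigned its
        // own highlight colour and passed to the annotation routines.
        System.out.println(PRESETS.get("long/short a")); // [a-eɪ, a-æ]
    }
}
```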
This sort of functionality is not available in any other
automated tool relying on traditional machine-readable
dictionaries. It was highly appreciated by the teachers we
surveyed although it was not possible to evaluate it with
readers due to time constraints during the evaluation.
Serious Games Support
Using the text look up routines, a sorted list of words
was generated for each entry in the phonics profile. These
word lists were then fed into a serious game which utilised
the user model to present words associated with a given
entry to the player.
The game then used the dictionary annotations to derive
the word structure as it was presented to the player in the
game depending on the features practiced (e.g. splitting
prefixes from words as in Figure 4). The user model and
phonics engine were online services which the game could
query every time it presented a new activity.
Figure 4 Game utilizing information about word structure
A further layer of clusters was added to the phonics
profile to suggest to the game the order in which entries
should be practised and which entries can be practised
together.
This enabled the game to transcend the limitation of a
preset word list and present the readers with more age
appropriate words. This builds on the work reported in [2]
and opens up possibilities for phonics learning
activities that can automatically process any given word.
Text Classification Tool
This tool takes as input a user profile and a list of plain
texts. Then, it calculates the difficulty of each text
with respect to the difficulties described in the user's
model. Based on the final scores, the user can select
which of these texts has the most appropriate content for
her/him. See [11].
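The paper does not give the scoring formula, but one plausible sketch sums the user-model severities of the phonic features found in each word and normalises by text length (all names and values below are illustrative):

```java
import java.util.List;
import java.util.Map;

public class TextDifficulty {
    // Personalised difficulty score: sum, over the words of a text, the
    // user-model severities of the phonic features each word contains,
    // then normalise by text length. Feature extraction is assumed to
    // come from the phonics engine; here it is a plain map.
    static double score(List<String> words,
                        Map<String, List<String>> featuresOfWord,
                        Map<String, Double> severity) {
        double total = 0;
        for (String w : words)
            for (String f : featuresOfWord.getOrDefault(w, List.of()))
                total += severity.getOrDefault(f, 0.0);
        return total / words.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> features = Map.of(
                "running", List.of("ing-DOUBLE", "short-u"),
                "keeping", List.of("ing-ADD", "long-ee"));
        // Severities taken from a hypothetical user model: this reader
        // struggles with consonant doubling and long 'ee'.
        Map<String, Double> severity = Map.of(
                "ing-DOUBLE", 2.0, "long-ee", 1.0);
        System.out.println(score(List.of("running", "keeping"),
                features, severity)); // (2.0 + 1.0) / 2 = 1.5
    }
}
```

Because the severities come from the individual user model, two readers receive different scores for the same text, which is what distinguishes this approach from length-based readability proxies.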
Online Text Analysis Service
We are currently completing an online service where
teachers can submit their own text for analysis. See Figure 5
for an early mockup of the phonics selection interface.
Figure 5 Mockup of an online text analysis service for teachers
One of the features available is a heatmap of the phonic
features present in a particular text. Where a user profile is
available, this heatmap can be related to that individual user.
Figure 6 Text phonics heatmap
CONCLUSIONS AND FUTURE STEPS
We have shown that automated phonics and
orthographic guidance is possible and beneficial. Early
reactions of teachers who saw the demonstrations have been
extremely encouraging. Students also appreciated the results,
as shown through game and reader feedback, although
their exposure to the system was implicit.
This work has revealed a significant gap in the NLP
field related to phonics and automated literacy support and
we hope that more work in this area will be forthcoming.
The work on the project has generated a number of
insights and important lessons. While the approach we took
solved the key issues we set out to address, our work also
uncovered many new issues that will need addressing. These
include the structure of the phonics profile and the balance
of individual annotations in the dictionary.
These are the key areas of further development and
refinement we have identified based on the present work.
Refine the phonics profile of English. In particular,
conduct a frequency analysis of individual patterns and
reflect issues of difficulty and time of exposure. This
will be made possible by the new tools developed as
part of this project.
A longer English dictionary is necessary. We are
currently working on a dictionary based on the MRC
Psycholinguistic Database [12] to address the gap.
More annotations about word structure and parts of
speech need to be added to the phonics dictionary both
for English and Greek.
Semantic information should be integrated into the
guidance, most importantly as it relates to semantic
relationships such as synonyms or hyponyms as
opposed to dictionary definitions which are often more
complex than the word they try to explain. We suggest
investigating integration of information from WordNet
or FrameNet.
While the current dictionary contains some frequency
information, this is based on a general purpose corpus
of American English. Words should be annotated not
only with raw overall frequencies but also with
frequencies related to particular genres, as well as those
derived from a corpus of texts that students are likely to
have been exposed to in certain contexts and by a certain
age. Here we can build on the work of the Children’s
Printed Word Database [13].
Integrate more NLP features into the engine, both to
identify phonically relevant features such as parts of
speech and to extract semantic information such as
sentiment and named entities, increasing the
accessibility of texts by exposing their structure.
Conduct further research into readability metrics with
respect to reading difficulties such as dyslexia and
evaluate the benefits of personalised text classification.
We are currently in the process of formulating further
projects to take this effort forward but we hope to have
inspired others to advance our work further. We are working
on making the work done so far available to the research
community under an open licence.
ACKNOWLEDGMENTS
This work has been supported by the EU FP7 ICT
project iLearnRW - Integrated Intelligent Learning
Environment for Reading and Writing (318803).
Other members of the iLearnRW team contributed to
this project. Daniel Gooch developed the first version of the
Profile. Johan Andersson of Dolphin Computer Access
developed the first version of the dictionary and look-up
routines.
REFERENCES
[1] H. A. E. Mesmer, Tools for matching readers to texts:
research-based practices. New York: Guilford Press,
2008.
[2] L. Rello, C. Bayarri, and A. Gorriz, “What is Wrong
with This Word? Dyseggxia: A Game for Children
with Dyslexia,” in Proceedings of the 14th
International ACM SIGACCESS Conference on
Computers and Accessibility, New York, NY, USA,
2012, pp. 219–220.
[3] J. Rose, “Identifying and teaching children and young
people with dyslexia and literacy difficulties: an
independent report,” 2009.
[4] G. Brooks, “What works for children and young people
with literacy difficulties?,” The Dyslexia-SpLD Trust,
2013.
[5] D. J. Culpeper, P. F. Katamba, P. P. Kerswill, P. R.
Wodak, and P. T. McEnery, Eds., The English
Language: Description, variation and context. Palgrave
Macmillan, 2009.
[6] J. C. Wells, Longman pronunciation dictionary.
Pearson Education India, 2008.
[7] D. Jones, P. Roach, J. Setter, and J. Esling, Cambridge
English pronouncing dictionary. Cambridge University
Press, 2011.
[8] D. Gooch, M. Vasalou, D. Lukes, J. Flower, and L.
Benton, “User modelling for users with dyslexia and
dysorthographia,” iLearnRW Deliverable 4.1.
[9] C. Walls and R. Breidenbach, Spring in action (3rd
Edition). Greenwich [Conn.]: Manning, 2011.
[10] D. Lukes, “Dyslexia Friendly Reader: Prototype,
Designs, and Exploratory Study,” presented at IISA
2015.
[11] C. Litsas, M. Mastropavlou, and A. Symvonis, “Text
classification for children with dyslexia employing user
modelling techniques,” in The 5th International
Conference on Information, Intelligence, Systems and
Applications, IISA 2014, 2014, pp. 1–6.
[12] M. Wilson, “MRC Psycholinguistic Database:
Machine-usable dictionary, version 2.00,” Behav. Res.
Methods Instrum. Comput., vol. 20, no. 1, pp. 6–10,
1988.
[13] J. Masterson, M. Stuart, M. Dixon, and S. Lovejoy,
“Children’s printed word database: Continuities and
changes over time in children’s early reading
vocabulary,” Br. J. Psychol., vol. 101, no. 2, pp. 221–
242, 2010.