
Predicting Reaction Times in Word Recognition by Unsupervised Learning of Morphology

June 2011

DOI:10.1007/978-3-642-21735-7_34

Source
DBLP

Conference: Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14-17, 2011, Proceedings, Part I

Authors:

Sami Virpioja

University of Helsinki

Minna Lehtonen

University of Turku

Annika Hultén

Max Planck Institute for Psycholinguistics





Predicting Reaction Times in Word Recognition

by Unsupervised Learning of Morphology

Sami Virpioja¹, Minna Lehtonen²,³,⁴, Annika Hultén³,⁴, Riitta Salmelin³, and Krista Lagus¹

¹ Department of Information and Computer Science, Aalto University School of Science

² Cognitive Brain Research Unit, Cognitive Science, Institute of Behavioural Sciences, University of Helsinki

³ Brain Research Unit, Low Temperature Laboratory, Aalto University School of Science

⁴ Department of Psychology and Logopedics, Åbo Akademi University

Abstract. A central question in the study of the mental lexicon is how morphologically complex words are processed. We consider this question from the viewpoint of statistical models of morphology. As an indicator of the mental processing cost in the brain, we use reaction times to words in a visual lexical decision task on Finnish nouns. Statistical correlation between a model and reaction times is employed as a goodness measure of the model. In particular, we study Morfessor, an unsupervised method for learning concatenative morphology. The results for a set of inflected and monomorphemic Finnish nouns reveal that the probabilities given by Morfessor, especially the Categories-MAP version, show considerably higher correlations to the reaction times than simple word statistics such as frequency, morphological family size, or length. These correlations are also higher than when any individual test subject is viewed as a model.

1 Introduction

The processing of morphologically complex words is a central question in the study of the mental lexicon. Theoretical models have been put forward that suggest that morphologically complex words are recognized either through full-form representations [3], full decomposition (e.g. [17]), or a combination of the two (e.g. [11]). For example, Finnish words can be composed of several morphemes, and a single noun can, in principle, take up to 2000 different forms [7]. Having separate neural representations for each of these forms would seem unnecessarily demanding compared to a process in which words are analyzed in terms of their constituent morphemes. In behavioral word recognition tasks, a processing cost (i.e., long reaction times and high error rates) has been robustly associated with inflected Finnish nouns in comparison to matched monomorphemic nouns [11,10]. This has been taken as evidence for morphological decomposition of most Finnish inflected words, with the possible exception of very high frequency inflected nouns [15].

T. Honkela et al. (Eds.): ICANN 2011, Part I, LNCS 6791, pp. 275–282, 2011.

Springer-Verlag Berlin Heidelberg 2011


Statistical models of language learning are attractive both conceptually and because they yield quantitative predictions that can be tested against measured values of performance and, eventually, of brain activation. In this first feasibility test, we use reaction times as a proxy that provides an indirect measure of the underlying mental processing. In previous studies, several factors have been found to affect the recognition times of morphologically complex words, including the cumulative base frequency (i.e., the summed frequency of all the inflectional variants of a single stem [16]), surface frequency (i.e., whole-form frequency [1]), and morphological family size (i.e., the number of derivations and compounds in which the noun occurs as a constituent [2]). However, we do not know of any previous work that uses statistical models of morphology as models of reaction times. In the proposed evaluation setting, we examine how well such models predict the average reaction times for individual inflected and monomorphemic words in a word recognition task. As a particular morphological model we examine Morfessor, an unsupervised method for word segmentation that induces a compact lexicon of morphs from unannotated text data.

2 Experimental Setup

Our experimental setup can be summarized as follows: (1) Data recording: Measurement data from humans is obtained, namely reaction times recorded from test subjects in a lexical decision task with inflected and monomorphemic words. (2) Model estimation: Using training data of varying size and type, we estimate statistical models of morphology that can be used to predict the recognition times of words. In addition, we collect word statistics that are known to affect reaction times. (3) Model evaluation: We calculate the linear correlation between model predictions and the average reaction times of the test subjects. A good model is one that produces costs that correlate highly with the reaction times. Any of the human test subjects can also be viewed as a model, and that subject's reaction times correlated with those of the remaining subjects.

2.1 Reaction Time Data and Model Evaluation

We use the reaction time data reported in [9]. Sixteen Finnish-speaking university students participated in the experiment. The task was to decide as quickly and accurately as possible whether the letter string appearing on the screen was a real Finnish word or not, and to press a corresponding button. The stimuli consisted of 320 real Finnish nouns and 320 pseudowords. The words were taken from an unpublished Turun Sanomat newspaper corpus of 22.7 million word tokens and divided into four groups of 80 words according to their frequency in the corpus (high or low) and morphological structure (monomorphemic or inflected). There were four kinds of pseudowords (monomorphemic, real stem with pseudosuffix, pseudostem with real suffix, and incorrect combination of real stem and suffix), and their lengths and bigram frequencies (i.e., the average frequency of letter bigrams in the word) were similar to those of the real words.


As preprocessing, we exclude all incorrect responses and all reaction times more than three standard deviations above or below the individual's mean. For the remaining data, we take the logarithm of the reaction times, normalize them to zero mean for each subject, and calculate the average across subjects for each word. To evaluate the predicted costs, we calculate the Pearson product-moment correlation coefficient ρ between the costs and the average reaction times, with ρ ∈ [−1, +1] and ρ = 0 for uncorrelated variables. This is equivalent to calculating a linear regression, as ρ² corresponds to the coefficient of determination, i.e., the fraction of variance of the predicted variable explained by the predictor.
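The preprocessing and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the tuple layout of the trial records, and the choice to compute each subject's mean and standard deviation over correct responses only are assumptions made here.

```python
import math
from statistics import mean, stdev

def average_log_rt(trials):
    """trials: list of (subject, word, rt_ms, correct) tuples (hypothetical layout).
    Returns {word: average of per-subject zero-mean log reaction times}."""
    by_subject = {}
    for subj, word, rt, correct in trials:
        by_subject.setdefault(subj, []).append((word, rt, correct))

    per_word = {}
    for rows in by_subject.values():
        # Assumption: the subject's mean and SD are taken over correct responses.
        rts = [rt for _, rt, ok in rows if ok]
        m, sd = mean(rts), stdev(rts)
        # Drop incorrect responses and outliers beyond three standard deviations.
        kept = [(w, math.log(rt)) for w, rt, ok in rows
                if ok and abs(rt - m) <= 3 * sd]
        centre = mean(v for _, v in kept)  # zero-mean normalization per subject
        for w, v in kept:
            per_word.setdefault(w, []).append(v - centre)
    return {w: mean(vs) for w, vs in per_word.items()}

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

Given a dictionary of model costs for the same words, the evaluation would then amount to calling `pearson` on the cost values and the averaged reaction times, taken in the same word order.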

2.2 Statistics and Computational Models

Several statistics are calculated for each stimulus word: length, surface frequency, base frequency, morphological family size, and bigram frequency. As logarithmic frequencies often correlate with reaction times better than direct frequencies, we also test those. The computational models examined here give a probability distribution p(W) over the words. Thus, we can use the cost or self-information −log p(W) to explain the reaction times in a similar manner as with the word frequencies: a high probability is assumed to correlate with a low reaction time.

N-gram Models. We use n-gram models to get a good estimate of how common the form of the word (sequence of letters $l_i$) is among all the words in the language. An n-gram model of order $n$ is an $(n-1)$th order Markov model, thus approximating $p(W = l_1 l_2 \ldots l_N)$ as $\prod_{i=1}^{N} p(l_i \mid l_{i-n+1} \ldots l_{i-1})$. For estimating the n-gram probabilities $p(l_i \mid l_{i-n+1} \ldots l_{i-1})$, the standard techniques include smoothing of the maximum likelihood distributions and interpolation between different lengths of n-grams. We apply one of the state-of-the-art methods, Kneser-Ney interpolation [4], as implemented in the VariKN toolkit [14].
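To make the Markov factorization concrete, the following sketch computes a letter n-gram cost −log p(W) for a word. It deliberately swaps in plain add-one smoothing for Kneser-Ney interpolation and does not use VariKN; the boundary symbols and function names are hypothetical, and the smoothed values are only properly normalized for letters seen in training.

```python
import math
from collections import Counter

BOS, EOS = "<", ">"  # hypothetical word-boundary symbols

def train_ngram(word_weights, n):
    """Count letter n-grams and their (n-1)-letter contexts.
    word_weights: {word: weight}; weight 1 everywhere gives type-based training."""
    ngrams, contexts, alphabet = Counter(), Counter(), {EOS}
    for word, weight in word_weights.items():
        letters = [BOS] * (n - 1) + list(word) + [EOS]
        alphabet.update(word)
        for i in range(n - 1, len(letters)):
            ctx = tuple(letters[i - n + 1:i])
            ngrams[ctx + (letters[i],)] += weight
            contexts[ctx] += weight
    return ngrams, contexts, alphabet

def cost(word, model, n):
    """Self-information -log p(W) in nats, with add-one smoothing
    (a stand-in for the Kneser-Ney interpolation used in the paper)."""
    ngrams, contexts, alphabet = model
    letters = [BOS] * (n - 1) + list(word) + [EOS]
    total = 0.0
    for i in range(n - 1, len(letters)):
        ctx = tuple(letters[i - n + 1:i])
        p = (ngrams[ctx + (letters[i],)] + 1) / (contexts[ctx] + len(alphabet))
        total -= math.log(p)
    return total
```

A letter sequence that resembles the training words receives a lower cost than an unfamiliar one, which is the property the correlation analysis relies on.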

Morfessor Baseline. Morfessor [6] is a method for unsupervised learning of concatenative morphology. It does not limit the number of morphemes per word, and is thus suitable for modeling complex morphology such as that in Finnish. The basic idea can be explained using the Minimum Description Length (MDL) principle [13], where modeling is viewed as a problem of encoding a data set efficiently in order to transmit it. In two-part MDL coding, one first transmits the model M, and then the data set by referring to the model. Thus the task is to find the model that minimizes the sum of the coding lengths L(M) and L(corpus|M). In the case of segmenting words into morphs, the model simply consists of a lexicon of unique morphs, and a pointer assigned for each. The corpus is then transmitted by sending the pointer of each morph as they occur in the text. Using L(X) = −log p(X), the task is equivalent to probabilistic maximum a posteriori (MAP) estimation, where p(M|corpus) is maximized.
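The two-part code can be illustrated with a toy calculation of L(M) + L(corpus|M) for a given segmentation. This is a simplified sketch, not Morfessor's actual cost function: the lexicon is assumed to be spelled out letter by letter with a uniform code over a hypothetical 30-letter alphabet plus an end marker, and each morph token is coded with its maximum-likelihood probability.

```python
import math
from collections import Counter

def description_length(segmented_corpus, alphabet_size=30):
    """Two-part code length in bits for a corpus given as a list of morph lists.
    L(M): every unique morph spelled out with a uniform letter code plus an
          end-of-morph marker (a simplifying assumption).
    L(corpus|M): every morph token coded with -log2 of its ML probability."""
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    bits_per_symbol = math.log2(alphabet_size + 1)
    lexicon_bits = sum((len(m) + 1) * bits_per_symbol for m in counts)
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

# Four Finnish word forms, stored whole versus split into reusable morphs.
whole = [["talossa"], ["talossani"], ["autossa"], ["autossani"]]
split = [["talo", "ssa"], ["talo", "ssa", "ni"],
         ["auto", "ssa"], ["auto", "ssa", "ni"]]
```

On this toy data the split analysis has the shorter total code, because the saving in the lexicon outweighs the extra pointers needed in the corpus.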

In Morfessor Baseline, the lexicon consists of the strings and frequencies of the morphs. The cost of the lexicon increases with the number and length of the morphs. Each pointer in the corpus corresponds to a maximum likelihood probability set according to the morph frequency. Thus, for a known segmentation, the likelihood of the corpus is simply the product of the morph probabilities. During training, Morfessor applies a greedy algorithm that simultaneously finds the morph lexicon and a segmentation of the training corpus. After training, a Viterbi-like algorithm can be applied to find the segmentation with the highest probability (the product of the respective morph probabilities) for any single word. For details, see, e.g., [6] and [5].
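The Viterbi-like search for the most probable segmentation of a single word can be sketched with dynamic programming over split points. This is an illustrative sketch under the independence assumption described above; the lexicon and its probabilities are made up, and unlike the real tool it simply fails on words that cannot be built from known morphs.

```python
import math

def viterbi_segment(word, morph_logp):
    """Return (morphs, cost), where cost = -sum of morph log-probabilities.
    morph_logp: {morph: natural-log probability}. Returns (None, inf) if the
    word cannot be composed of lexicon morphs."""
    n = len(word)
    best = [math.inf] * (n + 1)  # best[i]: minimal cost of segmenting word[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last morph
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_logp and best[start] < math.inf:
                c = best[start] - morph_logp[piece]
                if c < best[end]:
                    best[end], back[end] = c, start
    if best[n] == math.inf:
        return None, math.inf
    morphs, i = [], n
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    return morphs[::-1], best[n]

# Hypothetical lexicon with invented probabilities.
lexicon = {m: math.log(p) for m, p in
           {"talo": 0.2, "ssa": 0.4, "talossa": 0.01, "ta": 0.1, "lo": 0.1}.items()}
```

The returned cost is the −log p(W) value that would be correlated with the reaction times.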

Morfessor Categories-MAP. The assumption of independence between the morphs in a word is an obvious problem in Morfessor Baseline. For example, the model gives an equal probability to "s + walk" and "walk + s". The later versions of Morfessor extend the model by adding another layer of representation, namely a Hidden Markov Model (HMM) of the segments [6]. In Morfessor Categories-MAP, the HMM has four categories (states): prefix, stem, suffix, and non-morpheme. While the model allows hierarchical segmentation to non-morphemes, the final analysis of a word is restricted by the regular expression (prefix* stem+ suffix*)+. The context-sensitivity of the model has led to improved segmentation results when compared to a linguistic gold standard segmentation of words into morphemes [6].
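The constraint on the final analysis can be checked directly with a regular expression over category labels. The sketch below only illustrates the constraint (prefix* stem+ suffix*)+; the one-letter encoding and the function name are hypothetical and are not part of Morfessor.

```python
import re

# One letter per category: P = prefix, S = stem, X = suffix (hypothetical encoding).
VALID = re.compile(r"(?:P*S+X*)+")
CODE = {"prefix": "P", "stem": "S", "suffix": "X"}

def is_valid_analysis(categories):
    """True if the category sequence matches (prefix* stem+ suffix*)+.
    Any other label, such as 'non-morpheme', makes the analysis invalid."""
    tags = "".join(CODE.get(c, "?") for c in categories)
    return VALID.fullmatch(tags) is not None
```

Under this constraint the order of morphs matters: a stem followed by a suffix ("walk + s") is accepted, while a suffix followed by a stem ("s + walk") is rejected.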

2.3 Data for Learning Computational Models

The main corpus in our experiments is the one used in the Morpho Challenge 2007 competition [8]. It is part of the Wortschatz collection [12] and contains three million sentences collected from the World Wide Web. To observe the effect of the training corpus, we also use random subsets of 30 000, 100 000, 300 000, and one million sentences from the corpus. In addition, we use three smaller corpora, namely the "Book" (4.4 million words) and "Periodical" (2.1 million words) parts of the Finnish Parole corpus [18] and movie subtitles from the OpenSubs corpus [19] (3.0 million words), as well as their combination.

It is often unclear whether intra-word models should be trained on a corpus (word tokens), a word lexicon (types), or something in between. For example, Morfessor Baseline gives segments that correspond better to linguistic morphemes when trained on types rather than tokens [6,5]: with token counts, many inflected high-frequency words are not segmented. Morfessor Categories-MAP, however, is by default trained on tokens [6]: the context-sensitivity of the Markov model reduces the effect of direct corpus frequencies. We compare models trained on types, tokens, and an intermediate approach, where the corpus frequencies are reduced using a logarithmic function f(c) = log(1 + c).
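The three training regimes differ only in the weight each word form receives. A minimal sketch follows, assuming a natural logarithm and no rounding of the dampened counts, neither of which is specified in the text; the function name and scheme labels are hypothetical.

```python
import math
from collections import Counter

def training_weights(tokens, scheme):
    """Map each word form to its training weight.
    scheme: 'tokens' (raw corpus counts), 'types' (every form counts once),
            or 'log' (dampened counts f(c) = log(1 + c))."""
    counts = Counter(tokens)
    if scheme == "tokens":
        return dict(counts)
    if scheme == "types":
        return {w: 1 for w in counts}
    if scheme == "log":
        return {w: math.log(1 + c) for w, c in counts.items()}
    raise ValueError(f"unknown scheme: {scheme}")
```

For a word seen 100 times and another seen once, the weight ratio is 100:1 with tokens, 1:1 with types, and roughly 6.7:1 with the logarithmic dampening, which places it between the two extremes.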

3 Results

Table 1 shows the correlations of the different statistics and logarithmic probabilities of the models to the average reaction times for the stimulus words. All values, except for the bigram frequency, showed statistically significant correlation (p(ρ = 0) < 0.01). Among the statistics, logarithmic frequencies gave higher correlations than linear frequencies, and the highest ones were obtained for the number of morphemes in the word and the surface frequency. Among the models, the n-grams were best trained with word types, while training with the logarithmic frequencies gave the highest correlations for Morfessor. The highest correlation was obtained for the letter 9-gram model trained with word types; longer n-grams did not improve the results. Categories-MAP correlated almost as well as the 9-gram model, while Baseline did somewhat worse. All of them had markedly higher correlations than the maximum correlation obtained for a single test subject to the average reaction times of the others.

Table 1. Correlation coefficients ρ of different word statistics and models to average human reaction times. Surface frequency I and other statistics are from the Turun Sanomat newspaper corpus. Surface frequency II is from the Morpho Challenge corpus used for training the models. The last row shows correlations for reaction times of individual subjects. The highest correlations are marked with an asterisk.

| Word statistics | Logarithmic | Linear |
|---|---|---|
| Surface frequency I | −0.5108 | −0.2806 |
| Surface frequency II | −0.5353* | −0.2376 |
| Base frequency | −0.4453 | −0.1901 |
| Morphological family size | −0.4233 | −0.2916 |
| Bigram frequency | −0.0211 | +0.0221 |
| Length (letters) | +0.2180 | +0.2158 |
| Length (morphemes) | +0.5417* | +0.5417* |

| Models | Types | Log-frequencies | Tokens |
|---|---|---|---|
| Letter 1-gram model | +0.1818 | +0.1816 | +0.1799 |
| Letter 5-gram model | +0.5394 | +0.5380 | +0.5160 |
| Letter 9-gram model | +0.6952* | +0.6920 | +0.6358 |
| Morfessor Baseline | +0.6605 | +0.6765* | +0.5817 |
| Morfessor Categories-MAP | +0.6620 | +0.6950* | +0.5474 |

| Other | Minimum | Median | Maximum |
|---|---|---|---|
| Reaction times of a single subject | +0.2030 | +0.4774 | +0.5681* |

With logarithmic counts, the Categories-MAP model segmented 135 of the 160 inflected nouns, but also 33 of the 160 monomorphemic nouns. The Baseline model segmented fewer words: 39 of the inflected and 5 of the monomorphemic nouns.

Figure 1 shows how the reaction times and the probabilities given by the Categories-MAP model match for individual stimulus words. Observing the words with a poor match between predicted difficulty and reaction time led us to suspect that some of the unexplained variance is due to a training corpus that does not match the material that humans are exposed to. Thus we next studied the effect of the training corpus on the morphological models (Fig. 2). Increasing the number of word types in the corpus clearly improved the correlation between model predictions and measured reaction times. However, the data from books, periodicals, and subtitles usually gave higher correlations than the same amount of the Morpho Challenge data.


Fig. 1. Scatter plot of reaction times and log-probabilities from Morfessor Categories-MAP. The words are divided into four groups: low-frequency monomorphemic (LM), low-frequency inflected (LI), high-frequency monomorphemic (HM), and high-frequency inflected (HI). Words that have faster reaction times than predicted are often very concrete and related to family, nature, or stories: tyttö (girl), äiti (mother), haamu (ghost), etanaa (snail + partitive case), norsulla (elephant + adessive case). Words that have slower reaction times than predicted are often more abstract or professional: ohjelma (program), tieto (knowledge), hankkeen (project + genitive case), käytön (usage + genitive case), hiippa (miter), kapselin (capsule + genitive case).

Fig. 2. The effect of training corpus on correlations of Morfessor Baseline (blue circles), Categories-MAP (red squares), and logarithmic surface frequencies (black crosses). The dotted lines show the results on subsets of the same corpus. Unconnected points show the results using different types of corpora.

4 Discussion

We studied how language models trained on unannotated textual data can predict human reaction times for inflected and monomorphemic Finnish words in a lexical decision task. Three models, the letter-based 9-gram model and the Morfessor Baseline and Categories-MAP models, provided not only higher correlations than the simple word statistics previously identified as important factors affecting recognition times for morphologically complex words (cf. [16,1,2]), but also higher correlations than those of the reaction times of individual subjects to the average times of the others. The level of correlation was surprisingly high, especially because the training corpus is likely to differ from the material humans encounter during the course of their lives. Based on the results with several training corpora, we assume that even higher correlations would be obtained with more realistic training data.

The highest correlations were obtained for the letter 9-gram model. However, its number of parameters (almost 6 million n-gram probabilities) was very large. As the estimates of the word probabilities are very precise, we assume that they are good predictors especially for early visual processing stages.

The Categories-MAP model had almost as high a correlation as the 9-gram model with far fewer parameters (178 000 transition and emission probabilities). It has three important aspects: First, it uses morpheme-like units instead of words or letters. Second, it finds units that provide a compact representation of the data. Third, the model is context-sensitive: the cost of the next unit depends on the previous unit. It is still unclear which contributes more to the high correlations: the morpheme lexicon learned by minimizing the description length, or the underlying probabilistic model. One way to study this question further is to apply a similar model to a linguistic morphological analysis of a corpus.

While behavioral reaction times necessarily incorporate multiple processing stages, brain activation measures could provide markedly more precise markers of the different stages of visual word processing. At the level of the brain, effects of morphology have been previously detected in neural responses that have been associated with later stages of word recognition such as lexical-semantic, phonological and syntactic processing [9,20]. Future work includes finding out whether the predictive power of the models stems from some of these stages, or from an earlier one related to the processing of visual word forms.

Acknowledgments. This work was funded by Academy of Finland, Graduate School of Language Technology in Finland, Sigrid Jusélius Foundation, Finnish Cultural Foundation, and Stiftelsen för Åbo Akademi.

References

1. Alegre, M., Gordon, P.: Frequency effects and the representational status of regular inflections. Journal of Memory and Language 40, 41–61 (1999)

2. Bertram, R., Baayen, R., Schreuder, R.: Effects of family size for complex words. Journal of Memory and Language 42, 390–405 (2000)

3. Butterworth, B.: Lexical representation. In: Butterworth, B. (ed.) Language Production, pp. 257–294. Academic Press, London (1983)

4. Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4), 359–393 (1999)

5. Creutz, M., Lagus, K.: Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Tech. Rep. A81. Publications in Computer and Information Science. Helsinki University of Technology (2005)

6. Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing 4(1) (January 2007)

7. Karlsson, F.: Suomen kielen äänne- ja muotorakenne (The Phonological and Morphological Structure of Finnish). Werner Söderström, Juva (1983)

8. Kurimo, M., Creutz, M., Varjokallio, M.: Morpho challenge evaluation using a linguistic gold standard. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 864–872. Springer, Heidelberg (2008)

9. Lehtonen, M., Cunillera, T., Rodríguez-Fornells, A., Hultén, A., Tuomainen, J., Laine, M.: Recognition of morphologically complex words in Finnish: evidence from event-related potentials. Brain Research 1148, 123–137 (2007)

10. Lehtonen, M., Laine, M.: How word frequency affects morphological processing in monolinguals and bilinguals. Bilingualism: Language and Cognition 6, 213–225 (2003)

11. Niemi, J., Laine, M., Tuominen, J.: Cognitive morphology in Finnish: foundations of a new model. Language and Cognitive Processes 9, 423–446 (1994)

12. Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy, pp. 1799–1802 (2006)

13. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)

14. Siivola, V., Hirsimäki, T., Virpioja, S.: On growing and pruning Kneser-Ney smoothed n-gram models. IEEE Transactions on Audio, Speech & Language Processing 15(5), 1617–1624 (2007)

15. Soveri, A., Lehtonen, M., Laine, M.: Word frequency and morphological processing revisited. The Mental Lexicon 2, 359–385 (2007)

16. Taft, M.: Recognition of affixed words and the word frequency effect. Memory and Cognition 7, 263–272 (1979)

17. Taft, M.: Morphological decomposition and the reverse base frequency effect. The Quarterly Journal of Experimental Psychology A 57, 745–765 (2004)

18. The Department of General Linguistics, University of Helsinki and Research Institute for the Languages of Finland (gatherers): Finnish Parole Corpus (1996–1998), available through CSC, http://www.csc.fi/

19. Tiedemann, J.: News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In: Recent Advances in Natural Language Processing, vol. 5, pp. 237–248. John Benjamins, Amsterdam (2009)

20. Vartiainen, J., Aggujaro, S., Lehtonen, M., Hultén, A., Laine, M., Salmelin, R.: Neural dynamics of reading morphologically complex words. NeuroImage 47, 2064–2072 (2007)

Statistical models of morphology predict eye-tracking measures during visual word recognition

Article

Full-text available

May 2019

We studied how statistical models of morphology that are built on different kinds of representational units, i.e., models emphasizing either holistic units or decomposition, perform in predicting human word recognition. More specifically, we studied the predictive power of such models at early vs. late stages of word recognition by using eye-tracking during two tasks. The tasks included a standard lexical decision task and a word recognition task that assumedly places less emphasis on postlexical reanalysis and decision processes. The lexical decision results showed good performance of Morfessor models based on the Minimum Description Length optimization principle. Models which segment words at some morpheme boundaries and keep other boundaries unsegmented performed well both at early and late stages of word recognition, supporting dual- or multiple-route cognitive models of morphological processing. Statistical models based on full forms fared better in late than early measures. The results of the second, multi-word recognition task showed that early and late stages of processing often involve accessing morphological constituents, with the exception of short complex words. Late stages of word recognition additionally involve predicting upcoming morphemes on the basis of previous ones in multimorphemic words. The statistical models based fully on whole words did not fare well in this task. Thus, we assume that the good performance of such models in global measures such as gaze durations or reaction times in lexical decision largely stems from postlexical reanalysis or decision processes. This finding highlights the importance of considering task demands in the study of morphological processing.

Using Statistical Models of Morphology in the Search for Optimal Units of Representation in the Human Mental Lexicon

Article

Dec 2017
COGNITIVE SCI

Determining optimal units of representing morphologically complex words in the mental lexicon is a central question in psycholinguistics. Here, we utilize advances in computational sciences to study human morphological processing using statistical models of morphology, particularly the unsupervised Morfessor model that works on the principle of optimization. The aim was to see what kind of model structure corresponds best to human word recognition costs for multimorphemic Finnish nouns: a model incorporating units resembling linguistically defined morphemes, a whole-word model, or a model that seeks for an optimal balance between these two extremes. Our results showed that human word recognition was predicted best by a combination of two models: a model that decomposes words at some morpheme boundaries while keeping others unsegmented and a whole-word model. The results support dual-route models that assume that both decomposed and full-form representations are utilized to optimally process complex words within the mental lexicon.

Empirical Comparison of Evaluation Methods for Unsupervised Learning of Morphology

Article

Full-text available

Jan 2011

Unsupervised and semi-supervised learning of morphology provide practical solutions for processing morphologically rich languages with less human labor than the traditional rule-based analyzers. Direct evaluation of the learning methods using linguistic reference analyses is important for their development, as evaluation through the final applications is often time consuming. However, even linguistic evaluation is not straightforward for full morphological analysis, because the morpheme labels generated by the learning method can be arbitrary. We review the previous evaluation methods for the learning tasks and propose new variations. In order to compare the methods, we perform an extensive meta-evaluation using the large collection of results from the Morpho Challenge competitions.

Information properties of morphologically complex words modulate brain activity during word reading

Article

Full-text available

Mar 2018

Neuroimaging studies of the reading process point to functionally distinct stages in word recognition. Yet, current understanding of the operations linked to those various stages is mainly descriptive in nature. Approaches developed in the field of computational linguistics may offer a more quantitative approach for understanding brain dynamics. Our aim was to evaluate whether a statistical model of morphology, with well-defined computational principles, can capture the neural dynamics of reading, using the concept of surprisal from information theory as the common measure. The Morfessor model, created for unsupervised discovery of morphemes, is based on the minimum description length principle and attempts to find optimal units of representation for complex words. In a word recognition task, we correlated brain responses to word surprisal values derived from Morfessor and from other psycholinguistic variables that have been linked with various levels of linguistic abstraction. The magnetoencephalography data analysis focused on spatially, temporally and functionally distinct components of cortical activation observed in reading tasks. The early occipital and occipito-temporal responses were correlated with parameters relating to visual complexity and orthographic properties, whereas the later bilateral superior temporal activation was correlated with whole-word based and morphological models. The results show that the word processing costs estimated by the statistical Morfessor model are relevant for brain dynamics of reading during late processing stages.

Unsupervised Models for Morpheme Segmentation and Morphology Learning

Article

Full-text available

Feb 2007

We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs , is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.

Frequency Effects and the Representational Status of Regular Inflections

Article

Full-text available

Jan 1999

There have been many recent proposals concerning the nature of representations for inflectional morphology. One set of proposals addresses the question of whether there is decomposition of morphological structure in lexical access, whether complex forms are accessed as whole words, or if there is a competition between these two access modes. Another set of proposals addresses the question of whether inflected forms are generated by rule-based systems by connectionist type associative networks or if there is a dual system dissociating rule-based regular inflections from association-based irregular inflections. A central question is whether there are whole-word representations for regularly inflected forms. A series of five lexical decision experiments addressed this question by looking at whole-word frequency effects across a range of frequency values with constant stem-cluster frequencies. Frequency effects were only found for inflected forms above a threshold of about 6 per million, whereas such effects were found for morphologically simple controls in all frequency ranges. We discuss these data in the context of two kinds of dual models and in relation to competition models proposed within the connectionist literature.

Corpus Portal for Search in Monolingual Corpora

Article

Jan 2006

A simple and flexible schema for storing and presenting monolingual language resources is proposed. In this format, data for 18 different languages is already available in various sizes. The data is provided free of charge for online use and download. The main target is to ease the application of algorithms for monolingual and interlingual studies.

Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0

Article

Jan 2005

In this work, we describe the first public version of the Morfessor software, which is a program that takes as input a corpus of unannotated text and produces a segmentation of the word forms observed in the text. The segmentation obtained often resembles a linguistic morpheme segmentation. Morfessor is not language-dependent. The number of segments per word is not restricted to two or three as in some other existing morphology learning models. The current version of the software essentially implements two morpheme segmentation models presented earlier by us (Creutz and Lagus, 2002; Creutz, 2003). The document contains user's instructions, as well as the mathematical formulation of the model and a description of the search algorithm used. Additionally, a few experiments on Finnish and English text corpora are reported in order to give the user some ideas of how to apply the program to his own data sets and how to evaluate the results.

News from OPUS—A Collection of Multilingual Parallel Corpora with Tools and Interfaces

Chapter

Jan 2009

Jörg Tiedemann

Cognitive Morphology in Finnish: Foundations of a New Model

Article

Aug 1994

We summarise the main results from a series of Finnish studies dealing with single-word experiments with aphasics as well as lexical decision and eye-movement registration tests performed on normals. On the basis of our experimental results, we propose a processing model of Finnish nouns. For the input and central lexicons, this Stem Allomorph/Inflectional Decomposition (SAID) model assumes morphological decomposition of inflected (with the exception of the most frequently encountered inflected noun forms) but not derived noun forms. For the output lexicon, it predicts that both inflected and productive derived forms have decomposed representations. In the case of marked stem variation (resulting from stem formation and/or morphophonological alternation), the model assumes that the stems are represented by their allomorphs, and not by a single morph. In this respect, our model postulates more suppletion in the input/output lexicons than would be predicted on the basis of formal morphological analyses. However, among the allomorphs, the nominative singular of nouns appears to have a special status.

Word Frequency and Morphological Processing in Finnish Revisited

Article

Dec 2007

The aims of the present study were to investigate the effects of word frequency on morphological processing of inflected words in Finnish, and to re-test previous results obtained for high frequency inflected words in Finnish which suggest that inflected words of high frequency might have full-form representations in the mental lexicon. Our results from three visual lexical decision experiments with monolingual Finnish speakers suggest that only very high frequency inflected Finnish words have full-form representations. This finding differs from results obtained from related studies in morphologically more limited Indo-European languages, in which full-form representations for inflected words seem to exist at a much lower level of frequency than in the morphologically rich Finnish language.

Effects of Family Size for Complex Words

Article

Apr 2000

R. Schreuder and R. H. Baayen (1997) reported that in visual lexical decision, response latencies to a simplex noun are shorter when this noun has a large morphological family, i.e., when it appears as a constituent in a large number of derived words and compounds. This article addresses the question of whether the family size of the base word of a complex word likewise affects lexical processing. College students participated in 6 experiments that show that family size plays a role for both inflected and derived words. Post hoc analyses show that the effect of family size is driven by the semantically transparent family members and that this effect is further constrained by semantic selection restrictions of the affix in the target word.

How word frequency affects morphological processing in monolinguals and bilinguals [Electronic version]

Article

Dec 2003
BILING-LANG COGN

Matti Laine

The present study investigated processing of morphologically complex words in three different frequency ranges in monolingual Finnish speakers and Finnish-Swedish bilinguals. By employing a visual lexical decision task, we found a differential pattern of results in monolinguals vs. bilinguals. Monolingual Finns seemed to process low-frequency and medium-frequency inflected Finnish nouns mostly by morpheme-based recognition but high-frequency inflected nouns through full-form representations. In contrast, bilinguals demonstrated a processing delay for all inflections throughout the whole frequency range, suggesting decomposition for all inflected targets. This may reflect different amounts of exposure to the word forms in the two groups. Inflected word forms that are encountered very frequently will acquire full-form representations, which saves processing time. However, with the lower rates of exposure that characterize bilingual individuals, full-form representations do not start to develop.

Lexical Representation

Chapter