Accepted: 26 April 2022
© The Author(s) 2022
Joanna Daria Dołżycka
joanna.dolzycka@uni-ulm.de
1 Nicolaus Copernicus University, Toruń, Poland
2 Department of Applied Emotion and Motivation Psychology, Ulm University, Ulm, Germany
3 SWPS University of Social Sciences and Humanities, Warsaw, Poland
4 Institut für Psychologie und Pädagogik, Abteilung Angewandte Emotions- und
Motivationspsychologie, Albert-Einstein-Allee 47, D-89081 Ulm, Germany
Constructing Pseudowords with Constraints on
Morphological Features - Application for Polish Pseudonouns
and Pseudoverbs
Joanna DariaDołżycka1,2,4 · JanNikadon1,3· MagdalenaFormanowicz1,3
Journal of Psycholinguistic Research
https://doi.org/10.1007/s10936-022-09884-6
Abstract
Pseudowords allow researchers to investigate multiple grammatical or syntactic aspects of
language processing. In order to serve that purpose, pseudoword stimuli need to preserve
certain properties of real language. We provide a Python-based pipeline for the generation
of pseudoword stimuli that sound/read naturally in a given language. The pseudowords are
designed to resemble real words and clearly indicate their grammatical class for languages
that use specific suffixes to mark parts of speech. We also provide two sets of pseudonouns
and pseudoverbs in Polish that are outcomes of the applied pipeline. The sets are equipped
with psycholinguistically relevant properties of words, such as orthographic Levenshtein
distance 20. We also performed two studies (overall N = 640) to test the validity of the
algorithmically constructed stimuli in a human sample. Thus, we present stimuli that are
deprived of direct meaning yet are clearly classifiable as grammatical categories while
being orthographically and phonologically plausible.
Keywords Pseudowords · Grammar · Linguistic processing · Psycholinguistics · Wuggy
Introduction
Paradoxically, when investigating natural language processing, researchers are often compelled
to employ pseudowords and artificial languages. Pseudowords are used in psycho- and
neurolinguistic studies because they allow researchers to abstract from the actual meaning
and context conveyed by real words. The use of pseudowords enables better control of the
morphological (Longtin & Meunier, 2005; Snyder, 1995), semantic (Dorner & Harris,
1997), and syntactic (Opitz, 2004) properties of stimuli, which in turn allows researchers to
disentangle the specific role of these features in language processing.
Yet, to ensure that pseudowords can be used in a meaningful way, they must be devised
with care and control. In this paper, we describe previous methods of pseudoword generation
and point out their advantages and disadvantages. Next, we propose a novel
procedure of pseudoword generation based on the Wuggy algorithm. We augmented the
basic procedure with language-specific rules based on linguistic priors and real
language-use data, which allows for the distinction of grammatical categories. As an example of
the application of this procedure, we provide a Python-based script for (Polish) pseudoverb
and pseudonoun construction. Additionally, we provide two grammatically diverse sets of
stimuli in Polish (i.e., pseudonouns and pseudoverbs with comparable psycholinguistically
relevant properties) that can be used for various research applications. The first set contains
159 pseudowords with a variety of stems and suffixes typical for a specific grammatical
class (noun or verb). The second set (242 pseudowords) contains pseudonoun-pseudoverb
pairs that share common stems. These two example pseudoword stimuli sets can be
used in psycholinguistic research that focuses on the differences in processing words that
belong to the mentioned grammatical classes, especially by those contributing to the ongoing
debate on the differences in the processing of nouns and verbs (Vigliocco et al., 2011). In
this article, we focus on verbs and nouns as an example. Importantly, however, the proposed
method could be easily extended to any grammatical category, and thus, it offers a broader
framework of pseudoword construction whenever some specific morphological features of
pseudowords are of special consideration. Finally, we provide an open-access Python script
that implements our pseudoword generation procedure and can be used by other researchers
to generate similar datasets in other languages or to impose other morphological constraints
on the pseudowords generated.
Psycholinguistically Relevant Properties of Pseudoword Stimuli
Examining word processing mechanisms by employing pseudowords is an established
methodological approach (Heim et al., 2007; Kissler & Herbert, 2013; Price et al., 1996). A
particular advantage of using pseudowords in linguistic research is that they are devoid of
literal meaning, which allows for the evaluation of nonsemantic aspects of language processing.
Pseudowords also facilitate control over a multitude of other potentially confounding
features of linguistic stimuli (that are difficult to manipulate, manage, and account for
in the case of real words), for example, imageability (Tyler et al., 2002) and age of acquisition
(Kuperman et al., 2012). The advantages of pseudowords are appreciated in many contexts.
First, in the investigation of sublexical processing and its relationship to cognitive function,
pseudowords are used as stimuli to test mechanisms of cognitive models of reading (Coltheart
et al., 2001; Grainger & Jacobs, 1996; Harm & Seidenberg, 2004). Second, pseudowords
are useful when social aspects of language are studied. Presenting vocal stimuli
that are detached from meaning but contain frequency and intensity cues corresponding to
real emotional words allows for examining the impact of prosodic factors in linguistically
induced emotion processing (Kissler & Herbert, 2013; Preti et al., 2016). Also, an important
application of pseudowords can be found in examinations of grammar, and specifically
whether grammatical categories deprived of literal semantic content have the metasemantic
effect of conveying meaning through grammatical cues (Formanowicz et al., 2017). However,
due to structural differences among languages, preparing stimuli that reflect universal
rules for processing grammatical classes can be very difficult if not impossible (e.g.,
Vigliocco et al., 2011).
When constructing pseudowords for various research applications, however, some features
need to be taken into account. First, pseudowords should be constructed with respect
to the statistical properties of real word morphology. In that sense, pseudowords differ from
nonwords with respect to orthographic and phonological regularities (Rosinski & Wheeler,
1972). According to Ziegler et al. (1997), in perceptual identification tasks, words and
pseudowords show reaction-time-related priority effects over nonwords. This means that,
if the presented stimulus is more word-like (complies with the linguistic rules of a given
language), the reaction time for its identification as a language-related item will be shorter. This
indicates that, in many research situations, it is not enough to use nonwords (random letter
strings), but stimuli must be constructed according to certain rules in order to be treated as
linguistic stimuli regarding the amount of word-specific information that they carry. Barca
and Pezullo’s (2012) study supports this conclusion. They presented three types of stimuli
(words, pseudowords, and random letter strings), and participants were asked to classify
stimuli as linguistic or nonlinguistic. The results of a computer mouse movement trajectory
analysis demonstrated the presence of a lexical dimension line, which is a continuous
measure of recognition of linguistic stimuli. Real high- and low-frequency words were stable
and correctly recognized as lexical stimuli, pseudowords were problematic but eventually
categorized as nonlexical stimuli, and letter strings were clearly judged as nonlexical
stimuli. These findings demonstrate that nonwords and pseudowords are processed differently
and the latter need to meet specific linguistic criteria that should be rigorously controlled to
be considered a linguistic unit.
Second, when designing pseudoword stimuli, not only should the resemblance to real
words be considered, but also the frequency of real word occurrences from which pseudowords
originate. A study by Perea et al. (2005) found that pseudowords generated from
high-frequency real words yielded less reaction-time latency than pseudowords generated
from low-frequency real words because the verification mechanism of the high-frequency
base of pseudowords facilitates its comprehension.
The third important feature in pseudoword construction is the frequency of the first syllable
(Chetail & Mathey, 2009). According to Carreiras et al. (2006), behavioral and neural
responses can differ depending on the first-syllable frequency. Thus, from the linguistic
research perspective, the probability of obtaining phonetically correct syllables (N-grams)
is another important factor of pseudoword construction. The first-syllable frequency
impacts word naming and lexical decision tasks, where multisyllabic words with frequent
first syllables are named more quickly than those with less frequent ones, but they elicit
slower reaction times in decision tasks (Álvarez et al., 2000). Thus, pseudoword construction
should also consider the first-syllable effect to enhance pseudoword
reading fluency.
The fourth feature regarding the construction of pseudowords is their similarity to real
words. Neurolinguistic research has shown that pseudowords that are more similar to real
words can elicit faster (and, in a neurophysiological sense, stronger) responses than stimuli
that are more distant from real words (Dorner & Harris, 1997). Thus, the similarity
between pseudowords and real words should be controlled. Typically, pseudowords used in
research should be generated from high-frequency real words, but they also should not be
too similar to real words to avoid evoking unintended associations with real word semantic
content and an overlapping of the neural activity associated with this content. Real words
are typically considered to be similar in three different ways (Siew, 2018): semantically
(i.e., with respect to their meaning, e.g., nail, hammer, and mallet), phonologically (i.e.,
with respect to their sound in spoken language, e.g., chute and shoot), and finally,
orthographically (i.e., with respect to their spelling in written language, e.g., nail and mail). In the
case of languages that use nonlogographic writing systems (e.g., alphabetical), phonological
and orthographic similarity are strongly associated. In the case of pseudoword-based auditory
stimuli, their phonological similarity to real words is of greater importance, while for
pseudoword-based visual stimuli, their orthographic similarity to real words should be of
primary consideration. For example, orthographic similarity affects the speed of recognition
of pseudowords; a high number of orthographic neighbors leads to faster responses (Huntsman
& Lima, 2002).
The extent of the orthographic neighborhood of a (pseudo)word is most often quantified
using either Coltheart’s orthographic neighborhood size metric (ON, Coltheart’s
N) or orthographic Levenshtein distance 20 (OLD20; Yarkoni et al., 2008). ON can only
be applied to strings with the same number of characters and is a strictly binary
measure (two strings are considered to be neighbors if one of them can be transformed into
the other by a single letter substitution, e.g., brain → train). This approach to orthographic
similarity quantification is considered relatively restrictive (Yarkoni et al., 2008) because
the perceptual similarity of words (or other character sequences) can also be achieved by
letter insertion (e.g., widow → window), deletion (e.g., planet → plane), or transposition
(e.g., trail → trial), none of which is considered when using the ON approach (Yarkoni et
al., 2008). The OLD20 method overcomes these constraints, as it is based on Levenshtein
distance (LD; Levenshtein, 1966), a string distance metric widely used in computer science.
The LD between two words is defined as the minimum number of substitutions, insertions,
or deletions required to transform one string into the other. Thus, it quantifies the orthographic
distance between two character sequences in a nonbinary manner, and moreover, it
is not limited to the comparison of strings of the same length. For example, for the words
brains and transit, we get LD = 4 because a minimal sequence of four operations that generates the
second word from the first is (1) substitute “b” with “t” → trains; (2) delete “i” → trans;
(3) insert “i” → transi; and (4) insert “t” → transit. Yarkoni et al. (2008) defined OLD20
as the mean LD from a given word (in this research, a pseudoword) to its 20 closest LD-based
orthographic neighbors (in this research, real words). This implies that the calculation
of OLD20 for a given character sequence requires the computation of LD between this
sequence and every word in a given (reference) lexicon before the selection of the top 20 of
its closest neighbors can be made and their average LD can be calculated. Due to
these advantages of OLD20 over the ON measure, in this research we relied on the former
to quantify the orthographic distance between the pseudowords produced and real words.
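The two neighborhood measures described above can be illustrated with a short Python sketch. It is not the code used in this study or in Wuggy; the function names, the toy lexicon, and the small value of n in the usage lines are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, or deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def is_coltheart_neighbor(a: str, b: str) -> bool:
    """ON criterion: same length and exactly one substituted letter."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def old_n(item: str, lexicon, n: int = 20) -> float:
    """Mean LD from item to its n closest words in the reference lexicon.

    If the lexicon holds fewer than n other words, all of them are used
    (an assumption of this sketch, not part of the OLD20 definition).
    """
    distances = sorted(levenshtein(item, word) for word in lexicon if word != item)
    closest = distances[:n]
    if not closest:
        raise ValueError("the reference lexicon contains no comparison words")
    return sum(closest) / len(closest)


print(levenshtein("brains", "transit"))         # 4, as in the example above
print(is_coltheart_neighbor("brain", "train"))  # True
print(is_coltheart_neighbor("widow", "window")) # False: lengths differ
print(old_n("mail", ["nail", "main", "mall", "hammer"], n=3))
```

Note that plain LD counts a transposition such as trail → trial as two substitutions, so the pair is close under LD but is not an ON neighbor.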
Finally, when constructing stimuli for psycholinguistic and neurolinguistic studies, one
must consider that each language has unique properties, and it is impossible to create universal
stimuli sets that can be applied to all or many different languages. For example,
Polish and English have significantly different structural properties. In terms of linguistic
typology, Polish is an inflectional language (Polański, 1999); the grammatical function of a
word is determined by its inflectional ending, which communicates information about the
word’s grammatical class (e.g., verb, noun, or adjective) and inherently conveys a partial
meaning of the word, for example, by including a grammatical gender exponent in nouns or
adjectives. In contrast, in English, which is an isolating (analytical) language with a fixed
sentence pattern (SVO), the grammatical function of words within a sentence is conveyed
by word order (Moravicsik, 2013). In other words, the easiest way to determine the grammatical
class of a word is through the sentence pattern (e.g., I desert the desert, where the
grammatical class of the word desert in its written form could not be grammatically distinguished
without the context). In Polish, and other inflectional languages, due to the variety
of inflectional exponents, it is easier to distinguish individual lexical forms, regardless of
their order in the sentence. That is to say, grammatical classes can be recognized by looking
at the word itself (e.g., bieg-ø is a noun, while bieg-a-ć is an infinitive verb form; in contrast,
run, the English translation, can only be disambiguated by adding “the” or “to”). One
method for resolving differences in grammatical class processing is to employ pseudowords
with suffixes that clearly indicate a particular grammatical class, which is possible in Polish
and other inflectional languages.
The approach presented in this article provides a solution to the problem of homonymy
of linguistic units encountered in the single word presentation paradigm. When words have
the same forms (e.g., run), they are grammatically indistinguishable without an auxiliary
word (“to” or “the”) as in positional languages. Our method results in the presentation of a
pseudoword as a single letter string with a fixed complexity (pseudoword stem and grammatical
exponent) rather than a stimulus consisting of two separate letter strings. Such pseudowords
are best suited for researchers who want to investigate the cognitive connotations
associated with understanding concepts embedded in grammatical classes. We focus here on
pseudoverbs and pseudonouns, to refer to the current debate addressing the issue of whether
syntactic or semantic features of words have an impact on differences in functional processing
in psycholinguistic tasks (Vigliocco et al., 2011). Given the importance of studying
verbs and nouns as main cognitive concepts and sentence-building grammatical categories,
we propose a novel method of constructing pseudoverbs and pseudonouns that meets the
highest standards of psycholinguistic research. Interestingly, in pseudoword construction
paradigms, there is no method that preserves grammatical class while still maintaining control
over the psycholinguistic features of a word. Before we introduce such a method, we
will first introduce other commonly used methods of pseudoword generation.
Methods of Pseudoword Generation
One of the rst widely used methods of non- and pseudoword production was letter substi-
tution. Basically, the letter substitution method is based on random letter selection and the
substitution of those with other letters. This results in obtaining one or more pseudowords
from one real word. For example, a pseudoword like rompuner, comnumer, or cormuter
could be obtained from the word computer by substituting one or two letters (Brown et
al., 1987). For example, this method was used to generate a list of 3,023 Polish nouns
for linguistic research (Imbir et al., 2015) and in the psychology of language eld within
studies on the verb–agency link (Formanowicz et al., 2017). It was also used in the Eng-
lish Lexicon Project (Balota et al., 2007), in French (Ferrand et al., 2010), and in Dutch
(Keuleers et al., 2010). Although widely used, this method of generating pseudowords does
not guarantee that the produced items will meet the linguistic criteria due to the lack of
simultaneous evaluation of features, such as the frequency of bigrams and trigrams. Therefore,
pseudowords are often also evaluated by competent judges who determine whether a
given string of letters can be considered a pseudoword or not in order to select units that
are subjectively felt to resemble real words but lack literal meaning. The judging criteria,
however, are chosen subjectively by the researchers. For example, some pseudoword evaluation
criteria are based on whether pseudowords are constructed from existing or potential
syllables or if there is a possibility that the given pseudowords may be read fluently (i.e., if
they are pronounceable; Imbir et al., 2015). This is often insufficient, because it is possible
to fluently read a pseudoword beginning with the letter ę (ęgatek) or the consonant cluster
“pf” (pfkiczyć), yet these pseudowords do not follow the phonotactic and orthographic rules
of the Polish language (no real Polish word begins with those letters). Additionally, it
is possible that the syllables of pseudowords ending in vowels could be considered probable
due to the open syllable rule (a phonetic rule of the Proto-Slavic language from which
the Polish language originated), whereby each syllable in a word most often ends with a
vowel. Given this, words that have syllables ending in a vowel may sound familiar to natural
language users, leading to an expansion of the criteria for including pseudowords (for
example, ignoring the rules described above). The lack of algorithmic control over stimulus
production increases the frequency of atypical occurrences of phonetic clusters within the
syllable. Overall, there are a number of factors that can affect stimuli created with the use
of the mentioned method, thus potentially inducing a bias in the results of studies on how
people process linguistic items.
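The letter substitution method discussed above is simple enough to sketch in a few lines of Python. The function name, its parameters, and the choice of the lowercase Latin alphabet are assumptions for illustration; this is not the procedure of any of the cited studies. The sketch also makes the limitation visible: nothing in it checks bigram frequencies or phonotactic legality.

```python
import random
import string


def substitute_letters(word, n_subs=1, alphabet=string.ascii_lowercase, rng=None):
    """Replace n_subs randomly chosen letters of word with different letters.

    No phonotactic or orthographic check is performed, so the output may
    contain letter clusters that never occur in the target language.
    """
    rng = rng or random.Random()
    letters = list(word)
    # Pick distinct positions so that exactly n_subs letters change.
    for pos in rng.sample(range(len(letters)), n_subs):
        choices = [c for c in alphabet if c != letters[pos]]
        letters[pos] = rng.choice(choices)
    return "".join(letters)


rng = random.Random(2022)  # fixed seed only to make the demonstration repeatable
for _ in range(3):
    print(substitute_letters("computer", n_subs=2, rng=rng))
```

Candidates produced this way would still need to be screened, either by judges or by an automated check against language statistics.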
Over time, methods of generating pseudoword items have become more technically
advanced. Researchers have developed programs, such as MCWord (Medler & Binder,
2005), WordGen (Duyck et al., 2004), and WordCreator (Trost, 2002), to produce pseudowords.
All these programs work at the letter level and perform N-gram matching using
language rules defined in the software settings. Such methods are based on an amalgamation
of subword units (typically syllables) to obtain pseudowords that meet the eligibility criteria
for the chosen language. The most important criterion is based on the observation that in
real, natural language some bigrams and trigrams have higher frequency than others (Suen,
1979). Thus, their overall frequency distribution should be preserved in the pseudowords
produced (e.g., Solso et al., 1979). The biggest advantage of this approach is that it is based
on the most frequent cluster combination, which provides the opportunity to obtain stimuli
that resemble real word letter combinations in syllables, and it also includes the calculation
of the orthographic neighborhood (e.g., Coltheart’s N). However, in this case as well,
the generated pseudowords do not have specific grammatical class markers but are merely
clusters of word-like syllables. In addition, these programs usually already contain a list of
words from which pseudowords are generated. Consequently, it is only possible to create
stimuli with a fixed input embedded in the particular tool. Another problem a researcher
may encounter is that these tools are only developed for a few languages such as English
(MCWord and WordGen), French (WordGen), and Dutch (WordGen), and due to the noneditability
of their functions, it is not possible to obtain stimuli in other languages.
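As a rough illustration of the bigram criterion described above, the following Python sketch counts bigrams in a reference lexicon and reports the rarest bigram count of a candidate string. It does not reproduce MCWord, WordGen, or WordCreator; the function names, the "#" word-boundary marker, and the four-word toy lexicon are assumptions.

```python
from collections import Counter


def ngram_counts(lexicon, n=2):
    """Count letter n-grams over a lexicon, with '#' marking word boundaries."""
    counts = Counter()
    for word in lexicon:
        padded = f"#{word}#"
        counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return counts


def min_ngram_count(candidate, counts, n=2):
    """Count of the rarest n-gram in candidate; 0 means an unattested n-gram."""
    padded = f"#{candidate}#"
    return min(counts[padded[i:i + n]] for i in range(len(padded) - n + 1))


toy_lexicon = ["kot", "kosa", "rok", "tor"]  # hypothetical stand-in for a corpus
bigrams = ngram_counts(toy_lexicon)
print(min_ngram_count("kor", bigrams))   # every bigram is attested at least once
print(min_ngram_count("ękot", bigrams))  # 0: no lexicon word starts with "ę"
```

A real application would use a large corpus and would typically compare the whole frequency distribution rather than only the minimum.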
One of the recent breakthrough advancements in pseudoword production is the Wuggy
algorithm (Keuleers & Brysbaert, 2010), which can be used for the construction of polysyllabic
pseudowords that obey phonotactic constraints. The Wuggy algorithm efficiently
produces legal sequences (words and pseudowords) in the language of choice using bigram
chains that are built based on language-specific lexicons (Keuleers & Brysbaert, 2010). This
approach alleviates the combinatorial explosion problem that is present in solutions that are
based on the production of all combinations of subsyllabic elements (i.e., onset, nucleus, and
coda) that are legal in the language of choice (Keuleers & Brysbaert, 2010). The Wuggy
algorithm accounts for the probability of the occurrence of subsyllabic elements by matching
the language pattern with the source word used as a reference (Keuleers & Brysbaert,
2010). The Wuggy algorithm can be downloaded and partially customized. The words from
which the output will be generated can be typed independently or imported as a list, allowing
more flexibility in the selection of real words. These can be, for example, words with
specific emotional valence. The algorithm’s functions allow researchers to determine the
length of pseudowords with respect to real words by matching subsyllabic segment length,
which results in pseudowords that have the same syllable structure as the input words.
Wuggy also calculates orthographic neighborhoods using the OLD20 method, which quantifies
the similarity of the given pseudoword generated to the 20 most similar words from the
corpus. With the function of splitting the word into syllables before generating the output
and adjusting the values of the subsyllabic elements, it is possible to generate pseudowords
from the input coming from English (Tucker & Brenner, 2017), Dutch (Heyman et al.,
2015), German (Hasenäcker et al., 2017), Turkish (Erten, 2013), Basque (Ferré et al., 2015),
French (De Simone et al., 2021), Spanish (Aguasvivas et al., 2018), and Bulgarian (Shtereva
et al., 2020). This makes Wuggy a tool with the most extensive and detailed options that can
be customized for not only positional languages but also those with more complex morphology,
such as Polish. Given that Wuggy can compute the psycholinguistically relevant properties
of words, such as OLD20, that play an important role in pseudoword construction, we
used this tool to produce Polish pseudowords. However, by default, the Wuggy algorithm
does not allow for constructing stimuli with predefined features, such as suffixes indicative
of pseudowords belonging to a grammatical class. To address this gap, we provide a
pipeline that employs the Wuggy algorithm and facilitates the production of pseudowords
of a particular grammatical class in inflectional languages with additional fine control over
other significant psycholinguistic features of the output pseudowords.
Pipeline for Polish Pseudonoun and Pseudoverb Generation and Initial
Selection
We used Python as the primary scripting language to generate Polish pseudowords. The
process of pseudoword generation is presented in Fig. 1 and outlined in more detail in the
subsections “Stimuli Generation and Selection” for Studies 1 and 2. We conducted two studies
that allowed us to evaluate the stimuli using ratings from online surveys to cross-check
machine and human evaluations of the stimuli. This resulted in two sets of pseudowords containing
information on their grammatical properties (e.g., grammatical class, grammatical gender
of pseudonouns) and psycholinguistically relevant properties (e.g., number of letters, number
of syllables, and OLD20). To resolve the problem of the probability of letter-cluster
occurrences in the beginning of the word (Hawelka et al., 2013), we also provided the first-syllable
frequency for each pseudoword that we extracted from the Polish corpus (Lexical
Computing CZ s.r.o., 2015; Jakubíček et al., 2013; Suchomel & Pomikálek, 2012). The
supplementary materials also contain results of a rating study (e.g., the percent of correct
identifications of grammatical class, mean and standard deviation of ratings of similarity to
a real word).
To select the most characteristic grammatical endings for each of the chosen grammatical
classes, we used data from the most common Polish language dictionaries, which contain
summaries of morphological features embedded in nouns and verbs (Drabik et al., 2018;
Dunaj, 2012; Sobol, 2002). We extracted specific exponents for feminine, masculine, and
neuter nouns in their nominative form and verbs in the infinitive form for Stimuli Set 1 and
kept the last syllable of the real word in Stimuli Set 2.
Importantly, the proposed method of pseudoword stimuli generation can be extended to
any other grammatical category and, thus, offers a broader framework for pseudoword construction
for all languages that rely on words’ morphological features to signify their grammatical
properties. This capability also easily extends beyond grammatical categories to all
grammatical, syntactic, or other features that go beyond the semantics and are indicated by
a word’s morphology in a given language (e.g., gender in nouns and tense in verbs). Researchers
should verify that such an indication is unambiguous for the language in which they
plan to conduct research. For example, in Polish, inflection of a verb can correspond to the
gender of the agent in the sentence, and feminine nouns and several masculine nouns use the
same suffixes: ‘Koleżanka (f.)/Kolega (m.) otwiera drzwi.’ (‘A colleague opens the door’).
Thus, pseudowords pertaining to any of these categories would not be unambiguously distinguishable.
For that reason, we abstained from the generation of this type of stimuli in our
example pseudoword sets, and we used only infinitive verbs and excluded masculine nouns
that have suffixes identical to feminine nouns. The scripts used to generate the two initial
sets of pseudowords for this paper are available as supplemental online material (SOM) at
https://gin.g-node.org/SocGramLab/pseudo-word-stimuli-with-arbitrary-constrains.git.
Study 1: Pseudowords with Different Stems
Stimuli Generation and Selection
We extracted the most frequently used Polish words using the Sketch Engine
(https://app.sketchengine.eu) framework (Kilgarriff et al., 2004). Specifically, we used the Wordlist tool
together with the Polish web corpus (Lexical Computing CZ s.r.o., 2015; Jakubíček et al.,
2013; Suchomel & Pomikálek, 2012). This corpus comprises more than 7 billion words
in 22 million online documents that were crawled in June 2012 to obtain over 36 million
unique word forms. As this was the primary data input for Study 1, we selected 1,000 of the
most frequent verbs and 1,000 of the most frequent nouns.
The initial word lists were filtered using the following procedure. First, we removed
duplicate entries and word forms that contained any characters not in the standard Polish
alphabet (i.e., word forms that were included in the corpus due to errors and defective
procedures, such as optical character recognition). Next, to obtain additional characteristics
of each word, we used the Grammatical Dictionary of Polish (Saloni et al., 2015), which
includes a Morfeusz 2 inflectional analyzer and generator (Kieraś & Woliński, 2017). For
verbs, we selected only infinitives. For nouns, only the nominative forms of singular masculine,
feminine, and neuter forms were preserved (further classification details, including
lists of suffixes used to assign these categories, are available in the SOM). Furthermore,
we kept only nouns classified as common nouns and disregarded any words that were classified
as belonging to a specialized domain, such as music (Latin-based, e.g., allegro, or
“fast music tempo”), linguistics (relating to metalinguistics, e.g., przypadek, or “case”), or
archaic (rarely used in modern Polish, e.g., dziewierz, or “brother-in-law”).
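The first filtering step described above (removing duplicates and word forms with characters outside the standard Polish alphabet) might look roughly like the sketch below. It is not the script from the SOM; the function name is hypothetical, and lowercasing before comparison is an assumption of this sketch. The later steps, which rely on the Grammatical Dictionary of Polish and Morfeusz 2, are not shown.

```python
# The 32 letters of the standard Polish alphabet (q, v, and x are not included).
POLISH_ALPHABET = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")


def clean_word_list(word_forms):
    """Drop duplicates and forms containing non-Polish characters, keeping order."""
    seen = set()
    kept = []
    for form in word_forms:
        word = form.strip().lower()  # assumption: case is ignored when comparing
        if not word or word in seen:
            continue
        if set(word) <= POLISH_ALPHABET:
            seen.add(word)
            kept.append(word)
    return kept


raw_forms = ["robić", "Robić", "dom", "d0m", "taxi", "kość"]  # toy input
print(clean_word_list(raw_forms))  # ['robić', 'dom', 'kość']
```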
Fig. 1 Schematic depiction of the pseudoword generation pipelines used in Studies 1 and 2. Further
implementation details are provided in the main text, and Python code is provided in the SOM
Next, words were hyphenated according to Polish syllabication rules using Pyphen
(Kozea Community, 2018). We set the language dictionary to “pl_PL” and the minimum
number of characters of the first and last syllable to 1. We eliminated words that had fewer
than two syllables to ensure that words clearly indicated the grammatical class with a suffix
in the last syllable. We also eliminated words that had more than four syllables to avoid
obtaining pseudowords with extremely different lengths.
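The syllable-count restriction can be sketched as below. The helper `keep_two_to_four_syllables` is a hypothetical name, and the syllabifier is passed in as a callable so that the example runs without Pyphen; the comment shows how Pyphen could plausibly be plugged in with the settings named in the text. The toy syllable splits are illustrative assumptions, not Pyphen output.

```python
# Sketch: keep only words with two to four syllables.
# `syllabify` is any callable returning a list of syllables. With Pyphen one
# could use (assuming the package and its pl_PL dictionary are installed):
#     import pyphen
#     dic = pyphen.Pyphen(lang="pl_PL", left=1, right=1)
#     syllabify = lambda w: dic.inserted(w).split("-")


def keep_two_to_four_syllables(words, syllabify, min_syll=2, max_syll=4):
    """Return (word, syllables) pairs whose syllable count is within bounds."""
    kept = []
    for word in words:
        syllables = syllabify(word)
        if min_syll <= len(syllables) <= max_syll:
            kept.append((word, syllables))
    return kept


if __name__ == "__main__":
    # Toy stand-in for a real hyphenator: pre-split forms, for illustration only.
    toy = {
        "dom": ["dom"],
        "pisać": ["pi", "sać"],
        "kobieta": ["ko", "bie", "ta"],
        "niebezpieczeństwo": ["nie", "bez", "pie", "czeń", "stwo"],
    }
    print(keep_two_to_four_syllables(toy, toy.get))
```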
To ensure that the input nouns and verbs were matched with respect to their usage fre-
quency in real language, we applied the Hungarian algorithm to create pairs of nouns and
verbs with the closest frequency in the corpus (Kuhn, 1955). In short, for a given 2D array
containing some weights (interpreted as the penalties or costs of assigning items that typi-
cally belong to two distinct categories to pairs), this algorithm solves the assignment prob-
lem in which the sum of weights (i.e., total penalty/cost) is minimized. In our case, the array
represented the difference between the frequencies of verbs and nouns in the corpus. This
resulted in a collection of source nouns and verbs with the most closely matched frequencies.
The average absolute frequency difference for all pairs, normalized by the average
frequency of the items in the pair, was 0.08 with a standard deviation of 0.27 (p.d.u. based
on the difference between words with respect to their absolute frequency, which is a Sketch
Engine metric defined as the direct count of how many times a given item was found in the
corpus).
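The frequency matching described above can be illustrated with a small, self-contained sketch. It uses a standard pure-Python formulation of the Hungarian algorithm so that it runs without external packages; it is not the code from the SOM. The function names and the toy frequencies are invented for the example, and the sketch assumes there are at least as many verbs as nouns.

```python
# Sketch: pair nouns and verbs by corpus frequency with the Hungarian algorithm.
# Pure-Python implementation for illustration; all frequencies are invented.


def hungarian(cost):
    """Solve the assignment problem for an n x m cost matrix (n <= m).

    Returns a list where element `row` is the column matched to that row,
    such that the summed cost of all matches is minimal.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    match = [0] * (m + 1)    # match[j]: 1-based row assigned to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = match[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        while j0:            # flip matches along the augmenting path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if match[j]:
            assignment[match[j] - 1] = j - 1
    return assignment


def match_by_frequency(noun_freq, verb_freq):
    """Pair each noun with a verb so that the total |frequency difference| is minimal."""
    nouns, verbs = list(noun_freq), list(verb_freq)
    cost = [[abs(noun_freq[n] - verb_freq[v]) for v in verbs] for n in nouns]
    return [(nouns[i], verbs[j]) for i, j in enumerate(hungarian(cost))]


def mean_normalized_difference(pairs, noun_freq, verb_freq):
    """Mean of |f_noun - f_verb| divided by the mean frequency of the pair."""
    ratios = [
        abs(noun_freq[n] - verb_freq[v]) / ((noun_freq[n] + verb_freq[v]) / 2)
        for n, v in pairs
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    noun_freq = {"dom": 100, "rok": 50, "okno": 10}        # invented counts
    verb_freq = {"pisać": 12, "mieć": 95, "robić": 55}     # invented counts
    pairs = match_by_frequency(noun_freq, verb_freq)
    print(pairs, mean_normalized_difference(pairs, noun_freq, verb_freq))
```

For large inputs, a library routine such as SciPy's `linear_sum_assignment` would typically be preferred over a hand-written solver.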
Next, the word pairs were inputted into the Wuggy program (Keuleers & Brysbaert, 2010)
with the Polish module, which generated pseudowords based on the provided real words
without altering the suffix that denotes grammatical class. In order to preserve the suffix of
the original source word, we used Wuggy’s Matching Expression feature, which provides a way
to require pseudowords to match a provided regular expression (Keuleers & Brysbaert, 2010;
further information about regular expressions is widely available online, for a technical
description refer to IEEE et al., 2018). For example, using the source word “verbifying” and
the regular expression “.+ing$” would result in Wuggy generating only pseudowords based
on the source word and ending in –ing. In our pipeline, for each real word used as a source
of pseudowords we used a regular expression that required generated pseudowords to retain
the ultimate syllable from the source word. For the obtained pseudonouns and pseudoverbs,
we calculated the frequencies of their initial syllables (all but the last one), and we verified
that these two groups of stimuli were similar in terms of their syllable frequency distribution
(a Kolmogorov-Smirnov test was nonsignificant, giving no evidence that the two independent
samples were drawn from different distributions). We repeated this test for the first syllables
only, and this test was also nonsignificant, suggesting that our pseudonouns and pseudoverbs did not
differ with respect to the frequency of their first syllables. We obtained a total of 640 stimuli
(320 pseudonouns, e.g., syjcol, setylda, and 320 pseudoverbs e.g., chlocić, osordać), each of
which contains six to eight letters. For these pseudowords we computed OLD20 values with
reference to the Grammatical Dictionary of Polish (Saloni et al., 2015).
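The suffix-preserving constraint described above can be illustrated with Python's `re` module. The sketch builds, for a syllabified source word, a pattern of the form given in the text (".+ing$" for "verbifying") and uses it to screen candidate strings. The helper names are hypothetical, the syllable split of "verbifying" is an assumption, and the sketch does not reproduce how Wuggy itself applies its Matching Expression.

```python
# Sketch: build a regular expression that requires a candidate pseudoword to
# retain the last syllable of its source word, then screen candidates with it.
import re


def suffix_pattern(syllables):
    """Return a compiled pattern matching strings that end in the last syllable."""
    last = syllables[-1]
    # ".+" demands at least one character before the retained syllable.
    return re.compile(".+" + re.escape(last) + "$")


def keep_matching(candidates, pattern):
    """Keep only candidates that satisfy the suffix constraint."""
    return [c for c in candidates if pattern.match(c)]


if __name__ == "__main__":
    pattern = suffix_pattern(["ver", "bi", "fy", "ing"])   # assumed split
    print(pattern.pattern)
    print(keep_matching(["blorfing", "blorfed", "ing"], pattern))
```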
Stimuli Evaluation Procedure
The goal of this procedure was to test the pseudowords with multiple stems in terms of
the association of OLD20 values with people’s perception of the pseudowords and the extraction of
a grammatical class clearly indicative of a pseudoword being a verb or a noun. Participants
were invited to take part in an online study and were randomly assigned to one of 16 lists
that presented 40 pseudowords in a fixed random order. Before the stimuli evaluation task,
participants were asked to provide basic demographic information, such as age, gender,
education, foreign language knowledge, and diagnosed language processing disorders, such
as dyslexia or other language impairments. We also asked whether Polish was their native
language.
For each pseudoword, we asked two questions. The first concerned the similarity of
the presented pseudoword to any real word in Polish (e.g., “To what extent does the word
poceda resemble a word that exists in Polish?”), with answers ranging from 0 (not at all) to
4 (very much). The second question concerned the grammatical class of the presented pseu-
doword (e.g., “Which grammatical class might the word poceda be?”). Participants had to
select one of the presented options: “verb,” “noun,” “adjective,” “other,” or “I don’t know.”
Participants
A total of 328 people participated in this study (247 women, 76 men, 4 people who refused
to indicate their gender, and 1 person who declared their gender to be “other”; Mage = 31.06
years, SDage = 8.58 years). Twenty-four people were excluded from the data analysis due to
their declaration of diagnosed language processing disorders, and one person was excluded
because their declared native language was not Polish. We based the final evaluation of
pseudowords in this study on ratings from 303 people.
Resulting Word List for Study 1
In order to choose stimuli that were most clearly perceived as verbs or nouns, we applied
the following criteria. First, the accuracy of grammatical class identification had to be above
80% to ensure that words were commonly recognized as either verbs or nouns. Moreover,
we only chose pseudowords that were rated significantly less than 3 on the scale measuring
similarity to Polish words to ensure that the pseudowords were not too similar to any existing
word. The final dataset of pseudowords contained 26 feminine, 79 masculine, and 12
neuter nouns as well as 41 verbs in the infinitive form. A summary of the basic properties of
the words is presented in Table 1, and the dataset containing all stimuli and those selected
for our list is available in the SOM.
Table 1 Properties of Study 1 dataset

Variables                          Pseudonouns      Pseudoverbs
                                   M (SD)           M (SD)
Similarity to real words           1.93 (0.35)      2.19 (0.27)
Correct identification (%)         90.72 (6.24)     93.86 (6.51)
Orthographic neighbors (OLD20)     2.59 (0.18)      2.60 (0.15)

The results of a Spearman correlation indicated a nonsignificant relationship between
OLD20 values and average human ratings of pseudoword similarity to real words for
pseudonouns (r(115) = 0.12, p = .21) and a nonsignificant relationship between OLD20 values
and human ratings for pseudoverbs (r(39) = -0.23, p = .14). These findings indicate that
the OLD20 measure of similarity to existing words and human subjects’ judgment of
pseudowords’ similarity to real words are independent. Furthermore, even though the two
correlations were not significant, the correlation of pseudonouns and OLD20 values was positive
and the correlation of pseudoverbs and OLD20 values was negative. Therefore, we applied
Fisher’s Z-transformation to test whether there was a significant difference
between the two correlation coefficients. Importantly, the comparison indicated that the two
correlation coefficients were similar (z = -1.89, p = .06). Therefore, we can conclude that the
pseudonouns and pseudoverbs are similarly unrelated to OLD20 values.
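The comparison of two independent correlation coefficients via Fisher's z-transformation can be sketched as follows. The sketch assumes the standard formula with standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)) and a two-sided normal p-value; whether the authors used exactly this variant for Spearman coefficients is not stated in the text. The function name is hypothetical, and the input takes the pseudoverb coefficient as negative, as the paragraph above describes.

```python
# Sketch: test whether two independent correlation coefficients differ,
# using Fisher's z-transformation (standard n - 3 variance, an assumption).
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Return (z, two-sided p) for the difference between r1 and r2."""
    z1 = math.atanh(r1)                       # Fisher transform of r1
    z2 = math.atanh(r2)                       # Fisher transform of r2
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p-value
    return z, p


if __name__ == "__main__":
    # Study 1 values: pseudoverbs (n = 41) versus pseudonouns (n = 117).
    z, p = compare_independent_correlations(-0.23, 41, 0.12, 117)
    print(round(z, 2), round(p, 2))
```

With these inputs the sketch gives a z of about -1.89 and a p of about .06, in line with the values reported above.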
Study 2: Pseudowords with Shared Stem
Stimuli Generation and Selection
We extracted the most frequently used Polish words using the Sketch Engine
(https://app.sketchengine.eu) framework (Kilgarriff et al., 2004). Specifically, similarly to Study 1,
we used the Wordlist tool together with the Polish web corpus (Lexical Computing CZ
s.r.o., 2015; Jakubíček et al., 2013; Suchomel & Pomikálek 2012); however, for Study 2 we
selected the 20,000 most frequent words as the primary data input (initial corpus).
Our aim in this study was to obtain pairs that consist of pseudoverbs and pseudonouns
that share a common stem but differ with respect to their suffixes (which enable distinction
between grammatical classes). To achieve this goal we developed a procedure for plausible
last-syllable substitution that resulted in the generation of pseudoword pairs with a common
stem. The initial steps of the procedure employed in this study were analogous to the
steps used in the first study. However, here, we did not use the Hungarian algorithm to
select verb–noun pairs of closest corpus frequency but instead utilized (as an input to the
Wuggy program) all nouns and verbs that were identified in the initial corpus and met the same
inclusion criteria as in Study 1. Similarly to Study 1, we only kept words containing standard
Polish alphabet characters, namely verbs in the infinitive form and common nouns in singular
nominative masculine, feminine, and neuter form. Next, based on the pseudowords generated
by Wuggy, we constructed a list of all penultimate syllables present in the Wuggy output
and, for each penultimate syllable, a list of all ultimate syllables that followed it in
that output. This resulted in an exhaustive list of all orthographically and phonologically
plausible ultimate syllable substitutions conditioned on the penultimate syllable present in
the Wuggy output. All ultimate syllables of the obtained pseudowords were possible
inflectional noun/verb endings because, as in Study 1, Wuggy was configured to retain the
last syllable of the source word in the pseudowords generated. Each ultimate syllable clearly
indicated the grammatical class of the pseudoword that it terminated. Based on these lists,
more pseudowords were generated by means of last-syllable substitution (all possible
substitutions were used, conditioned only on the penultimate syllable). This approach
minimized the probability of accidentally introducing impossible or very rare suffixes into
the generated pseudowords. Since we aimed to produce pseudoverb–pseudonoun pairs, for
any further processing, only pseudowords with stems that had been assigned (jointly) at
least one noun-indicating and at least one verb-indicating suffix were considered (we
removed pseudowords with a stem that was present in only one grammatical category). To
further ensure the orthographic and phonological plausibility of the pseudowords, we
removed words that contained bigrams that are
extremely rare or impossible in Polish, for example, “” or “yx” (for the full list, please
refer to the SOM). Using the Grammatical Dictionary of Polish (Saloni et al., 2015), we also
removed any real words that could have resulted from the syllable substitution procedure.
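The last-syllable substitution step can be sketched as follows. The code builds, from syllabified pseudowords, a table mapping each penultimate syllable to the ultimate syllables attested after it, generates all substitutions, and keeps only stems that end up with both a noun-like and a verb-like form. All function names are hypothetical, the syllable splits and two of the generated strings are illustrative assumptions, and the rule "ends in ć means verb" is a deliberate simplification standing in for the suffix lists in the SOM.

```python
# Sketch of last-syllable substitution conditioned on the penultimate syllable.
from collections import defaultdict


def build_substitution_table(syllabified):
    """Map each penultimate syllable to the ultimate syllables seen after it."""
    table = defaultdict(set)
    for syllables in syllabified:
        if len(syllables) >= 2:
            table[syllables[-2]].add(syllables[-1])
    return table


def substitute_last_syllables(syllabified, table):
    """Return {stem: set of pseudowords} after swapping in attested endings."""
    variants = defaultdict(set)
    for syllables in syllabified:
        if len(syllables) < 2:
            continue
        stem = "".join(syllables[:-1])
        for ending in table[syllables[-2]]:
            variants[stem].add(stem + ending)
    return variants


def keep_stems_with_both_classes(variants, is_verb):
    """Keep stems that have at least one verb form and at least one noun form."""
    kept = {}
    for stem, words in variants.items():
        verbs = sorted(w for w in words if is_verb(w))
        nouns = sorted(w for w in words if not is_verb(w))
        if verbs and nouns:
            kept[stem] = {"nouns": nouns, "verbs": verbs}
    return kept


if __name__ == "__main__":
    # Assumed syllable splits; "mo-cu-kać" is an invented input for illustration.
    wuggy_output = [("gro", "cu", "nek"), ("mo", "cu", "kać"), ("ro", "ce", "zja")]
    table = build_substitution_table(wuggy_output)
    variants = substitute_last_syllables(wuggy_output, table)
    # Simplified class rule (assumption): infinitive-like forms end in "ć".
    print(keep_stems_with_both_classes(variants, lambda w: w.endswith("ć")))
```

In this toy run the stem "roce" is dropped because only a noun-like ending follows the syllable "ce" in the input.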
Furthermore, we only kept words containing 6–8 characters. For the obtained pseudowords
the OLD20 metric was computed with reference to the Grammatical Dictionary of Polish
(Saloni et al., 2015), and to ensure that the resulting pseudowords were orthographically and
phonologically plausible but not too similar to real words, we removed any pseudowords
that had OLD20 below 2.0 or above 3.5. Additionally, because we intended to produce pairs
of stimuli that contained items of comparable properties, after the above filtering based on
OLD20, we only kept the pseudonoun and pseudoverb sets that shared a common stem. In
other words, a pseudonoun was removed from the set if no pseudoverb shared a stem with it,
and the other way around. For the same reason we only kept pseudonouns and pseudoverbs
that could form common-stem pairs whose members had the same number of characters and
differed in OLD20 (computed with reference to the Grammatical Dictionary of Polish; Saloni
et al., 2015) by less than 0.5. Furthermore, based on the noun suffixes, we assigned grammatical
genders to the obtained pseudonouns and only kept words whose suffixes
allowed for unambiguous grammatical gender classification. At this stage, we obtained 385
pseudonouns (121 masculine, 203 feminine, and 61 neuter). Next, we randomly selected 50
nouns for each grammatical gender category. The obtained 150 pseudonouns were used to
select 265 pseudoverbs with which they shared a common stem. Altogether, these consti-
tuted a set of 415 pseudowords (e.g., grocunek, grocukać, rocezja, rocerać) that were evalu-
ated by human subjects. Some pseudoverbs shared stems with more than one pseudonoun
(e.g., mocejać, mocetła, mocezja), and some pseudonouns shared stems with more than one
pseudoverb (e.g., rorekia, rorerać, roregać).
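The OLD20-based filters used in this section can be sketched as follows. OLD20 is taken here as the mean Levenshtein distance from a string to its 20 closest entries in a reference lexicon; the thresholds (2.0 to 3.5, pair difference below 0.5, equal length) come from the text. The function names, the toy lexicon, and the OLD20 values passed to `can_pair` are illustrative assumptions, not the authors' code or data.

```python
# Sketch: orthographic Levenshtein distance (OLD-n) and the filters built on it.


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def old_n(word, lexicon, n=20):
    """Mean distance to the n closest lexicon entries (fewer if the lexicon is small)."""
    distances = sorted(levenshtein(word, w) for w in lexicon if w != word)
    if not distances:
        raise ValueError("lexicon must contain at least one other word")
    nearest = distances[:n]
    return sum(nearest) / len(nearest)


def within_old_range(value, low=2.0, high=3.5):
    """Keep pseudowords whose OLD20 is neither below `low` nor above `high`."""
    return low <= value <= high


def can_pair(noun, verb, old_noun, old_verb, max_diff=0.5):
    """Pair members must have equal length and OLD20 values closer than max_diff."""
    return len(noun) == len(verb) and abs(old_noun - old_verb) < max_diff


if __name__ == "__main__":
    toy_lexicon = ["kos", "kod", "lot", "dom"]   # stand-in for a full dictionary
    print(old_n("kot", toy_lexicon, n=2))
    print(can_pair("grocunek", "grocukać", 2.8, 3.1))   # OLD20 values invented
```

Against a full dictionary this brute-force search is slow; it is meant only to make the definition and the filters concrete.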
Stimuli Evaluation Procedure
The goal of this procedure, analogous to that in Study 1, was to test the stimuli in terms of
association of OLD20 values with people’s pseudoword perceptions and extraction of gram-
matical class clearly indicative of a pseudoword being a pseudoverb or pseudonoun. In this
study, however, we used stimuli that shared a stem to more strongly emphasize grammatical
class differences in the last syllable. Participants were invited to take part in an online study
and were randomly assigned to one of 12 lists that presented 35 pseudowords in a fixed
random order. As in Study 1, before the stimuli evaluation task, participants were asked to
provide basic demographic information, including age, gender, education, foreign language
knowledge, and diagnosed language processing disorders, such as dyslexia or other language
impairments. We also asked whether Polish was their native language. Three questions
were asked for each pseudoword. The first concerned the possibility that a word could
be a real Polish word (i.e., whether a pseudoword was constructed in line with Polish lexical
rules; “Estimate probability in which poceda can be a Polish word”). The second question
asked about the similarity of the presented pseudoword to any real Polish word (e.g., “To
what extent does the word poceda resemble a word that exists in Polish?”). For both of
these questions, answers ranged from 0 (not at all) to 4 (very much). The third question
concerned the grammatical class of the presented pseudoword (e.g., “Which grammatical
class might the word poceda be?”). Participants had to select one of the presented options:
“verb,” “noun,” “adjective,” “other,” or “I don’t know.” Finally, we asked how carefully the
participants filled out the questionnaire, with answers ranging from 0 (careless) to 3 (very
carefully).
Participants
A total of 312 people participated in this study (228 women, 80 men, 2 people who refused
to indicate their gender, and 2 people who declared their gender to be “other”; Mage = 31.5
years, SDage = 9.51 years). We excluded 22 people from the data analysis due to their declaration
of diagnosed language processing disorders and four people based on their self-evaluated
low attention (0, or careless). We thus based our evaluation of the pseudowords
with shared stems on the ratings of 286 people.
Resulting Word List for Study 2
In order to select stimuli, we applied the following criteria. First, as in Study 1, the accuracy
of grammatical class identification had to be above 80% to ensure that words were
commonly recognized as either pseudoverbs or pseudonouns. Moreover, we only chose
pseudowords that were considered possible in Polish (ratings significantly higher than 3), as
this indicated phonological acceptance. Furthermore, we only chose words that were rated
as significantly less than 3 on the scale measuring similarity to Polish words, as this ensured
that the pseudowords were not too similar to any existing word. In addition, the OLD20
measure (M = 2.81 for pseudonouns and M = 2.84 for pseudoverbs) was used to select pseudowords
that were not too similar to or too distant from the real words on which they
were based and to construct pseudoword noun–verb pairs with similar OLD20 values. Next,
using the Hungarian algorithm, we paired pseudowords to establish noun–verb pairs with a
shared stem and the most similar properties. This resulted in a final dataset of pseudowords
containing 121 pseudonoun and pseudoverb pairs constructed from 93 pseudoverbs and 47
pseudonouns. Pairs were not exclusive with respect to stem (one pseudoverb could share a
stem with more than one pseudonoun and one pseudonoun could share stem with more than
one pseudoverb). A summary of the basic properties of the pseudowords set is presented in
Table 2, and the dataset, including all stimuli and those selected for our list, is available in
the SOM.
The Spearman correlation between the similarity of pseudowords to real Polish words
and the OLD20 measure was significant for pseudoverbs (r(91) = -0.31, p < .05) and nonsignificant
for pseudonouns (r(45) = -0.23, p = .13). In the case of pseudoverbs, we obtained
a significant but weak negative correlation, which means that the higher the resemblance to a
real Polish word, the lower the OLD20 value. It has to be noted, however, that the sample
size for pseudoverbs was larger than for pseudonouns, likely driving the significance
of the correlation coefficient. When we compared the two correlation coefficients from
independent samples using Fisher’s Z-transformation method, the result indicated a
nonsignificant difference: z = 0.47, p = .64. This suggests that
pseudonouns and pseudoverbs are similarly unrelated to OLD20 (as in Study 1) and that the
significance of the correlation between pseudoverbs and OLD20 is likely an artifact.
The Spearman correlations between OLD20 values and the possibility that the presented
pseudoword is a Polish word were calculated separately for the pseudonoun and pseudoverb
in each pair. The results indicated a nonsignificant relationship between OLD20 values and
the possibility that the presented pseudoword could be a Polish word for both pseudoverbs
(r(91) = 0.10, p = .36) and pseudonouns (r(45) = 0.11, p = .46).
Additionally, the z-score of comparison between the correlation coefficients for pseudoverbs
and pseudonouns in relation to OLD20 indicated that the two correlation coefficients
were similar: z = 0.05, p = .96.
General Discussion
In this paper, we have presented a pipeline for pseudoword generation and two sets of pseudoverbs
and pseudonouns that can be used in language research. In particular, we focused on
example applications in languages that use specific suffixes to mark parts of speech. We also
demonstrated how measures of orthographic distance (e.g., OLD20) between the generated
pseudowords and real words can be used to obtain pseudowords with varying degrees of
similarity to real words. By providing an open script, we enable researchers
to produce pseudowords, mainly in inflectional languages, in which the
grammatical class is determined by the exponent located at the end of a word. As the
preparation of stimuli for psycho- and neurolinguistic research requires careful consideration,
we proposed a detailed analysis of specific language properties, such as real word frequency,
phonological congruence, first-syllable frequency, orthographic distance to nearest
real words (OLD20), and grammatical class adequacy in Polish. In addition to controlling
the frequency of the real words used to create the pseudowords and the OLD20 measure,
we considered it necessary to also obtain human ratings for pseudoword evaluation. Instead
of using zero-one rating systems, such as “word” and “nonword,” as in the English Lexicon
Project lexical decision task (Balota et al., 2007), we used a 5-point scale, which allowed us
to achieve greater specificity in evaluating the prepared stimuli.
Overall results of our studies indicate that there is no systematic relation between average
human ratings of similarity to existing Polish words and OLD20. In Study 1 the
correlation was nonsignificant for pseudonouns and for pseudoverbs and the two correlation
coefficients were of similar magnitude. In Study 2, the correlation of pseudonouns and
OLD20 was also nonsignificant, but we observed a significant correlation of pseudoverbs
and OLD20. One possible explanation for the discrepancies between human and computerized
evaluation of pseudowords is the effect of possible associations and interpretations that
a pseudoword may trigger in human subjects that are not captured by computerized assessment.
We, however, treat this one significant result with caution because the sample size of
pseudoverbs was twice as large as the sample of pseudonouns, likely driving the significance
of the correlation coefficient, and the difference between the two correlation coefficients
was not significant. Furthermore, in terms of rating the possibility to be a real Polish word,
both pseudoverbs and pseudonouns were similarly uncorrelated with OLD20 values. Overall,
the pattern of results suggests that the presented pseudoverbs and pseudonouns are of similar
quality. Additionally, in Study 2 we found similar negative trends for the correlation coefficients
of OLD20 and both question responses for each of the pseudowords’ grammatical categories.
This indicates that, for both pseudonouns and pseudoverbs, the higher the OLD20 value, the lower
the participants’ rating of the pseudowords’ similarity to real Polish words and of their possibility
of being perceived as a potential Polish word. In other words, the fewer LD operations (letter
substitution, deletion, or insertion) are required to transform a real word into a pseudoword, the
more the pseudoword is judged as resembling a real word. This result is in line with previous
research that has investigated the impact of the OLD20 measure and lemma frequency on
the results of lexical decision and word-naming task performance (Kresse et al., 2012). In
that task, pseudowords with low OLD20 scores elicited higher naming error rates but
shorter reaction times than words with high OLD20 scores. Thus, the less distant a pseudoword
is from a real word, the more fluently it can be perceived as a natural language unit, but
at the same time, the opportunity for naming error becomes greater. Bearing in mind the
research of Kresse et al. (2012), in which OLD20 affected task performance, pseudowords
with average OLD20 in the attached sets of pseudonouns and pseudoverbs make it possible
to exclude the impact of OLD20 on perception while measuring, e.g., reaction times across the
grammatical categories. A final conclusion regarding the two presented datasets is that even
when using highly reliable tools for pseudoword construction, such as Wuggy, it is useful to
test how the words are perceived by potential participants.

Table 2 Properties of Study 2 dataset

Variables                          Pseudonouns      Pseudoverbs
                                   M (SD)           M (SD)
Similarity to real words           1.40 (0.48)      1.39 (0.39)
Possibility of being a real word   1.69 (0.49)      1.66 (0.34)
Correct identification (%)         90.69 (5.45)     91.38 (5.10)
Orthographic neighbors (OLD20)     2.82 (0.35)      2.83 (0.32)
The presented approach allowed us to produce and share a unique set of stimuli for studying
the impact of grammatical class in psycholinguistic research on language processing and for
testing grammatical processing differences between verbs and nouns. We used them as
examples of the application of a stimulus-producing algorithm to linguistic research. The
stimuli sets we created can be used by other researchers. The first set is more suitable for
general pseudoword-related linguistic research, as it provides a variety of pseudowords
with different stems. This means it can be used in research regarding grammatical classes
but also that its items can be used as fillers in other psycholinguistic studies. The second set
contains pseudowords that differ only in grammatical class exponent and, thus, provides an
opportunity to unify other pseudoword properties between pseudonouns and pseudoverbs.
This makes it more appropriate for specific grammatical class-related research.
To conclude, pseudowords provide a convenient tool for studying the morphosyntactic
properties of language processing because of the unique features they can be equipped with
or stripped of. As artificial language units, they might be a good source of knowledge
regarding language processing and its properties related to human behavior. They can be
used as stimuli for investigating language properties beyond semantics, such as iconicity of
sound (Köhler, 1929; Sapir, 1929) or morphological carriers of linguistic information that
have an impact on real-world feature perceptions (Adelman et al., 2018; Formanowicz et al.,
2017). However, pseudowords may also have drawbacks that researchers should be aware
of and that may bias their results, for example, the frequency of the first syllable (Carreiras & Perea,
2004), phonotactic violations, or the influence of the first letter (Scaltritti et al., 2018) on the
perception of the pseudoword. Because pseudowords occupy a gray area between nonwords
and real words, the process of fine-tuning their properties is a challenging multifaceted problem.
However, pseudowords are immensely informative for the psychology of language.
Funding This research was funded by the OPUS 14–2017/27/B/HS6/01049 grant of the Polish National Sci-
ence Centre, which was awarded to Magdalena Formanowicz.
Open Access funding enabled and organized by Projekt DEAL.
Declarations
Compliance with Ethical Standards The authors declare no conflict of interest. This research was funded by
the National Science Centre, Poland (OPUS 14: 2017/27/B/HS6/01049), and has been approved by the Ethical
Committee of Psychology Department of Nicolaus Copernicus University in Toruń (decision no. 9/2018).
Prior to taking part in the studies presented in this article, participants provided informed consent.
Conflict of Interest The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included in the
article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is
not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright
holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Adelman, J. S., Estes, Z., & Cossu, M. (2018). Emotional sound symbolism: Languages rapidly signal valence
via phonemes. Cognition, 175(January), 122–130. doi:https://doi.org/10.1016/j.cognition.2018.02.007
Aguasvivas, J. A., Carreiras, M., Brysbaert, M., Mandera, P., Keuleers, E., & Duñabeitia, J. A. (2018).
SPALEX: A Spanish lexical decision database from a massive online data collection. Frontiers in Psy-
chology, 9, 2156
Álvarez, C. J., Carreiras, M., & De Vega, M. (2000). Syllable-frequency effect in visual word recognition:
Evidence of sequential-type processing. Psicológica, 21(2), 341–374
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B. … Treiman, R. (2007).
The English lexicon project. Behavior Research Methods, 39(3), 445–459. doi:https://doi.org/10.3758/
BF03193014
Barca, L., & Pezzulo, G. (2012). Unfolding visual lexical decision in time. PLoS ONE, 7(4), 1–14. doi:https://
doi.org/10.1371/journal.pone.0035932
Brown, T. L., Carr, T. H., & Chaderjian, M. (1987). Orthography, familiarity, and meaningfulness reconsidered:
Attentional strategies may affect the lexical sensitivity of visual code formation. Journal of
Experimental Psychology: Human Perception and Performance, 13(1), 127–139.
doi:https://doi.org/10.1037/0096-1523.13.1.127
Carreiras, M., Mechelli, A., & Price, C. J. (2006). Effect of word and syllable frequency on activation during
lexical decision and reading aloud. Human Brain Mapping, 27(12), 963–972
Carreiras, M., & Perea, M. (2004). Naming pseudowords in Spanish: Effects of syllable frequency. Brain and
Language, 90(1–3), 393–400
Chetail, F., & Mathey, S. (2009). The syllable frequency effect in visual recognition of French words: A study
in skilled and beginning readers. Reading and Writing, 22(8), 955–973
Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). DRC: A dual route cascaded model
of visual word recognition and reading aloud. Psychological Review, 108, 204–256. https://doi.
org/10.1037/0033-295X
De Simone, E., Beyersmann, E., Mulatti, C., Mirault, J., & Schmalz, X. (2021). Order among chaos: Cross-linguistic
differences and developmental trajectories in pseudoword reading aloud using pronunciation
entropy. PLoS ONE, 16(5). doi: 10.1371/journal.pone.0251629
Dorner, G., & Harris, C. L. (1997). When pseudowords become words - Effects of learning on orthographic
similarity priming. In M. G. Shafto and P. Langley (Eds.), Proceedings of the Nineteenth Annual Conference
of the Cognitive Science Society (pp. 185–190). Lawrence Erlbaum Associates
Drabik, L., Kubiak-Sokół, A., Sobol, E., Wiśniakowska, L., Stankiewicz, A., & Naukowe, W., P. W. N. (Eds.).
(2018). Słownik języka polskiego PWN. Wydawnictwo Naukowe PWN SA
Dunaj, B. (2012). Słownik języka polskiego. IBIS
Duyck, W., Desmet, T., Verbeke, L. P. C., & Brysbaert, M. (2004). WordGen: A tool for word selection and
nonword generation in Dutch, English, German, and French. Behavior Research Methods, Instruments,
and Computers, 36(3), 488–499. doi:https://doi.org/10.3758/BF03195595
Erten, B. (2013). Adapting and testing psycholinguistic toolboxes for Turkish visual word recognition studies
(Master Thesis, Middle East Technical University)
Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., & Pallier, C. (2010). The French
lexicon project: Lexical decision data for 38,840 French words and 38,840 pseudo words. Behavior
Research Methods, 42(2), 488–496. doi:https://doi.org/10.3758/BRM.42.2.488
Ferré, P., Ventura, D., Comesaña, M., & Fraga, I. (2015). The role of emotionality in the acquisition of new
concrete and abstract words. Frontiers in Psychology, 6, 976
Formanowicz, M., Roessel, J., Suitner, C., & Maass, A. (2017). Verbs as linguistic markers of agency:
The social side of grammar. European Journal of Social Psychology, 47(5), 566–579. doi:https://doi.
org/10.1002/ejsp.2231
Grainger, J., & Jacobs, A. M. (1996). Orthographic processing in visual word recognition: A multiple read-
out model. Psychological Review, 103, 518–565. doi:https://doi.org/10.1037/0033-295X.103.3.518
Harm, M. W., & Seidenberg, M. S. (2004). Computing the meanings of words in reading: Cooperative
division of labor between visual and phonological processes. Psychological Review, 111, 662–720.
doi:https://doi.org/10.1037/0033-295X.111.3.662
Hasenäcker, J., Schröter, P., & Schroeder, S. (2017). Investigating developmental trajectories of morphemes
as reading units in German. Journal of Experimental Psychology: Learning, Memory, and Cognition,
43(7), 1093–1108
Hawelka, S., Schuster, S., Gagl, B., & Hutzler, F. (2013). Beyond single syllables: The effect of first syllable
frequency and orthographic similarity on eye movements during silent reading. Language and
Cognitive Processes, 28(8), 1134–1153. doi:https://doi.org/10.1080/01690965.2012.696665
Heim, S., Eickhoff, S. B., Ischebeck, A. K., Supp, G., & Amunts, K. (2007). Modality-independent
involvement of the left BA 44 during lexical decision making. Brain Structure and Function, 212(1), 95–106
Heyman, T., Van Rensbergen, B., Storms, G., Hutchison, K. A., & De Deyne, S. (2015). The influence of
working memory load on semantic priming. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 41(3), 911–920
Huntsman, L. A., & Lima, S. D. (2002). Orthographic neighbors and visual word recognition. Journal of
Psycholinguistic Research, 31(3), 289–306. doi:https://doi.org/10.1023/A:1015544213366
IEEE, The Open Group, and ISO/IEC JTC 1/SC22/WG15 (2018). Regular Expressions. Single UNIX®
Specification, Version 4 (2018 Edition)
Imbir, K. K., Spustek, T., & Żygierewicz, J. (2015). Polish pseudo-words list: Dataset of 3,023 stimuli
with competent judges’ ratings. Frontiers in Psychology, 6(6), 1–3.
doi:https://doi.org/10.3389/fpsyg.2015.01395
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family.
In 7th International Corpus Linguistics Conference CL (pp. 125–127)
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research
Methods, 42(3), 627–633. doi:https://doi.org/10.3758/BRM.42.3.627
Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition
studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers
in Psychology, 1(11), 1–15. doi:https://doi.org/10.3389/fpsyg.2010.00174
Kieraś, W., & Woliński, M. (2017). Morfeusz 2 – analizator i generator fleksyjny dla języka polskiego. Język
Polski, 97(1), 75–83
Kilgarri, A., Rychlý, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In G. Williams & S. Vessier
(Eds.), Proceedings of the 11th Euralex International Congress (pp. 105–115). Université de Bretagne-
Sud, Faculté des Lettres et des Sciences Humaines
Kissler, J., & Herbert, C. (2013). Emotion, etmnooi, or emitoon? Faster lexical access to emotional than to
neutral words during reading. Biological Psychology, 92(3), 464–479
Kozea Community (2018). Pyphen (Release 0.9.5) [Computer software] Retrieved December 14, 2018, from
https://pyphen.org/
Köhler, W. (1929). Gestalt psychology. Liveright
Kresse, L., Kirschner, S., Dipper, S., & Belke, E. (2012). Towards exploring the specific influences of
word-form frequency, lemma frequency and OLD20 on visual word recognition and reading aloud. Lexical
Resources in Psycholinguistic Research, 3, 9
Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics, 52(1),
7–21. doi:https://doi.org/10.1002/nav.20053
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for
30,000 English words. Behavior Research Methods, 44(4), 978–990.
doi:https://doi.org/10.3758/s13428-012-0210-4
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet
Physics Doklady, 10(8), 707–710
Lexical Computing CZ s.r.o (2015). plTenTen – Polish corpus from the web. [Corpus] Retrieved from
https://www.sketchengine.eu/pltenten-polish-corpus/
Longtin, C. M., & Meunier, F. (2005). Morphological decomposition in early visual word processing.
Journal of Memory and Language, 53(1), 26–41
Medler, D. A., & Binder, J. R. (2005). MCWord: An on-line orthographic database of the English language.
Retrieved from www.neuro.mcw.edu/mcword/
Moravcsik, E. A. (2013). Introducing language typology. Cambridge University Press
Opitz, B. (2004). Brain correlates of language learning: The neuronal dissociation of rule-based versus
similarity-based learning. Journal of Neuroscience, 24(39), 8436–8440.
doi:https://doi.org/10.1523/JNEUROSCI.2220-04.2004
Perea, M., Rosa, E., & Gómez, C. (2005). The frequency effect for pseudowords in the lexical decision task.
Perception and Psychophysics, 67(2), 301–314. doi:https://doi.org/10.3758/BF03206493
Polański, K. (1999). Encyklopedia językoznawstwa ogólnego (Encyclopedia of general linguistics). Zakład
Narodowy im. Ossolińskich
Preti, E., Suttora, C., & Richetin, J. (2016). Can you hear what I feel? A validated prosodic set of angry,
happy, and neutral Italian pseudowords. Behavior Research Methods, 48(1), 259–271
Price, C. J., Wise, R. J. S., & Frackowiak, R. S. J. (1996). Demonstrating the implicit processing of visually
presented words and pseudowords. Cerebral Cortex, 6(1), 62–70
Rosinski, R. R., & Wheeler, K. E. (1972). Children’s use of orthographic structure in word discrimination.
Psychonomic Science, 26(2), 97–98
Saloni, Z., Woliński, M., Wołosz, R., Gruszczyński, W., & Skowrońska, D. (2015). Grammatical dictionary
of Polish. Retrieved from http://sgjp.pl/about/
Sapir, E. (1929). A study in phonetic symbolism. Journal of Experimental Psychology, 12, 225–239
Scaltritti, M., Dufau, S., & Grainger, J. (2018). Stimulus orientation and the first-letter advantage. Acta
Psychologica, 183, 37–42
Shtereva, K., Hadzhiyska, B., Totev, T., & Mihaylova, M. S. (2020). Application of the Wuggy method
for generation of pseudo-words in the Bulgarian language. Knowledge International Journal, 43(6),
1219–1226
Siew, C. S. Q. (2018). The orthographic similarity structure of English words: Insights from network science.
Applied Network Science, 3(1), 13. doi:https://doi.org/10.1007/s41109-018-0068-1
Snyder, W. B. (1995). Language acquisition and language variation: The role of morphology (Doctoral
dissertation, Massachusetts Institute of Technology)
Sobol, E. (Ed.). (2002). Nowy słownik języka polskiego. Wydawnictwo Naukowe PWN
Solso, R. L., Barbuto, P. F., & Juel, C. L. (1979). Bigram and trigram frequencies and versatilities in the
English language. Behavior Research Methods and Instrumentation, 11(5), 475–484.
doi:https://doi.org/10.3758/BF03201360
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the
seventh Web as Corpus Workshop (WAC7) (pp. 39–43)
Suen, C. Y. (1979). N-gram statistics for natural language understanding and text processing. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 164–172.
doi:https://doi.org/10.1109/TPAMI.1979.4766902
Trost, S. (2002). WordCreator. Retrieved from https://www.sttmedia.com/wordcreator
Tucker, B. V., & Brenner, D. (2017). Exploring the acoustic characteristics of individual variation. The
Journal of the Acoustical Society of America, 141(5), 3579–3579
Tyler, L. K., Moss, H. E., Galpin, A., & Voice, J. K. (2002). Activating meaning in time: The role of
imageability and form-class. Language and Cognitive Processes, 17(5), 471–502.
doi:https://doi.org/10.1080/01690960143000290
Vigliocco, G., Vinson, D. P., Druks, J., Barber, H., & Cappa, S. F. (2011). Nouns and verbs in the brain:
A review of behavioural, electrophysiological, neuropsychological and imaging studies. Neuroscience
and Biobehavioral Reviews, 35(3), 407–426. doi:https://doi.org/10.1016/j.neubiorev.2010.04.007
Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart’s N: A new measure of orthographic
similarity. Psychonomic Bulletin and Review, 15(5), 971–979. doi:https://doi.org/10.3758/PBR.15.5.971
Ziegler, J. C., Besson, M., Jacobs, A. M., Nazir, T. A., & Carr, T. H. (1997). Word, pseudoword, and nonword
processing: A multitask comparison using event-related brain potentials. Journal of Cognitive
Neuroscience, 9(6), 758–775
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.