Language modelling in large vocabulary speech recognition has traditionally been based on words. A lexicon of the most common words of the language in question is created, and the recogniser is restricted to the words in that lexicon. In Finnish, however, it is more difficult to create an extensive lexicon, since compounding, numerous inflections and suffixes considerably increase the number of commonly used word forms. The problem is that reasonably sized lexica lack many common words, while for very large lexica it is hard to estimate a reliable language model. We have previously reported a new approach for improving the recognition of inflecting or compounding languages in large vocabulary continuous speech recognition tasks. Significant reductions in error rates have been obtained by replacing a traditional word lexicon with a lexicon based on morpheme-like word fragments learnt directly from data. In this paper, we evaluate these so-called statistical morphs further, and compare them to grammatical morphs and very large word lexica using n-gram language models of different orders. When compared to the best word model, the morph models seem to be clearly more effective with respect to entropy, and give a 30% relative error-rate reduction in a Finnish recognition task. Furthermore, the statistical morphs seem to be slightly better than the rule-based grammatical morphs.