Figure 1 - uploaded by Béatrice Daille
Parameters setting and compound splitting  

Source publication
Article
Compounding is present in a large variety of languages in different proportions. Compound rate in the text obviously depends on the language, but also on the genre and the domain. Scientific and technical texts are especially conducive to compounding, even in the languages that are not traditionally admitted as highly compounding ones. In this arti...

Citations

...                           Language(s)  Approach      POS scope  Parent no.
Khaitan et al. (2009)         en           Split-point   Any        Any
Fritzinger and Fraser (2010)  de           Split-point   Any        2
Henrich and Hinrichs (2011)   de           Valid-output  Nominal    2
Clouet and Daille (2014)      en, ru       Valid-output  Any        2
Riedl and Biemann (2016)      de, nl, en   Split-point   Any        Any
Krotova et al. (2020)         de           Split-point   Nominal    Any
[...]                         cs           Valid-output  Any        Any
Vodolazsky and Petrov (2021)  ru           Valid-output  Any        2

Table 1: Comparison of various compound splitters described in the literature, sorted by year of publication. ...
Conference Paper
We present PaReNT (Parent Retrieval Neural Tool), a deep-learning-based multilingual tool that performs parent retrieval and word-formation classification in English, German, Dutch, Spanish, French, Russian, and Czech. Parent retrieval means determining the lexeme or lexemes on which the input lexeme is based (e.g. 'darkness' is traced back to 'dark'; 'waterfall' decomposes into 'water' and 'fall'). Word-formation classification labels the input lexeme as a compound (e.g. 'proofread'), a derivative (e.g. 'deescalate'), or an unmotivated word (e.g. 'dog'). The seven languages are drawn from three major branches of the Indo-European language family (Germanic, Romance, Slavic). Data for training and testing the tool is aggregated from a range of word-formation resources and from Wiktionary. The tool is a custom hybrid sequence-to-sequence neural network enriched with transformer blocks; it uses both a character-based and a semantic representation of the input lexeme and has two output modules: a decoder-based module for parent retrieval and a classifier-based module for word-formation classification. PaReNT achieves a mean accuracy of 0.62 in parent retrieval and a mean balanced accuracy of 0.74 in word-formation classification.
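The abstract reports balanced accuracy without defining it. As a minimal sketch, assuming the common definition (the unweighted mean of per-class recall), the metric can be computed as below. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Unweighted mean of per-class recall over the classes present in `gold`.

    Assumption: this is the usual definition of balanced accuracy; the
    abstract does not state which variant the authors used.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    total = defaultdict(int)    # number of gold examples per class
    correct = defaultdict(int)  # number of correctly predicted examples per class
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1

    # Recall for each gold class, averaged with equal weight per class.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, made-up labels using the three classes named in the abstract.
gold = ["compound", "compound", "derivative", "unmotivated"]
predicted = ["compound", "derivative", "derivative", "unmotivated"]
score = balanced_accuracy(gold, predicted)
```

In this toy case the per-class recalls are 1/2, 1 and 1, so the balanced accuracy is their mean (about 0.83), while plain accuracy is 3/4. The two diverge because balanced accuracy weights each class equally regardless of how many examples it has.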
... Morphological variants: TermSuite implements Compost, a multilingual splitter (Loginova Clouet and Daille, 2014) that decides whether a term consisting of one graphic unit is an SWT or a compound and, for compounds, gives one or several candidate analyses ranked by their scores. We only keep the best split. ...
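The "keep the best split" step described in this excerpt can be sketched as follows. This is an illustration only, not the TermSuite or Compost API: the `CandidateSplit` type, the `best_split` function and the scores are hypothetical, and it assumes that a higher score means a better analysis.

```python
from typing import NamedTuple


class CandidateSplit(NamedTuple):
    """One candidate analysis of a term (hypothetical structure)."""
    components: tuple[str, ...]
    score: float


def best_split(term: str, candidates: list[CandidateSplit]) -> tuple[str, ...]:
    """Return the components of the highest-scoring candidate.

    If the splitter proposed no candidates, the term is treated as a
    single unit and returned unchanged.
    Assumption: higher score means a better analysis.
    """
    if not candidates:
        return (term,)
    return max(candidates, key=lambda c: c.score).components


# Made-up candidates and scores for illustration.
candidates = [
    CandidateSplit(("wat", "erfall"), 0.1),
    CandidateSplit(("water", "fall"), 0.9),
]
result = best_split("waterfall", candidates)
```

Here `result` holds the components of the top-ranked analysis, and a term with no candidate analyses comes back as a one-element tuple.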