January 2013
Computer Speech & Language
Sag et al. (2002) highlighted a problem that had long been underappreciated in natural language processing (NLP): how to deal with idiosyncratic multiword expressions (MWEs) such as idioms, quasi-idioms, clichés, quasi-clichés, institutionalized phrases, proverbs, and old sayings. Since then, many attempts have been made to extract these expressions from corpora and to construct lexicons of them. However, no extensive, reliable solution has yet been realized. This paper presents an overview of a comprehensive lexicon of Japanese multiword expressions (Japanese MWE Lexicon: JMWEL), which has been compiled to support linguistically precise, wide-coverage Japanese natural language processing systems. The JMWEL is characterized by significant notational, syntactic, and semantic diversity, as well as by a detailed description of the syntactic functions, structures, and flexibilities of MWEs. The lexicon contains about 111,000 header entries written in kana (phonetic characters) and almost 820,000 variants of them written in kana and kanji (ideographic characters). The paper demonstrates the JMWEL's validity mainly by comparing the lexicon with a large-scale Japanese N-gram frequency dataset, LDC2009T08, generated by Google Inc. (Kudo and Kazawa, 2009). The present work is an attempt to provide a tentative answer for Japanese, from outside statistical empiricism, to the question posed by Church (2011): “How many multiword expressions do people know?”