[Show abstract][Hide abstract] ABSTRACT: In pursuing machine understanding of human language, highly accurate syntactic analysis is a crucial step. In this work, we focus on dependency grammar, which models syntax by encoding transparent predicate-argument structures. Recent advances in dependency parsing have shown that employing higher-order subtree structures in graph-based parsers can substantially improve the parsing accuracy. However, the inefficiency of this approach increases with the order of the subtrees. This work explores a new reranking approach for dependency parsing that can utilize complex subtree representations by applying efficient subtree selection methods. We demonstrate the effectiveness of the approach in experiments conducted on the Penn Treebank and the Chinese Treebank. Our system achieves the best performance among known supervised systems evaluated on these datasets, improving the baseline accuracy from 91.88% to 93.42% for English, and from 87.39% to 89.25% for Chinese.
IEEE/ACM Transactions on Audio, Speech, and Language Processing 07/2014; 22(7):1208-1218. DOI:10.1109/TASLP.2014.2327295
[Show abstract][Hide abstract] ABSTRACT: The focus of recent studies on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words to characters. However, existing methods have not yet fully utilized the potentials of Chinese characters. In this paper, we investigate the usefulness of character-level part-of-speech in the task of Chinese morphological analysis. We propose the first tagset designed for the task of character-level POS tagging. We propose a method that performs character-level POS tagging jointly with word segmentation and word-level POS tagging. Through experiments, we demonstrate that by introducing character-level POS information, the performance of a baseline morphological analyzer can be significantly improved.
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 06/2014
[Show abstract][Hide abstract] ABSTRACT: This article proposes a new distortion model for phrase-based statistical machine translation. In decoding, a distortion model estimates the source word position to be translated next (subsequent position; SP) given the last translated source word position (current position; CP). We propose a distortion model that can simultaneously consider the word at the CP, the word at an SP candidate, the context of the CP and an SP candidate, relative word order among the SP candidates, and the words between the CP and an SP candidate. These considered elements are called rich context. Our model considers rich context by discriminating label sequences that specify spans from the CP to each SP candidate. It enables our model to learn the effect of relative word order among SP candidates as well as to learn the effect of distances from the training data. In contrast to the learning strategy of existing methods, our learning strategy is that the model learns preference relations among SP candidates in each sentence of the training data. This leaning strategy enables consideration of all of the rich context simultaneously. In our experiments, our model had higher BLUE and RIBES scores for Japanese-English, Chinese-English, and German-English translation compared to the lexical reordering models.
ACM Transactions on Asian Language Information Processing 02/2014; 13(1):2:1--2:21. DOI:10.1145/2537128
[Show abstract][Hide abstract] ABSTRACT: Many knowledge acquisition tasks are tightly dependent on fundamental analysis technologies, such as part of speech (POS) tagging and parsing. Dependency parsing, in particular, has been widely employed for the acquisition of knowledge related to predicate-argument structures. For such tasks, the dependency parsing performance can determine quality of acquired knowledge, regardless of target languages. Therefore, reducing dependency parsing errors and selecting high quality dependencies is of primary importance. In this study, we present a language-independent approach for automatically selecting high quality dependencies from automatic parses. By considering several aspects that affect the accuracy of dependency parsing, we created a set of features for supervised classification of reliable dependencies. Experimental results on seven languages show that our approach can effectively select high quality dependencies from dependency parses.
[Show abstract][Hide abstract] ABSTRACT: We propose a method for automatically acquiring knowledge about case alternations between the passive/causative and active voices. Our method leverages large lexical case frames obtained from a large Web corpus, and several alternation patterns. We then use the acquired knowledge to a case alternation task and show its usefulness.
[Show abstract][Hide abstract] ABSTRACT: This paper presents a simple but effective approach to unknown word processing in Japanese morphological analysis, which handles 1) unknown words that are derived from words in a pre-defined lexicon and 2) unknown onomatopoeias. Our approach leverages derivation rules and onomatopoeia patterns, and correctly recognizes certain types of unknown words. Experiments revealed that our approach recognized about 4,500 unknown words in 100,000 Web sentences with only roughly 80 harmful side effects and a 6% loss in speed.
[Show abstract][Hide abstract] ABSTRACT: The Chinese and Japanese languages share Chinese characters. Since the Chinese characters in Japanese originated from ancient China, many common Chinese characters exist between these two languages. Since Chinese characters contain significant semantic information and common Chinese characters share the same meaning in the two languages, they can be quite useful in Chinese-Japanese machine translation (MT). We therefore propose a method for creating a Chinese character mapping table for Japanese, traditional Chinese, and simplified Chinese, with the aim of constructing a complete resource of common Chinese characters. Furthermore, we point out two main problems in Chinese word segmentation for Chinese-Japanese MT, namely, unknown words and word segmentation granularity, and propose an approach exploiting common Chinese characters to solve these problems. We also propose a statistical method for detecting other semantically equivalent Chinese characters other than the common ones and a method for exploiting shared Chinese characters in phrase alignment. Results of the experiments carried out on a state-of-the-art phrase-based statistical MT system and an example-based MT system show that our proposed approaches can improve MT performance significantly, thereby verifying the effectiveness of shared Chinese characters for Chinese-Japanese MT.
ACM Transactions on Asian Language Information Processing 10/2013; 12(4). DOI:10.1145/2523057.2523059
[Show abstract][Hide abstract] ABSTRACT: This paper proposes new distortion models for phrase-based SMT. In decoding, a distortion model estimates the source word position to be translated next (NP) given the last translated source word position (CP). We propose a distortion model that can consider the word at the CP, a word at an NP candidate, and the context of the CP and the NP candidate simultaneously. Moreover, we propose a further improved model that considers richer context by discriminating label sequences that specify spans from the CP to NP candidates. It enables our model to learn the effect of relative word order among NP candidates as well as to learn the effect of distances from the training data. In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
[Show abstract][Hide abstract] ABSTRACT: In this paper, we propose the use of spans in addition to edges in noun compound analysis. A span is a sequence of words that can represent a noun compound. Compared with edges, spans have good properties in terms of semi-supervised parsing. They can be reliably extracted from a huge amount of unannotated text. In addition, while the combinations of edges such as sibling and grandparent interactions are, in general, difficult to handle in parsing, it is quite easy to utilize spans with arbitrary width. We show that spans can be incorporated straightforwardly into the standard chart-based parsing algorithm. We create a semi-supervised discriminative parser that combines edge and span features. Experiments show that span features improve accuracy and that further gain is obtained when they are combined with edge features.
[Show abstract][Hide abstract] ABSTRACT: This article proposes a predicate-argument structure based Textual Entailment Recognition system exploiting wide-coverage lexical knowledge. Different from conventional machine learning approaches where several features obtained from linguistic analysis and resources are utilized, our proposed method regards a predicate-argument structure as a basic unit, and performs the matching/alignment between a text and hypothesis. In matching between predicate-arguments, wide-coverage relations between words/phrases such as synonym and is-a are utilized, which are automatically acquired from a dictionary, Web corpus, and Wikipedia.
ACM Transactions on Asian Language Information Processing 12/2012; 11(4). DOI:10.1145/2382593.2382598
[Show abstract][Hide abstract] ABSTRACT: Sentence compression is important in a wide range of applications in natural language processing. Previous approaches of Japanese sentence compression can be divided into two groups. Word-based methods extract a subset of words from a sentence to shorten it, while bunsetsubased methods extract a subset of bunsetsu (where a bunsetsu is a text unit that consists of content words and following function words). Basically, bunsetsu-based methods perform better than word-based methods. However, bunsetsu-based methods have the disadvantage that they cannot drop unimportant words from each bunsetsu because they have to follow constraints under which each bunsetsu is treated as a unit. In this paper, we propose a novel compression method to overcome this disadvantage. Our method relaxes the constraints using Lagrangian relaxation and shortens each bunsetsu if it contains unimportant words. Experimental results show that our method effectively compresses a sentence while preserving its important information and grammaticality.
[Show abstract][Hide abstract] ABSTRACT: One of the main issues in a word alignment task is the difficulty of handling function words that do not have direct translations which we call unique function words. They are often aligned to some words in the other language incorrectly. This is prominent in language pairs with very different sentence structures. In this paper, we propose a novel approach for handling unique function words. The proposed model monolingually derives unique function words from bilingually generated treelet pairs. The monolingual derivation prevents incorrect alignments for unique function words. The derivation probabilities are estimated from a large monolingual corpus, which is much easier to acquire than a parallel corpus. Also, the proposed alignment model uses semantic-head dependency trees where dependency relations between words become similar in each language. Experimental results on an English-Japanese corpus show that the proposed model achieves better alignment and translation quality compared with the baseline models.
[Show abstract][Hide abstract] ABSTRACT: In this paper, we present a Web spam detection algorithm that relies on link analysis. The method consists of three steps: (1) decomposition of web graphs in densely connected sub graphs and calculation of the features for each sub graph, (2) use of SVM classifiers to identify sub graphs composed of Web spam, and (3) propagation of predictions over web graphs by a biased Page Rank algorithm to expand the scope of identification. We performed experiments on a public benchmark. An empirical study of the core structure of web graphs suggests that highly ranked non-spam hosts can be identified by viewing the coreness of the web graph elements.
Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011, Campus Scientifique de la Doua, Lyon, France, August 22-27, 2011; 08/2011