ABSTRACT: Gene/protein recognition and normalization is an important preliminary step for many biological text mining tasks. In this paper, we present a multistage gene normalization system consisting of four major subtasks: pre-processing, dictionary matching, ambiguity resolution and filtering. For the first subtask, we apply the gene mention tagger developed in our earlier work, which achieves an F-score of 88.42% on the BioCreative II GM test set. In the dictionary matching stage, exact and approximate matching between gene names and the EntrezGene lexicon are combined. For the ambiguity resolution subtask, we propose a semantic similarity disambiguation method based on Munkres' Assignment Algorithm. In the final step, a Wikipedia-based filter removes false positives. Experimental results show that the presented system achieves an F-score of 90.1%, outperforming most state-of-the-art systems.
PLoS ONE 01/2013; 8(12):e81956. · 3.73 Impact Factor
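The dictionary matching stage described in the abstract above combines exact and approximate lookups. The following is a minimal sketch of that idea, not the authors' implementation: the toy lexicon, its identifiers, the `normalize` and `match_mention` helpers, and the 0.8 similarity cutoff are all hypothetical, and Python's standard `difflib` similarity ratio stands in for whatever approximate matching measure the paper actually uses.

```python
import difflib

# Hypothetical toy lexicon: normalized gene name -> placeholder identifier.
# These are NOT real EntrezGene records.
LEXICON = {
    "brca1": "ID_A",
    "tumor protein p53": "ID_B",
    "insulin": "ID_C",
}


def normalize(name):
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def match_mention(mention, lexicon=LEXICON, cutoff=0.8):
    """Try an exact lookup first, then fall back to approximate matching.

    Returns (identifier, match_type); identifier is None if nothing
    in the lexicon is similar enough.
    """
    key = normalize(mention)
    if key in lexicon:
        return lexicon[key], "exact"
    # Approximate step: best lexicon entry whose similarity ratio >= cutoff.
    close = difflib.get_close_matches(key, list(lexicon), n=1, cutoff=cutoff)
    if close:
        return lexicon[close[0]], "approximate"
    return None, "unmatched"
```

Trying the exact lookup first keeps unambiguous names cheap to resolve and confines the fuzzier, more error-prone step to mentions the lexicon does not contain verbatim. Any candidates produced this way would still need the later disambiguation and filtering stages.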
ABSTRACT: Based on the characteristics of transliterated names in Chinese texts, a method for automatically recognizing Chinese transliterated names that combines support vector machines (SVMs) with rules is proposed. Character-based attributes are extracted to form the feature vectors. A training set is established, and machine learning models for automatic identification of transliterated names are obtained by testing polynomial kernel functions. However, a machine learning model alone cannot acquire all of the relevant knowledge, which lowers recall. Through careful error analysis, a base of recognition rules is constructed as a post-processing step to compensate for this shortcoming. The results show that the method is efficient for identifying transliterated names in Chinese texts.
Neural Networks and Brain, 2005. ICNN&B '05. International Conference on; 11/2005
ABSTRACT: This paper presents a method for resolving Chinese syntactic category ambiguity with support vector machines (SVMs). The features of each word vector are the word itself, its candidate part-of-speech (POS) tags, the pairs of candidate POS tags with their probabilities, and context information. A training set is established, and machine learning models for disambiguation based on support vector machines are obtained using polynomial kernel functions. The test results show that this method is efficient. The paper also gives results obtained with neural networks for comparison.
Advances in Neural Networks - ISNN 2005, Second International Symposium on Neural Networks, Chongqing, China, May 30 - June 1, 2005, Proceedings, Part II; 01/2005
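Both SVM abstracts above rely on polynomial kernel functions. As a hedged illustration only, the standard polynomial kernel has the form K(x, y) = (gamma * <x, y> + coef0)^degree. The sketch below computes it in plain Python; the parameter names and default values are assumptions, since the abstracts do not state which degree or coefficients were used.

```python
def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    """Standard polynomial kernel: (gamma * <x, y> + coef0) ** degree.

    x and y are equal-length sequences of numbers. The defaults are
    illustrative assumptions, not values taken from the papers.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree
```

For example, with the defaults, `polynomial_kernel([1, 2], [3, 4])` evaluates (1*3 + 2*4 + 1)^2 = 144. A higher degree lets the SVM separate classes using interactions between features, such as combinations of character or POS-tag indicators, without constructing those combined features explicitly.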