Incorporating rich background knowledge for gene named entity classification and recognition.

Department of Computer Science and Engineering, Dalian University of Technology, Dalian, PR China.
BMC Bioinformatics (impact factor: 2.75). 08/2009; 10:223. DOI: 10.1186/1471-2105-10-223
Source: PubMed

ABSTRACT

BACKGROUND: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, play a central part. However, features of this type tend to make the feature space extremely sparse. As a result, terms that are out-of-vocabulary (OOV) with respect to the training data are modeled poorly, because little information is available about them.

RESULTS: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher level features from the term frequency and co-occurrence information of highly indicative features in a huge amount of unlabeled data. We examine its performance in a named entity classification task, which is designed to remove non-gene entries from a large dictionary derived from online resources. The results show that the new features generated by FCG outperform lexical features by 5.97 points in F-score overall and by 10.85 points for OOV terms. Within this framework, each extension yields a significant improvement, and the sparse lexical features can be transformed into a representation that is both lower dimensional and more informative. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on the BioCreative 2 GM test set. We then combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, a gain of 4.46 over the CRF-based tagger alone, with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner.
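The abstract mentions a forward maximum match method over the refined dictionary but gives no implementation details. The following is only a minimal sketch of generic forward maximum matching, not the authors' system. The function name, the toy dictionary entries, the whitespace tokenization, the lowercasing, and the `max_len` window are all assumptions made for illustration.

```python
def forward_maximum_match(tokens, dictionary, max_len=5):
    """Return (start, end) token spans (end exclusive) of dictionary matches.

    Scans left to right; at each position the longest candidate phrase
    (up to max_len tokens) is tried first, then shorter ones.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Longest candidate first, shrinking until a dictionary hit is found.
        for j in range(min(len(tokens), i + max_len), i, -1):
            if " ".join(tokens[i:j]).lower() in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1                      # no entry starts here; move on
        else:
            spans.append((i, match_end))
            i = match_end               # resume after the matched phrase
    return spans


# Hypothetical toy dictionary (lowercased); the real one is far larger.
gene_dictionary = {"tumor necrosis factor", "tumor necrosis factor alpha", "p53"}

tokens = "Expression of tumor necrosis factor alpha and p53 was measured".split()
spans = forward_maximum_match(tokens, gene_dictionary)
mentions = [" ".join(tokens[s:e]) for s, e in spans]
print(mentions)
```

Because longer candidates are tried first, "tumor necrosis factor alpha" is preferred over the shorter entry "tumor necrosis factor" in this example. Matching of this kind depends entirely on the quality of the dictionary, which is presumably why the abstract stresses removing non-gene entries first.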

Keywords

BioCreative 2 GM test; biomedical literature; cause extreme sparseness; co-occurrence information; conditional random field; extension yields significant improvements; feature coupling generalization; general framework; great success; higher level features; indicative features; large dictionary; lexical features; named entity classification task; NER system; new features; non-gene entries; recognition system; refined dictionary; sparse lexical features