ABSTRACT: Community-based question answering (cQA) services provide a convenient way for online users to share and exchange information and knowledge, which is highly valuable for information seeking. User interest and dedication motivate the interactive process of question answering. In this paper, we address a key issue in cQA systems: routing newly asked questions to appropriate users who may provide high-quality answers. We incorporate answer quality and answer content to build a probabilistic question routing model. Our proposed model is capable of 1) differentiating and quantifying the authority of users for different topics or categories; 2) routing questions to users with the relevant expertise. Experimental results on a large collection of data from Wenwen demonstrate that our model is effective and achieves promising performance.
ABSTRACT: Patenting is one of the most important ways to protect a company's core business concepts and proprietary technologies. Analyzing large volumes of patent data can uncover potential competitive or collaborative relations among companies in certain areas, providing valuable information for developing intellectual property (IP), R&D, and marketing strategies. In this paper, we present a novel topic-driven patent analysis and mining system. Instead of merely searching over patent content, we focus on the heterogeneous patent network derived from the patent database, in which several types of objects (companies, inventors, and technical content) jointly evolve over time. We design and implement a general topic-driven framework for analyzing and mining this heterogeneous patent network. Specifically, we propose a dynamic probabilistic model to characterize the topical evolution of these objects within the patent network. Based on this modeling framework, we derive several patent analytics tools that can be used directly for IP and R&D strategy planning, including a heterogeneous network co-ranking method, a topic-level competitor evolution analysis algorithm, and a method for summarizing search results. We evaluate the proposed methods on a real-world patent database. The experimental results show that the proposed techniques clearly outperform the corresponding baseline methods.
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining; 01/2012
ABSTRACT: Due to the high cost of manual curation of key aspects from the scientific literature, automated methods for assisting this process are greatly desired. Here, we report a novel approach to facilitate MeSH indexing, a challenging task of assigning MeSH terms to MEDLINE citations for their archiving and retrieval.
Unlike previous methods for automatic MeSH term assignment, we reformulate the indexing task as a ranking problem in which relevant MeSH headings are ranked higher than irrelevant ones. Specifically, for each document we retrieve 20 neighbor documents, obtain a list of MeSH main headings from those neighbors, and rank the headings using ListNet, a learning-to-rank algorithm. We trained our algorithm on 200 documents and tested it on a previously used benchmark set of 200 documents and a larger dataset of 1000 documents.
Tested on the benchmark dataset, our method achieved a precision of 0.390, recall of 0.712, and mean average precision (MAP) of 0.626. In comparison to the state of the art, we observe statistically significant improvements as large as 39% in MAP (p-value <0.001). Similar significant improvements were also obtained on the larger document set.
Experimental results show that our approach makes the most accurate MeSH predictions to date, suggesting its potential to make a practical impact on MeSH indexing. Furthermore, as discussed, the proposed learning framework is robust and can be adapted to many other similar tasks beyond MeSH indexing in the biomedical domain. All data sets are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/indexing.
Journal of the American Medical Informatics Association 05/2011; 18(5):660-7. · 3.57 Impact Factor
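The neighbor-based candidate generation described in the abstract above can be sketched with a simple similarity-weighted vote over the headings of retrieved neighbor documents. This is only an illustrative stand-in for the paper's ListNet ranker: the `rank_candidate_headings` helper, its `(similarity, headings)` input format, and the toy MeSH headings are all assumptions for this sketch.

```python
from collections import defaultdict

def rank_candidate_headings(neighbors):
    """Aggregate MeSH headings from retrieved neighbor documents and
    rank them by similarity-weighted vote.  `neighbors` is a list of
    (similarity, headings) pairs for the target document's k nearest
    neighbors."""
    scores = defaultdict(float)
    for sim, headings in neighbors:
        for h in headings:
            scores[h] += sim  # each neighbor votes with its similarity
    # highest-scoring candidate headings first
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical neighbors of one target citation
neighbors = [
    (0.9, ["Humans", "Neoplasms"]),
    (0.7, ["Humans", "Mice"]),
    (0.4, ["Neoplasms"]),
]
print(rank_candidate_headings(neighbors))  # → ['Humans', 'Neoplasms', 'Mice']
```

In the actual system, a learned ranking function such as ListNet would replace the raw vote, using richer features than neighbor similarity alone.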
ABSTRACT: MOTIVATION: Linking gene mentions in an article to entries of biological databases can greatly facilitate the indexing and querying of biological literature. Due to the high ambiguity of gene names, this task is particularly challenging. Manual annotation for this task is costly, time-consuming, and labor-intensive, so providing assistive tools to facilitate it is of high value. RESULTS: We developed GeneTUKit, a document-level gene normalization software package for full-text articles. The software employs both the local context surrounding gene mentions and the global context of the whole full-text document, and it can normalize genes of different species simultaneously. When participating in BioCreAtIvE III, the system obtained good results among 37 runs: it was ranked first, fourth, and seventh in terms of TAP-20, TAP-10, and TAP-5, respectively, on the 507 full-text test articles. AVAILABILITY AND IMPLEMENTATION: The software is available at http://www.qanswers.net/GeneTUKit/.
ABSTRACT: Although the goal of traditional text summarization is to generate summaries with diverse information, most such applications have no explicit definition of the information structure. It is therefore difficult to generate truly structure-aware summaries, because the information structure that should guide summarization is unclear. In this paper, we present a novel framework for generating guided summaries of product reviews. A guided summary has an explicitly defined structure derived from the important aspects of products. The proposed framework attempts to maximize expected aspect satisfaction during summary generation. The importance of an aspect to a generated summary is modeled using Labeled Latent Dirichlet Allocation. Experimental results on consumer reviews of cars show the effectiveness of our method.
Journal of Computer Science and Technology 01/2011; 26:676-684. · 0.48 Impact Factor
ABSTRACT: Though news readers can easily access a large number of news articles on the Internet, they can be overwhelmed by the quantity of information available, making it hard to get a concise, global picture of a news topic. In this paper we propose a novel method to address this problem. Given a set of articles on a news topic, the proposed method models theme variation through time and identifies breakpoints, i.e., time points at which decisive changes occur. For each breakpoint, a brief summary is automatically constructed from the articles associated with that time point. The summaries are then ordered chronologically to form a timeline overview of the news topic. In this fashion, readers can efficiently track various news topics. We conducted experiments on 15 popular topics from 2010; the results show the effectiveness of our approach and its advantages over other approaches.
11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011; 01/2011
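One simple way to realize the breakpoint idea from the abstract above is to flag time points whose vocabulary diverges sharply from the preceding time point. The cosine-over-term-counts criterion below is an assumption chosen for illustration, not the paper's actual theme-variation model, and the `find_breakpoints` helper and its threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-count dictionaries."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def find_breakpoints(time_slices, threshold=0.5):
    """Flag time points whose term distribution diverges sharply from
    the previous slice (cosine similarity below `threshold`)."""
    breaks = []
    for t in range(1, len(time_slices)):
        if cosine(time_slices[t - 1], time_slices[t]) < threshold:
            breaks.append(t)
    return breaks

# hypothetical daily term counts for one news topic
days = [
    {"spill": 5, "oil": 4},
    {"spill": 4, "oil": 5},
    {"cleanup": 6, "cost": 3},  # decisive theme change
]
print(find_breakpoints(days))  # → [2]
```

A summary would then be built from the articles at each flagged time point and the summaries ordered chronologically, as the abstract describes.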
ABSTRACT: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20). Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
ABSTRACT: A large number of protein-protein interactions (PPIs) lie buried in the massive body of biomedical articles published over the years, which has motivated the development of automatic PPI extraction methods. However, existing methods based on supervised machine learning still face two challenges: (1) the feature space exploited in these methods is very sparse; and (2) the training data are imbalanced with respect to the categories to be classified. In this paper, we first construct rich and compact features to alleviate feature sparseness. With these features, our method outperforms baselines by up to 9.58% in F-score on the original AIMed corpus. Furthermore, we propose an under-sampling strategy to address the class imbalance problem: to rebalance the data distribution, samples of the majority class are removed iteratively according to the prediction results. This yields a further 2.49% improvement in F-score on the original AIMed corpus.
4th International Conference on Biomedical Engineering and Informatics, BMEI 2011, Shanghai, China, October 15-17, 2011; 01/2011
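The iterative under-sampling idea in the abstract above can be sketched as repeatedly dropping the majority-class examples a model already handles most confidently, so later rounds focus on the harder ones. The `score` callback below is a hypothetical stand-in for the paper's classifier predictions, and the per-round removal budget is an assumption of this sketch.

```python
def iterative_undersample(majority, minority, score, rounds=3):
    """Rebalance a two-class training set by iteratively removing
    majority-class examples.  `score(x)` is assumed to return a model's
    confidence that x belongs to the majority class; each round drops
    the most confidently classified (i.e. easiest) majority examples."""
    kept = list(majority)
    per_round = max(1, (len(majority) - len(minority)) // rounds)
    for _ in range(rounds):
        if len(kept) <= len(minority):
            break  # already balanced
        kept.sort(key=score)   # hardest examples first, easiest last
        kept = kept[:-per_round]
    return kept + list(minority)

# toy data: integers stand in for feature vectors, score = identity
balanced = iterative_undersample(list(range(10)), [100, 101],
                                 score=lambda x: x)
print(balanced)  # → [0, 1, 2, 3, 100, 101]
```

In practice the scorer would be re-trained between rounds on the shrinking training set, which is what makes the removal iterative rather than a one-shot cut.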
ABSTRACT: Linear-chain Conditional Random Fields (CRFs) have been applied to the Named Entity Recognition (NER) task in many biomedical text mining and information extraction systems. However, a linear-chain CRF cannot capture long-distance dependencies, which are very common in the biomedical literature. In this paper, we propose to capture such long-distance dependencies by defining two principles for constructing skip-edges in a skip-chain CRF: linking similar words and linking words connected by typed dependencies. The approach is applied to recognizing gene/protein mentions in the literature. When tested on the BioCreAtIvE II Gene Mention dataset and the GENIA corpus, the approach yields significant improvements over the linear-chain CRF. We also present an in-depth error analysis of inconsistent labeling and study the influence of skip-edge quality on labeling performance.
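The two edge-construction principles above can be illustrated as follows. The word-similarity measure (exact case-insensitive match), the `min_len` distance cutoff, and the `(head, dependent)` index encoding of typed dependencies are all assumptions for this sketch, not the paper's exact definitions.

```python
def build_skip_edges(tokens, dependencies, min_len=3):
    """Construct skip-edges for a skip-chain CRF under two principles:
    (1) link repeated/similar word forms, and (2) link word pairs
    connected by a typed syntactic dependency.  Only long-range pairs
    (distance >= min_len) become skip-edges, since nearby words are
    already connected by the linear chain."""
    edges = set()
    # principle 1: identical (case-insensitive) word forms
    lowered = [t.lower() for t in tokens]
    for i in range(len(tokens)):
        for j in range(i + min_len, len(tokens)):
            if lowered[i] == lowered[j]:
                edges.add((i, j))
    # principle 2: typed dependencies spanning at least min_len tokens
    for head, dep in dependencies:
        i, j = sorted((head, dep))
        if j - i >= min_len:
            edges.add((i, j))
    return sorted(edges)

# hypothetical sentence with a repeated gene mention
tokens = ["IL-2", "binds", "the", "receptor", "and",
          "IL-2", "activates", "it"]
deps = [(1, 3), (1, 6)]  # (head_index, dependent_index) pairs
print(build_skip_edges(tokens, deps))  # → [(0, 5), (1, 6)]
```

Each emitted pair would become an extra factor in the skip-chain CRF's graph, letting the labels of the two linked words constrain one another across the intervening text.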
ABSTRACT: The TAC 2010 Guided Summarization task requires participants to generate coherent summaries with the guidance of predefined categories and aspects. In this paper, we present our two extractive summarization systems. In the first system, we employ a topic model, Labeled LDA, to model the aspects. The correspondence between the aspects and the topics in Labeled LDA is established by identifying indicative words for each aspect. After training and inference of Labeled LDA, we obtain the salience scores of concepts (named entities and bigrams) from the topic-concept distributions. Then we use an Integer Linear Programming (ILP) based maximal coverage method to generate summaries. In the second system, which also uses ILP and maximal coverage during sentence extraction, the salience of concepts is obtained using a pairwise learning-to-rank algorithm, RankNet; the training samples are constructed from the human-annotated Pyramid data.
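The maximal-coverage step described above can be approximated greedily. The sketch below substitutes a greedy heuristic for the exact ILP used in the paper and assumes a simplified `(length, concept_set)` sentence representation; the function name and example data are hypothetical.

```python
def greedy_coverage_summary(sentences, weights, budget):
    """Greedy stand-in for the ILP maximal-coverage objective: repeatedly
    pick the sentence adding the most not-yet-covered concept weight,
    until no remaining sentence fits the length budget or adds weight.
    `sentences` is a list of (length, concept_set) pairs; `weights`
    maps each concept to its salience score."""
    covered, chosen, used = set(), [], 0
    remaining = set(range(len(sentences)))
    while remaining:
        fits = [i for i in remaining if used + sentences[i][0] <= budget]
        if not fits:
            break
        def gain(i):  # salience of concepts this sentence would add
            return sum(weights[c] for c in sentences[i][1] - covered)
        best = max(fits, key=gain)
        if gain(best) <= 0:
            break
        chosen.append(best)
        used += sentences[best][0]
        covered |= sentences[best][1]
        remaining.discard(best)
    return chosen

# toy concepts and sentences: (length_in_words, concepts_mentioned)
weights = {"a": 3.0, "b": 2.0, "c": 1.0}
sentences = [(5, {"a", "b"}), (4, {"a"}), (3, {"c"})]
print(greedy_coverage_summary(sentences, weights, budget=8))  # → [0, 2]
```

The ILP formulation solves this selection exactly; the greedy variant shown here is the standard (1 - 1/e)-approximation for such coverage objectives and conveys the same intuition of rewarding each concept only once.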
ABSTRACT: In this paper, we focus on object-feature-based review summarization. Different from most previous work based on linguistic rules or statistical methods, we formulate the review mining task as a joint structure tagging problem. We propose a new machine learning framework based on Conditional Random Fields (CRFs) that can employ rich features to jointly extract positive opinions, negative opinions, and object features from review sentences. The linguistic structure can be naturally integrated into the model representation. Besides the linear-chain structure, we also investigate conjunction and syntactic tree structures in this framework. Through extensive experiments on movie review and product review data sets, we show that structure-aware models outperform many state-of-the-art approaches to review mining.
COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China; 01/2010
ABSTRACT: As online community question answering (cQA) portals like Yahoo! Answers and Baidu Zhidao have accumulated hundreds of millions of questions, how to utilize these questions and their answers has become increasingly important for cQA websites. Prior approaches focus on using information retrieval techniques to provide a ranked list of questions based on their similarity to the query. Due to the high variance of question and answer quality, users have to spend a lot of time finding the truly best answers among the retrieved results. In this paper, we develop an answer retrieval and summarization system that directly provides an accurate and comprehensive answer summary, instead of a list of similar questions, in response to a user's query. To fully explore the relations between queries and questions, between questions and answers, and between answers and sentences, we propose a new probabilistic scoring model to distinguish high-quality answers from low-quality ones. By fully exploiting these relations, we summarize the answers using a maximum coverage model. Experimental results on data extracted from Chinese cQA websites demonstrate the efficacy of our proposed method.