ABSTRACT: Knowledge graph embedding refers to projecting entities and relations in
a knowledge graph into continuous vector spaces. State-of-the-art methods, such
as TransE, TransH, and TransR, build embeddings by treating a relation as a
translation from head entity to tail entity. However, previous models cannot
deal with reflexive/one-to-many/many-to-one/many-to-many relations properly, or
lack scalability and efficiency. Thus, we propose a novel method, flexible
translation, named TransF, to address the above issues. TransF regards a relation
as a translation between head entity vector and tail entity vector with flexible
magnitude. To evaluate the proposed model, we conduct link prediction and
triple classification on benchmark datasets. Experimental results show that our
method remarkably improves the performance compared with several
ABSTRACT: Prior knowledge has been shown to be very useful for addressing many natural language
processing tasks. Many approaches have been proposed to formalise a variety of
knowledge; however, whether the proposed approach is robust or sensitive to the
knowledge supplied to the model has rarely been discussed. In this paper, we
propose three regularization terms on top of generalized expectation criteria,
and conduct extensive experiments to justify the robustness of the proposed
methods. Experimental results demonstrate that our proposed methods obtain
remarkable improvements and are much more robust than baselines.
ABSTRACT: In sentiment analysis, aspect-level review analysis has been an important task because it can catalogue, aggregate, or summarize various opinions according to a product's properties. In this paper, we explore a new concept for aspect-level review analysis, latent sentiment explanations, which are defined as a set of informative aspect-specific sentences whose polarities are consistent with that of the review. In other words, sentiment explanations best represent a review in terms of both aspect and polarity. We formulate the problem as a structure learning problem, and sentiment explanations are modeled with latent variables. Training samples are automatically identified through a set of pre-defined aspect signature terms (i.e., without manual annotation of samples), an approach we term weakly supervised. Our major contributions are twofold: first, we formalize the use of aspect signature terms as weak supervision in a structural learning framework, which remarkably promotes aspect-level analysis; second, the performance of aspect analysis and that of document-level sentiment classification are mutually enhanced through joint modeling. The proposed method is evaluated on restaurant and hotel reviews, and experimental results demonstrate promising performance in both document-level and aspect-level sentiment analysis.
Proceedings of the 22nd ACM international conference on information & knowledge management; 10/2013
ABSTRACT: Patents are critically important for a company to protect its core business concepts and proprietary technologies. Effective patent mining in massive patent databases not only provides business enterprises with valuable insights to develop strategies for research and development, intellectual property management, and product marketing, but also helps patent offices to improve efficiency and optimize their patent examination processes. This paper describes the patent mining problem of automatically discovering core patents (i.e., novel and influential patents in a domain). In addition, the value of core patent mining is illustrated by revealing the potential competitive relationships among companies in their core patents. The work addresses the unique patent vocabulary usage, which is not considered in traditional word-based statistical methods, with a topic-based temporal mining approach that quantifies a patent's novelty and influence through topic activeness variations. Tests of this method on real-world patent portfolios show the effectiveness of this approach over state-of-the-art methods.
ABSTRACT: In this paper, we present a structural learning model for joint sentiment classification and aspect analysis of text at various levels of granularity. Our model aims to identify highly informative sentences that are aspect-specific in online customer reviews. The primary advantages of our model are twofold: first, it performs document-level and sentence-level sentiment polarity classification jointly; second, it is able to find informative sentences that are closely related to some aspects in a review, which may be helpful for aspect-level sentiment analysis such as aspect-oriented summarization. The proposed method was evaluated with 9,000 Chinese restaurant reviews. Preliminary experiments demonstrate that our model obtains promising performance.
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2; 07/2012
ABSTRACT: Most participatory web sites collect overall ratings (e.g., five stars) of products from their customers, reflecting the overall assessment of the products. However, it is more useful to present ratings of product features (such as price, battery, screen, and lens of digital cameras) to help customers make effective purchase decisions. Unfortunately, only very few web sites have collected feature ratings. In this paper, we propose a novel approach to accurately estimate feature ratings of products. This approach selects user reviews that extensively discuss specific features of the products (called specialized reviews), using information distance of reviews on the features. Experiments on both annotated and real data show that overall ratings of the specialized reviews can be used to represent their feature ratings. The average of these overall ratings can be used by recommender systems to provide feature-specific recommendations that can better help users make purchasing decisions.
Knowledge and Information Systems 01/2012; DOI:10.1007/s10115-012-0495-8 · 2.64 Impact Factor
ABSTRACT: Community-based question answering (cQA) services provide a convenient way for online users to share and exchange information and knowledge, which is highly valuable for information seeking. User interest and dedication act as the motivation to promote the interactive process of question answering. In this paper, we aim to address a key issue in cQA systems: routing newly asked questions to appropriate users who may potentially provide high-quality answers. We incorporate answer quality and answer content to build a probabilistic question routing model. Our proposed model is capable of 1) differentiating and quantifying the authority of users for different topics or categories; 2) routing questions to users with expertise. Experimental results based on a large collection of data from Wenwen demonstrate that our model is effective and has promising performance.
ABSTRACT: Patents are critical for a company to protect its core technologies. Effective patent mining in massive patent databases can provide companies with valuable insights to develop strategies for IP management and marketing. In this paper, we study a novel patent mining problem of automatically discovering core patents (i.e., patents with high novelty and influence in a domain). We address the unique patent vocabulary usage problem, which is not considered in traditional word-based statistical methods, and propose a topic-based temporal mining approach to quantify a patent's novelty and influence. Comprehensive experimental results on real-world patent portfolios show the effectiveness of our method.
Proceedings of the 21st ACM international conference on Information and knowledge management; 01/2012
ABSTRACT: Patenting is one of the most important ways to protect a company's core business concepts and proprietary technologies. Analyzing large volumes of patent data can uncover the potential competitive or collaborative relations among companies in certain areas, which can provide valuable information to develop strategies for intellectual property (IP), R&D, and marketing. In this paper, we present a novel topic-driven patent analysis and mining system. Instead of merely searching over patent content, we focus on studying the heterogeneous patent network derived from the patent database, which is represented by several types of objects (companies, inventors, and technical content) jointly evolving over time. We design and implement a general topic-driven framework for analyzing and mining the heterogeneous patent network. Specifically, we propose a dynamic probabilistic model to characterize the topical evolution of these objects within the patent network. Based on this modeling framework, we derive several patent analytics tools that can be directly used for IP and R&D strategy planning, including a heterogeneous network co-ranking method, a topic-level competitor evolution analysis algorithm, and a method to summarize the search results. We evaluate the proposed methods on a real-world patent database. The experimental results show that the proposed techniques clearly outperform the corresponding baseline methods.
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining; 01/2012
ABSTRACT: An enormous number of gene-disease associations (GDA) are buried in millions of research articles published over the years, and the number is growing. Extracting them automatically is a challenging bioinformatics task. Although previous works have shown that supervised learning methods are superior for this task, the performance still relies on manually labeled training data. In this paper, we propose a solution to learn from plenty of labeled protein-protein interaction (PPI) data, and utilize the learned knowledge to help the extraction of GDA. In particular, a support vector machine modified for corpus weighting (SVM-CW) was applied to weight labeled PPI data, in order to allow knowledge to be effectively transferred from the PPI domain data to the GDA domain. The experimental results show that our solution can make full use of labeled PPI data and improve the performance of GDA extraction.
ABSTRACT: Although the goal of traditional text summarization is to generate summaries with diverse information, most of those applications
have no explicit definition of the information structure. Thus, it is difficult to generate truly structure-aware summaries
because the information structure to guide summarization is unclear. In this paper, we present a novel framework to generate
guided summaries for product reviews. The guided summary has an explicitly defined structure which comes from the important
aspects of products. The proposed framework attempts to maximize expected aspect satisfaction during summary generation. The
importance of an aspect to a generated summary is modeled using Labeled Latent Dirichlet Allocation. Empirical experimental
results on consumer reviews of cars show the effectiveness of our method.
Journal of Computer Science and Technology 07/2011; 26(4):676-684. DOI:10.1007/s11390-011-1167-y · 0.64 Impact Factor
ABSTRACT: Due to the high cost of manual curation of key aspects from the scientific literature, automated methods for assisting this process are greatly desired. Here, we report a novel approach to facilitate MeSH indexing, a challenging task of assigning MeSH terms to MEDLINE citations for their archiving and retrieval.
Unlike previous methods for automatic MeSH term assignment, we reformulate the indexing task as a ranking problem such that relevant MeSH headings are ranked higher than irrelevant ones. Specifically, for each document we retrieve 20 neighbor documents, obtain a list of MeSH main headings from the neighbors, and rank the MeSH main headings using ListNet, a learning-to-rank algorithm. We trained our algorithm on 200 documents and tested it on a previously used benchmark set of 200 documents and a larger dataset of 1000 documents.
Tested on the benchmark dataset, our method achieved a precision of 0.390, recall of 0.712, and mean average precision (MAP) of 0.626. In comparison to the state of the art, we observe statistically significant improvements as large as 39% in MAP (p-value <0.001). Similar significant improvements were also obtained on the larger document set.
Experimental results show that our approach makes the most accurate MeSH predictions to date, which suggests its great potential for making a practical impact on MeSH indexing. Furthermore, as discussed, the proposed learning framework is robust and can be adapted to many other similar tasks beyond MeSH indexing in the biomedical domain. All data sets are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/indexing.
Journal of the American Medical Informatics Association 05/2011; 18(5):660-7. DOI:10.1136/amiajnl-2010-000055 · 3.93 Impact Factor
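The retrieve-then-rank idea in the MeSH indexing abstract above (collect candidate headings from neighbor documents, then rank them) can be illustrated with a minimal sketch. Note the substitution: the abstract ranks candidates with ListNet, whereas this sketch uses a plain similarity-weighted vote as a stand-in. The function name `rank_candidate_headings`, the similarity values, and the toy headings are all hypothetical.

```python
from collections import defaultdict


def rank_candidate_headings(neighbors, top_n=5):
    """Rank headings gathered from neighbor documents.

    neighbors: list of (similarity, [headings]) pairs, one per retrieved
    neighbor document. A similarity-weighted vote stands in for the
    learned ListNet ranker mentioned in the abstract (assumption).
    """
    scores = defaultdict(float)
    for similarity, headings in neighbors:
        for heading in set(headings):  # count each heading once per neighbor
            scores[heading] += similarity
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [heading for heading, _ in ranked[:top_n]]


# Hypothetical toy input: three neighbors with made-up similarities.
neighbors = [
    (0.9, ["Humans", "Neoplasms", "Mutation"]),
    (0.7, ["Humans", "Neoplasms"]),
    (0.4, ["Humans", "Mice"]),
]
ranked = rank_candidate_headings(neighbors)
print(ranked)
```

In the approach the abstract describes, the vote would be replaced by scores from the trained learning-to-rank model, with the candidate-collection step unchanged.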
ABSTRACT: MOTIVATION: Linking gene mentions in an article to entries of biological databases can greatly facilitate indexing and querying biological literature. Due to the high ambiguity of gene names, this task is particularly challenging. Manual annotation for this task is costly, time consuming and labor intensive. Therefore, providing assistive tools to facilitate the task is of high value. RESULTS: We developed GeneTUKit, a document-level gene normalization software for full-text articles. This software employs both local context surrounding gene mentions and global context from the whole full-text document. It can normalize genes of different species simultaneously. When participating in BioCreAtIvE III, the system obtained good results among 37 runs: the system was ranked first, fourth and seventh in terms of TAP-20, TAP-10 and TAP-5, respectively, on the 507 full-text test articles. Availability and implementation: The software is available at http://www.qanswers.net/GeneTUKit/.
ABSTRACT: A large number of protein-protein interactions (PPIs) are buried in the massive body of biomedical articles published over the years. This has led to the development of automatic PPI extraction methods. However, existing methods based on supervised machine learning still face some challenges: (1) the feature space exploited in these methods is very sparse; and (2) the data used for training are imbalanced with respect to the categories to be classified. In this paper, we first construct rich and compact features to alleviate the issue of feature sparseness. With these features, our method outperforms baselines by up to 9.58% in F-score on the original AIMed corpus. Furthermore, we propose a data sampling strategy based on under-sampling to address the class imbalance problem. In order to re-balance the data distribution, samples of the majority class are removed iteratively according to the prediction results. By this means, our method achieves a further 2.49% improvement in F-score on the original AIMed corpus.
4th International Conference on Biomedical Engineering and Informatics, BMEI 2011, Shanghai, China, October 15-17, 2011; 01/2011
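The iterative under-sampling described in the PPI extraction abstract above can be pictured with a small sketch. The abstract does not say which majority-class samples are removed, so the criterion here (drop the majority samples a simple model classifies most confidently, keeping those near the class boundary) is an assumption, as are the one-dimensional toy data, the nearest-mean "model", and the name `undersample`.

```python
def mean(values):
    return sum(values) / len(values)


def undersample(majority, minority, batch=2):
    """Iteratively shrink the majority class until both classes are the same size.

    Assumption (not from the abstract): in each round a nearest-mean model is
    refit, and the majority samples it classifies most confidently are dropped.
    """
    majority = list(majority)
    while len(majority) > len(minority):
        mu_major, mu_minor = mean(majority), mean(minority)

        def margin(x):
            # Larger margin = closer to the majority mean than to the minority mean.
            return abs(x - mu_minor) - abs(x - mu_major)

        majority.sort(key=margin)  # least confident samples first
        n_drop = min(batch, len(majority) - len(minority))
        majority = majority[: len(majority) - n_drop]  # drop the most confident
    return majority


# Hypothetical 1-D toy data: eight majority samples, four minority samples.
majority = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 1.0]
minority = [1.1, 1.2, 1.3, 1.4]
kept = undersample(majority, minority)
print(sorted(kept))
```

Other removal criteria (for example, dropping misclassified majority samples) would fit the same loop by changing only the sort key.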
ABSTRACT: Though news readers can easily access a large number of news articles from the Internet, they can be overwhelmed by the quantity of information available, making it hard to get a concise, global picture of a news topic. In this paper we propose a novel method to address this problem. Given a set of articles for a given news topic, the proposed method models theme variation through time and identifies the breakpoints, which are time points when decisive changes occur. For each breakpoint, a brief summary is automatically constructed based on articles associated with the particular time point. Summaries are then ordered chronologically to form a timeline overview of the news topic. In this fashion, readers can easily track various news topics efficiently. We have conducted experiments on 15 popular topics in 2010. Empirical experiments show the effectiveness of our approach and its advantages over other approaches.
11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011; 01/2011
ABSTRACT: In the past few years, sentiment analysis and opinion mining have become popular and important tasks. These studies all assume that their opinion resources are real and trustworthy. However, they may encounter the fake opinion or opinion spam problem. In this paper, we study this issue in the context of our product review mining system. On product review sites, people may write fake reviews, called review spam, to promote their products or defame their competitors' products. It is important to identify and filter out the review spam. Previous work focuses only on heuristic rules, such as helpfulness voting or rating deviation, which limits the performance of this task. In this paper, we exploit machine learning methods to identify review spam. To this end, we manually build a spam collection from our crawled reviews. We first analyze the effect of various features in spam identification. We also observe that a review spammer consistently writes spam. This provides us with another view for identifying review spam: we can identify whether the author of the review is a spammer. Based on this observation, we provide a two-view semi-supervised method, co-training, to exploit the large amount of unlabeled data. The experimental results show that our proposed method is effective. Our designed machine learning methods achieve significant improvements in comparison to the heuristic baselines.
IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011; 01/2011
ABSTRACT: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
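The improvement percentages quoted in the Gene Normalization abstract above follow directly from the reported TAP-k scores. A minimal check, assuming the percentages are relative gains over the best single-team scores; the variable and function names are illustrative only.

```python
# TAP-k scores quoted in the abstract: best single team vs. best composite system
# on the gold-standard annotations, keyed by k.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
print(improvements)
```

The rounded values agree with the 12.4%, 21.8%, and 26.6% stated in the abstract.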
ABSTRACT: Linear-chain Conditional Random Fields (CRFs) have been applied to the Named Entity Recognition (NER) task in many biomedical text mining and information extraction systems. However, a linear-chain CRF cannot capture long-distance dependencies, which are very common in the biomedical literature. In this paper, we study how to capture such long-distance dependencies by defining two principles for constructing skip-edges in a skip-chain CRF: linking similar words and linking words having typed dependencies. The approach is applied to recognize gene/protein mentions in the literature. When tested on the BioCreAtIvE II Gene Mention dataset and the GENIA corpus, the approach yields significant improvements over the linear-chain CRF. We also present an in-depth error analysis of inconsistent labeling and study the influence of the quality of skip edges on the labeling performance.
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing; 07/2010