Minlie Huang

Tsinghua University, Beijing, China

Publications (51) · 43.82 Total impact

  • Lei Fang, Minlie Huang, Xiaoyan Zhu
    ABSTRACT: In sentiment analysis, aspect-level review analysis has been an important task because it can catalogue, aggregate, or summarize various opinions according to a product's properties. In this paper, we explore a new concept for aspect-level review analysis, latent sentiment explanations, which are defined as a set of informative aspect-specific sentences whose polarities are consistent with that of the review. In other words, sentiment explanations best represent a review in terms of both aspect and polarity. We formulate the problem as a structure learning problem in which sentiment explanations are modeled with latent variables. Training samples are identified automatically through a set of pre-defined aspect signature terms (i.e., without manual annotation of samples), which is why we term the approach weakly supervised. Our major contributions are two-fold: first, we formalize the use of aspect signature terms as weak supervision in a structural learning framework, which remarkably promotes aspect-level analysis; second, aspect analysis and document-level sentiment classification mutually enhance each other through joint modeling. The proposed method is evaluated on restaurant and hotel reviews, and experimental results demonstrate promising performance in both document-level and aspect-level sentiment analysis.
    Proceedings of the 22nd ACM International Conference on Information & Knowledge Management; 10/2013
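The weak supervision described in this abstract can be pictured with a small sketch. The signature lists and the `tag_sentence` helper below are invented for illustration; the paper's actual term sets and its latent-variable structural model are not reproduced here.

```python
import re

# Hypothetical aspect signature terms: a sentence is weakly labeled with an
# aspect whenever one of that aspect's terms occurs in it.
ASPECT_SIGNATURES = {
    "food": {"dish", "taste", "flavor", "menu"},
    "service": {"waiter", "staff", "service", "friendly"},
    "price": {"price", "cheap", "expensive", "value"},
}

def tag_sentence(sentence):
    """Return the aspects whose signature terms occur in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return {a for a, terms in ASPECT_SIGNATURES.items() if tokens & terms}

labels = [tag_sentence(s) for s in [
    "The dish had a wonderful flavor.",
    "Our waiter was friendly and quick.",
]]
```

Such automatically tagged sentences would then serve as the training signal for the structural model, with no manual sentence-level annotation.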
  • Lei Fang, Minlie Huang
    ABSTRACT: In this paper, we present a structural learning model for joint sentiment classification and aspect analysis of text at various levels of granularity. Our model aims to identify highly informative, aspect-specific sentences in online customer reviews. The primary advantages of our model are two-fold: first, it performs document-level and sentence-level sentiment polarity classification jointly; second, it is able to find informative sentences that are closely related to certain aspects of a review, which may be helpful for aspect-level sentiment analysis tasks such as aspect-oriented summarization. The proposed method was evaluated on 9,000 Chinese restaurant reviews. Preliminary experiments demonstrate that our model obtains promising performance.
    Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2; 07/2012
  • Source
    Lei Fang, Minlie Huang, Xiaoyan Zhu
    ABSTRACT: Community-based question answering (cQA) services provide a convenient way for online users to share and exchange information and knowledge, which is highly valuable for information seeking. User interest and dedication act as the motivation that drives the interactive process of question answering. In this paper, we aim to address a key issue in cQA systems: routing newly asked questions to appropriate users who may provide high-quality answers. We incorporate answer quality and answer content to build a probabilistic question routing model. Our proposed model is capable of 1) differentiating and quantifying the authority of users for different topics or categories; 2) routing questions to users with the relevant expertise. Experimental results based on a large collection of data from Wenwen demonstrate that our model is effective and has promising performance.
    01/2012;
  • Source
    ABSTRACT: Most participatory web sites collect overall ratings (e.g., five stars) of products from their customers, reflecting the overall assessment of the products. However, it is more useful to present ratings of product features (such as price, battery, screen, and lens of digital cameras) to help customers make effective purchase decisions. Unfortunately, only a very few web sites have collected feature ratings. In this paper, we propose a novel approach to accurately estimate feature ratings of products. This approach selects user reviews that extensively discuss specific features of the products (called specialized reviews), using information distance of reviews on the features. Experiments on both annotated and real data show that overall ratings of the specialized reviews can be used to represent their feature ratings. The average of these overall ratings can be used by recommender systems to provide feature-specific recommendations that can better help users make purchasing decisions.
    Knowledge and Information Systems 01/2012; · 2.23 Impact Factor
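The core idea of the approach above can be sketched with a simplified proxy. The paper selects "specialized" reviews by information distance; the sketch below instead counts feature-term mentions, and the `feature_rating` helper and its threshold are illustrative assumptions, not the paper's method.

```python
# A review is treated as "specialized" for a feature if it mentions the
# feature's terms at least `min_mentions` times; the feature rating is then
# the mean overall rating of those reviews.
def feature_rating(reviews, feature_terms, min_mentions=2):
    """reviews: list of (overall_stars, text).  Return the mean overall
    rating of the specialized reviews, or None if there are none."""
    specialized = []
    for stars, text in reviews:
        words = text.lower().split()
        if sum(words.count(t) for t in feature_terms) >= min_mentions:
            specialized.append(stars)
    return sum(specialized) / len(specialized) if specialized else None

reviews = [
    (5, "great battery and the battery life is long"),
    (2, "battery died fast, battery is weak"),
    (4, "nice screen"),
]
rating = feature_rating(reviews, {"battery"})  # averages the first two reviews
```

The key observation from the paper survives in the sketch: the overall ratings of reviews that dwell on one feature approximate that feature's rating.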
  • Source
    ABSTRACT: Patenting is one of the most important ways to protect a company's core business concepts and proprietary technologies. Analyzing large volumes of patent data can uncover potential competitive or collaborative relations among companies in certain areas, which can provide valuable information for developing strategies for intellectual property (IP), R&D, and marketing. In this paper, we present a novel topic-driven patent analysis and mining system. Instead of merely searching over patent content, we focus on studying the heterogeneous patent network derived from the patent database, which is represented by several types of objects (companies, inventors, and technical content) jointly evolving over time. We design and implement a general topic-driven framework for analyzing and mining the heterogeneous patent network. Specifically, we propose a dynamic probabilistic model to characterize the topical evolution of these objects within the patent network. Based on this modeling framework, we derive several patent analytics tools that can be directly used for IP and R&D strategy planning, including a heterogeneous network co-ranking method, a topic-level competitor evolution analysis algorithm, and a method to summarize search results. We evaluate the proposed methods on a real-world patent database. The experimental results show that the proposed techniques clearly outperform the corresponding baseline methods.
    Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining; 01/2012
  • Source
    ABSTRACT: Patents are critical for a company to protect its core technologies. Effective patent mining in massive patent databases can provide companies with valuable insights to develop strategies for IP management and marketing. In this paper, we study a novel patent mining problem of automatically discovering core patents (i.e., patents with high novelty and influence in a domain). We address the unique patent vocabulary usage problem, which is not considered in traditional word-based statistical methods, and propose a topic-based temporal mining approach to quantify a patent's novelty and influence. Comprehensive experimental results on real-world patent portfolios show the effectiveness of our method.
    Proceedings of the 21st ACM international conference on Information and knowledge management; 01/2012
  • Source
    Minlie Huang, Aurélie Névéol, Zhiyong Lu
    ABSTRACT: Due to the high cost of manual curation of key aspects from the scientific literature, automated methods for assisting this process are greatly desired. Here, we report a novel approach to facilitate MeSH indexing, the challenging task of assigning MeSH terms to MEDLINE citations for their archiving and retrieval. Unlike previous methods for automatic MeSH term assignment, we reformulate the indexing task as a ranking problem such that relevant MeSH headings are ranked higher than irrelevant ones. Specifically, for each document we retrieve 20 neighbor documents, obtain a list of MeSH main headings from the neighbors, and rank the MeSH main headings using ListNet, a learning-to-rank algorithm. We trained our algorithm on 200 documents and tested on a previously used benchmark set of 200 documents and a larger dataset of 1000 documents. Tested on the benchmark dataset, our method achieved a precision of 0.390, recall of 0.712, and mean average precision (MAP) of 0.626. In comparison to the state of the art, we observe statistically significant improvements as large as 39% in MAP (p-value <0.001). Similar significant improvements were also obtained on the larger document set. Experimental results show that our approach makes the most accurate MeSH predictions to date, which suggests its great potential to make a practical impact on MeSH indexing. Furthermore, as discussed, the proposed learning framework is robust and can be adapted to many other similar tasks beyond MeSH indexing in the biomedical domain. All data sets are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/indexing.
    Journal of the American Medical Informatics Association 05/2011; 18(5):660-7. · 3.57 Impact Factor
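One signal behind this neighbor-based ranking can be sketched as follows. Scoring a candidate heading by the total similarity of the neighbor documents that carry it is only one of several features; in the paper, ListNet learns to combine such features, which the sketch does not attempt. The `rank_headings` helper is a hypothetical name.

```python
# Score each candidate MeSH heading by the summed similarity of the
# retrieved neighbor documents that are indexed with it, then rank.
def rank_headings(neighbors):
    """neighbors: list of (similarity, set_of_mesh_headings) pairs
    for the ~20 retrieved neighbor documents."""
    scores = {}
    for sim, headings in neighbors:
        for h in headings:
            scores[h] = scores.get(h, 0.0) + sim
    # rank by descending score, breaking ties alphabetically
    return sorted(scores, key=lambda h: (-scores[h], h))

neighbors = [
    (0.9, {"Humans", "Neoplasms"}),
    (0.7, {"Humans", "Mice"}),
    (0.4, {"Mice"}),
]
ranking = rank_headings(neighbors)  # "Humans" scores 1.6 and ranks first
```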
  • Source
    Minlie Huang, Jingchen Liu, Xiaoyan Zhu
    ABSTRACT: MOTIVATION: Linking gene mentions in an article to entries of biological databases can greatly facilitate indexing and querying of the biological literature. Due to the high ambiguity of gene names, this task is particularly challenging. Manual annotation for this task is costly, time-consuming, and labor-intensive. Therefore, providing assistive tools to facilitate the task is of high value. RESULTS: We developed GeneTUKit, a document-level gene normalization software package for full-text articles. This software employs both the local context surrounding gene mentions and the global context from the whole full-text document. It can normalize genes of different species simultaneously. When participating in BioCreAtIvE III, the system obtained good results among 37 runs: it was ranked first, fourth and seventh in terms of TAP-20, TAP-10 and TAP-5, respectively, on the 507 full-text test articles. Availability and implementation: The software is available at http://www.qanswers.net/GeneTUKit/.
    Bioinformatics 02/2011; 27(7):1032-3. · 5.47 Impact Factor
  • Source
    ABSTRACT: We report the Gene Normalization (GN) challenge in BioCreative III, where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation-Maximization (EM) approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth, we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
    BMC Bioinformatics 01/2011; 12 Suppl 8:S2. · 3.02 Impact Factor
  • Source
    ABSTRACT: Though news readers can easily access a large number of news articles from the Internet, they can be overwhelmed by the quantity of information available, making it hard to get a concise, global picture of a news topic. In this paper we propose a novel method to address this problem. Given a set of articles for a given news topic, the proposed method models theme variation through time and identifies the breakpoints, which are time points when decisive changes occur. For each breakpoint, a brief summary is automatically constructed based on articles associated with the particular time point. Summaries are then ordered chronologically to form a timeline overview of the news topic. In this fashion, readers can easily track various news topics efficiently. We have conducted experiments on 15 popular topics in 2010. Empirical experiments show the effectiveness of our approach and its advantages over other approaches.
    11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011; 01/2011
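The breakpoint idea above can be sketched with a crude stand-in for the paper's theme-variation model: flag a breakpoint whenever the word distribution of one day's articles shifts sharply from the previous day's. The `breakpoints` helper and the 0.5 threshold are illustrative assumptions, not the paper's model.

```python
import math
from collections import Counter

def cosine_dist(a, b):
    """Cosine distance between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def breakpoints(daily_articles, threshold=0.5):
    """Return day indices where the vocabulary shifts sharply from the
    previous day -- candidate breakpoints for timeline summaries."""
    vecs = [Counter(" ".join(day).lower().split()) for day in daily_articles]
    return [i for i in range(1, len(vecs))
            if cosine_dist(vecs[i - 1], vecs[i]) > threshold]

days = [["oil spill gulf"], ["oil spill cleanup gulf"], ["trial verdict court"]]
```

For each detected breakpoint, the articles of that day would then be summarized and the summaries ordered chronologically to form the timeline.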
  • BMC Bioinformatics, special issue on BioCreative III. 01/2011.
  • Source
    Feng Jin, Minlie Huang, Xiaoyan Zhu
    ABSTRACT: Although the goal of traditional text summarization is to generate summaries with diverse information, most such applications have no explicit definition of the information structure. Thus, it is difficult to generate truly structure-aware summaries because the information structure that should guide summarization is unclear. In this paper, we present a novel framework to generate guided summaries for product reviews. A guided summary has an explicitly defined structure, which comes from the important aspects of products. The proposed framework attempts to maximize expected aspect satisfaction during summary generation. The importance of an aspect to a generated summary is modeled using Labeled Latent Dirichlet Allocation. Experimental results on consumer reviews of cars show the effectiveness of our method.
    Journal of Computer Science and Technology 01/2011; 26:676-684. · 0.48 Impact Factor
  • Source
    Fangtao Li, Minlie Huang, Yi Yang, Xiaoyan Zhu
    ABSTRACT: In the past few years, sentiment analysis and opinion mining have become popular and important tasks. These studies all assume that their opinion resources are genuine and trustworthy. However, they may encounter the fake opinion, or opinion spam, problem. In this paper, we study this issue in the context of our product review mining system. On product review sites, people may write fake reviews, called review spam, to promote their products or defame their competitors' products. It is important to identify and filter out such review spam. Previous work focuses only on heuristic rules, such as helpfulness voting or rating deviation, which limits the performance of this task. In this paper, we exploit machine learning methods to identify review spam. To this end, we manually build a spam collection from our crawled reviews. We first analyze the effect of various features on spam identification. We also observe that review spammers consistently write spam. This provides another view for identifying review spam: we can check whether the author of a review is a spammer. Based on this observation, we propose a two-view semi-supervised method, co-training, to exploit the large amount of unlabeled data. The experimental results show that our proposed method is effective: our machine learning methods achieve significant improvements over the heuristic baselines.
    IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011; 01/2011
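The two-view co-training loop mentioned in the abstract can be sketched in miniature. The sketch below is a toy: each "view" is a single number (say, a review-text score and a reviewer-behavior score) and each per-view classifier is just a midpoint threshold, whereas the paper uses rich feature sets and real classifiers. All names and data are hypothetical.

```python
def train_threshold(labeled, view):
    """Toy per-view 'classifier': the midpoint between the class means."""
    spam = [x[view] for x, y in labeled if y == 1]
    ham = [x[view] for x, y in labeled if y == 0]
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def co_train(labeled, unlabeled, rounds=2):
    """Each round, each view labels the unlabeled item it is most
    confident about and adds it to the shared labeled pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                return labeled
            t = train_threshold(labeled, view)
            # confidence = distance from the decision threshold
            best = max(unlabeled, key=lambda x: abs(x[view] - t))
            unlabeled.remove(best)
            labeled.append((best, int(best[view] > t)))
    return labeled

seed = [((0.9, 0.8), 1), ((0.1, 0.2), 0)]   # labeled: (features, is_spam)
pool = [(0.95, 0.9), (0.05, 0.1)]           # unlabeled reviews
result = co_train(seed, pool)
```

The design point carried over from the paper is that review-level and reviewer-level evidence form two conditionally independent views, so each can bootstrap the other from unlabeled data.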
  • Hongtao Zhang, Minlie Huang, Xiaoyan Zhu
    ABSTRACT: A large number of protein-protein interactions (PPIs) lie buried in the massive body of biomedical articles published over the years. This has led to the development of automatic PPI extraction methods. However, existing methods based on supervised machine learning still face some challenges: (1) the feature space exploited in these methods is very sparse; and (2) the data used for training are imbalanced with respect to the categories to be classified. In this paper, we first construct rich and compact features to alleviate the issue of feature sparseness. With these features, our method outperforms baselines by up to 9.58% in F-score on the original AIMed corpus. Furthermore, we propose a data sampling strategy based on under-sampling to address the class imbalance problem. To re-balance the data distribution, samples of the majority class are removed iteratively according to the prediction results. By this means, our method achieves a further 2.49% improvement in F-score on the original AIMed corpus.
    4th International Conference on Biomedical Engineering and Informatics, BMEI 2011, Shanghai, China, October 15-17, 2011; 01/2011
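The prediction-guided under-sampling described above can be sketched with toy 1-D data. The paper removes majority-class samples according to a trained model's predictions; the sketch below stands in a distance-based confidence function, and `iterative_undersample` is a hypothetical name.

```python
# Iteratively drop the majority-class samples the current model is most
# confident about, until the two classes are roughly balanced.
def iterative_undersample(majority, minority, confidence, rounds=3, drop=1):
    for _ in range(rounds):
        if len(majority) <= len(minority):
            break                      # classes are balanced; stop early
        scores = [confidence(x, minority) for x in majority]
        # remove the `drop` highest-confidence majority samples
        keep = sorted(range(len(majority)), key=lambda i: scores[i])[:-drop]
        majority = [majority[i] for i in sorted(keep)]
    return majority

# Stand-in confidence: distance from the minority-class mean.
conf = lambda x, mino: abs(x - sum(mino) / len(mino))
balanced = iterative_undersample([5.0, 4.0, 3.0, 1.1, 1.0], [1.0, 0.9], conf)
```

Removing the "easy" majority samples first keeps the borderline examples that actually shape the decision boundary, which is the intuition behind the paper's iterative scheme.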
  • Source
    Jingchen Liu, Minlie Huang, Xiaoyan Zhu
    ABSTRACT: The linear-chain Conditional Random Field (CRF) has been applied to the Named Entity Recognition (NER) task in many biomedical text mining and information extraction systems. However, the linear-chain CRF cannot capture long-distance dependencies, which are very common in the biomedical literature. In this paper, we propose a novel study of capturing such long-distance dependencies by defining two principles for constructing skip-edges in a skip-chain CRF: linking similar words and linking words having typed dependencies. The approach is applied to recognizing gene/protein mentions in the literature. When tested on the BioCreAtIvE II Gene Mention dataset and the GENIA corpus, the approach contributes significant improvements over the linear-chain CRF. We also present an in-depth error analysis of inconsistent labeling and study the influence of the quality of skip edges on labeling performance.
    Proceedings of the 2010 Workshop on Biomedical Natural Language Processing; 07/2010
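The first skip-edge principle above can be sketched in a few lines, linking repeated occurrences of the same word so their labels can be encouraged to agree. This simplification only matches identical lower-cased tokens; the paper's first principle links *similar* words more generally, and its second principle (typed dependencies) would require a parser. The `skip_edges` helper and the gap limit are illustrative assumptions.

```python
def skip_edges(tokens, max_gap=10):
    """Return (i, j) index pairs connecting repeated tokens that are at
    most `max_gap` positions apart -- skip-chain edges along which a CRF
    can enforce label consistency."""
    edges, seen = [], {}
    for i, tok in enumerate(tokens):
        w = tok.lower()
        if w in seen and i - seen[w] <= max_gap:
            edges.append((seen[w], i))
        seen[w] = i
    return edges

sentence = ["p53", "activates", "MDM2", "while", "p53", "is", "degraded"]
edges = skip_edges(sentence)  # links the two "p53" mentions
```

In the skip-chain CRF, inference over these extra edges lets a confident "gene" label on one mention propagate to its distant repetition.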
  • Source
    Feng Jin, Minlie Huang, Xiaoyan Zhu
    ABSTRACT: This paper presents a comparative study of two key problems in extractive summarization: the ranking problem and the selection problem. To this end, we present a systematic comparison of different learning-to-rank algorithms and of different selection strategies. This is the first work to provide a systematic analysis of these problems. Experimental results on two benchmark datasets demonstrate three findings: (1) pairwise and listwise learning-to-rank algorithms outperform the baselines significantly; (2) there is no significant difference among the learning-to-rank algorithms; and (3) the integer linear programming selection strategy generally outperforms the Maximum Marginal Relevance and Diversity Penalty strategies.
    COLING 2010, 23rd International Conference on Computational Linguistics, Posters Volume, 23-27 August 2010, Beijing, China; 01/2010
  • Source
    ABSTRACT: This paper addresses the problem of entity linking. Specifically, given an entity mentioned in unstructured text, the task is to link this entity to an entry stored in an existing knowledge base. This is an important task for information extraction: it can serve as a convenient gateway to encyclopedic information and can greatly improve web users' experience. Previous learning-based solutions mainly adopt a classification framework; however, it is more natural to treat entity linking as a ranking problem. In this paper, we propose a learning-to-rank algorithm for entity linking that effectively utilizes the relationship information among the candidates when ranking. The experimental results on the TAC 2009 dataset demonstrate the effectiveness of our proposed framework. The proposed method achieves an 18.5% improvement in accuracy over the classification models for those entities that have corresponding entries in the knowledge base. The overall performance of the system is also better than that of state-of-the-art methods.
    Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA; 01/2010
  • Source
    ABSTRACT: In this paper, we focus on object-feature-based review summarization. Different from most previous work based on linguistic rules or statistical methods, we formulate the review mining task as a joint structure tagging problem. We propose a new machine learning framework based on Conditional Random Fields (CRFs) that can employ rich features to jointly extract positive opinions, negative opinions, and object features from review sentences. The linguistic structure can be naturally integrated into the model representation. Besides the linear-chain structure, we also investigate a conjunction structure and a syntactic tree structure in this framework. Through extensive experiments on movie review and product review data sets, we show that structure-aware models outperform many state-of-the-art approaches to review mining.
    COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China; 01/2010
  • Source
    Feng Jin, Minlie Huang, Xiaoyan Zhu
    ABSTRACT: The TAC 2010 Guided Summarization task requires participants to generate coherent summaries with the guidance of predefined categories and aspects. In this paper, we present our two extractive summarization systems. In the first system, we employ a topic model, Labeled LDA, to model the aspects. The correspondence between the aspects and the topics in Labeled LDA is established through identifying indicative words for each aspect. After training and inference of Labeled LDA, we get the salience scores of concepts (named entities and bigrams) from the topic-concept distributions. Then we use an Integer Linear Programming (ILP) based maximal coverage method to generate summaries. In the other system, which also uses ILP and maximal coverage during sentence extraction, the salience of concepts is obtained using a pairwise learning-to-rank algorithm, RankNet. The training samples are constructed based on the human-annotated Pyramid data.
    01/2010;
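The maximal-coverage objective used by both systems can be sketched exactly for tiny inputs by brute force. TAC-scale systems solve the same objective with an ILP solver, and the concept weights would come from Labeled LDA or RankNet as in the paper; the `max_coverage_summary` helper and its data are illustrative assumptions.

```python
from itertools import combinations

def max_coverage_summary(sentences, weights, budget):
    """sentences: list of (length, set_of_concepts).  Pick the subset of
    sentences within the length budget whose covered concepts, each
    counted once, have maximum total weight."""
    best, best_score = [], 0
    for r in range(1, len(sentences) + 1):
        for combo in combinations(range(len(sentences)), r):
            if sum(sentences[i][0] for i in combo) > budget:
                continue
            covered = set().union(*(sentences[i][1] for i in combo))
            score = sum(weights[c] for c in covered)
            if score > best_score:
                best, best_score = list(combo), score
    return best, best_score

sents = [(5, {"a", "b"}), (4, {"b", "c"}), (6, {"a", "c", "d"})]
w = {"a": 2, "b": 1, "c": 1, "d": 3}
picked, score = max_coverage_summary(sents, w, budget=10)
```

Counting each covered concept only once is what discourages redundancy: adding a sentence whose concepts are already covered gains nothing toward the objective.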
  • Fangtao Li, Minlie Huang, Xiaoyan Zhu
    Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010; 01/2010