Article

Learning Document-Level Semantic Properties from Free-Text Annotations.


Abstract

This paper presents a new method for inferring the semantic properties of documents by leveraging free-text keyphrase annotations. Such annotations are becoming increasingly abundant due to the recent dramatic growth in semi-structured, user-generated online content. One especially relevant domain is product reviews, which are often annotated by their authors with pros/cons keyphrases such as "a real bargain" or "good value." These annotations are representative of the underlying semantic properties; however, unlike expert annotations, they are noisy: lay authors may use different labels to denote the same property, and some labels may be missing. To learn using such noisy annotations, we find a hidden paraphrase structure which clusters the keyphrases. The paraphrase structure is linked with a latent topic model of the review texts, enabling the system to predict the properties of unannotated documents and to effectively aggregate the semantic properties of multiple reviews. Our approach is implemented as a hierarchical Bayesian model with joint inference. We find that joint inference increases the robustness of the keyphrase clustering and encourages the latent topics to correlate with semantically meaningful properties. Multiple evaluations demonstrate that our model substantially outperforms alternative approaches for summarizing single and multiple documents into a set of semantically salient keyphrases.
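The paraphrase-clustering intuition behind the model can be illustrated with a minimal sketch. The phrases and the greedy word-overlap heuristic below are invented for illustration; the paper itself infers the cluster structure jointly with a Bayesian topic model rather than with any such standalone heuristic.

```python
# Illustrative heuristic only: group noisy pros/cons keyphrases into
# paraphrase clusters by word overlap. The paper's actual model infers
# this clustering jointly with a latent topic model of the review text.

def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    return len(a & b) / len(a | b)

def cluster_keyphrases(phrases, threshold=0.3):
    """Greedily assign each phrase to the first cluster whose
    representative (first member) overlaps enough with it."""
    clusters = []
    for p in phrases:
        words = set(p.lower().split())
        for c in clusters:
            rep = set(c[0].lower().split())
            if jaccard(words, rep) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

phrases = ["a real bargain", "real bargain", "good value",
           "great value", "poor battery life"]
clusters = cluster_keyphrases(phrases)
# Paraphrases such as "a real bargain" / "real bargain" land together.
```

In the paper, by contrast, the cluster assignments and the topics reinforce each other during joint inference, which is what makes the clustering robust to noisy, author-chosen labels.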


... Then, given an aspect, a word is extracted according to another multinomial distribution, controlled by another Dirichlet prior β. Among existing works employing these models are the extraction of global aspects (such as the brand of a product) and local aspects (such as the property of a product [62]), the extraction of key phrases [6], the rating of multi-aspects [63], and the summarization of aspects and sentiments [33]. [67] employed the maximum entropy method to train a switch variable based on POS tags of words and used it to separate aspect and sentiment words. ...
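The generative story in the snippet (draw an aspect from a document-level multinomial, then draw a word from that aspect's multinomial) can be written out directly. The distributions below are toy parameters invented for illustration, not learned ones, and the Dirichlet priors are assumed to have already been integrated out into these fixed multinomials.

```python
import random

random.seed(0)

# Toy document-level aspect distribution (theta) and per-aspect word
# distributions (phi), as in the LDA-style generative story above.
theta = {"picture": 0.6, "battery": 0.4}
phi = {
    "picture": {"sharp": 0.5, "colors": 0.5},
    "battery": {"short": 0.7, "life": 0.3},
}

def generate_word():
    """Draw an aspect, then a word conditioned on that aspect."""
    aspect = random.choices(list(theta), weights=theta.values())[0]
    dist = phi[aspect]
    word = random.choices(list(dist), weights=dist.values())[0]
    return aspect, word

sample = [generate_word() for _ in range(5)]
```

Inference in these models runs this story in reverse: given only the words, recover plausible aspect assignments and distributions.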
... Let us assume that the input layer has dimension L × d where L is the length of the sentence. Then the convolution operation given by (6) will result in a hidden layer of Z groups each of dimension (L − k + 1)×(d − d + 1). These learned kernel weights are shared among all hidden units in a particular group. ...
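The output dimensions in the snippet follow from the usual valid-convolution arithmetic: a kernel of height k and width w slid over an L × d input yields (L − k + 1) × (d − w + 1) values per kernel. A small sketch (the full-width-kernel default is an assumption, though it is the common sentence-CNN setup):

```python
def conv_output_shape(L, d, k, kernel_width=None, Z=1):
    """Valid-convolution output size for an L x d input.

    A kernel of height k and width kernel_width (defaulting to the
    full embedding width d, as is common for sentence CNNs) produces
    an (L - k + 1) x (d - kernel_width + 1) feature map; Z kernels
    give Z such maps.
    """
    if kernel_width is None:
        kernel_width = d
    return Z, L - k + 1, d - kernel_width + 1

# 7-word sentence, 5-dim embeddings, kernel height 3, full-width kernel:
shape = conv_output_shape(L=7, d=5, k=3)  # one 5 x 1 feature map
```

Because the kernel weights are shared across all positions, each of the Z groups in the snippet is produced by sliding a single kernel over the whole sentence.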
Preprint
Subjectivity detection is the task of identifying objective and subjective sentences. Objective sentences are those which do not exhibit any sentiment. So, it is desired for a sentiment analysis engine to find and separate the objective sentences for further analysis, e.g., polarity detection. In subjective sentences, opinions can often be expressed on one or multiple topics. Aspect extraction is a subtask of sentiment analysis that consists in identifying opinion targets in opinionated text, i.e., in detecting the specific aspects of a product or service the opinion holder is either praising or complaining about.
... The unsupervised topic models extract incoherent topics as well along with the coherent topics [1][2][3][4][5]. Different extensions of topic models are proposed to improve their accuracy. ...
... LDA holds the advantage of identifying and aggregating topics together, where they are separate steps in other approaches, for example, dictionary based and relation based. The extensions of unsupervised LDA used for aspect and sentiment terms extraction as topics are [1][2][3][4][5]. The hybrid topic models used for topic extraction and sentiment analysis are [10][11][12][13][14][15]. ...
Article
Full-text available
Lifelong machine learning (LML) models learn from experience by maintaining a knowledge-base, without user intervention. Unlike traditional single-domain models, they can easily scale up to explore big data. The existing LML models have high data dependency, consume more resources, and do not support streaming data. This paper proposes an online LML model (OAMC) to support streaming data with reduced data dependency. By engineering the knowledge-base and introducing new knowledge features, the learning pattern of the model is improved for data arriving in pieces. OAMC improves accuracy, measured as topic coherence, by 7% for streaming data while reducing the processing cost by half.
... A topic is basically an aspect. Topic models have been used for the task by many researchers [5,10,16,21,23,25,26,32,33,37]. However, none of these models mines must-links or cannot-links automatically to help modeling. ...
... where λ_{w',w} is the promotion matrix in Equation 5, n_{k,w} refers to the number of times that term w appears under topic k, and β is the predefined Dirichlet hyper-parameter. ...
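The quantities in the snippet combine into a promoted, smoothed topic-word weight. The sketch below is a guess at how such a promotion might be applied (the counts and promotion values are invented, and the snippet's actual Equation 5 is not reproduced here): counts of related words are folded into a word's own count via the promotion matrix before Dirichlet smoothing.

```python
# Sketch: promote the topic-word count of w using counts of related
# words w' via a promotion matrix lam[w_prime][w], then smooth with
# the Dirichlet hyper-parameter beta. All values are invented.

beta = 0.01

# n[k][w]: times term w was assigned to topic k (invented counts).
n = {0: {"battery": 10, "life": 6, "screen": 1}}

# Invented promotion matrix: "life" and "battery" promote each other.
lam = {
    "battery": {"battery": 1.0, "life": 0.5, "screen": 0.0},
    "life":    {"battery": 0.5, "life": 1.0, "screen": 0.0},
    "screen":  {"battery": 0.0, "life": 0.0, "screen": 1.0},
}

def promoted_weight(k, w):
    """Unnormalized weight of term w under topic k after promotion."""
    promoted = sum(lam[w_prime][w] * n[k][w_prime] for w_prime in n[k])
    return promoted + beta

w_batt = promoted_weight(0, "battery")  # 10 + 0.5*6 + beta
w_scr = promoted_weight(0, "screen")    # 1 + beta
```

The effect is that semantically related terms pull each other into the same topic, which is the purpose such promotion matrices serve in knowledge-based topic models.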
Conference Paper
Full-text available
Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. In practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
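The must-link mining step described above can be sketched simply: word pairs that repeatedly co-occur among the top words of topics learned in past domains are taken as must-links. The toy past topics and the support threshold below are invented for illustration; the actual algorithm also handles wrong knowledge and transitivity, which this sketch does not.

```python
from itertools import combinations
from collections import Counter

# Toy "prior topics" from past domains: each is a topic's top words.
past_topics = [
    ["price", "cost", "cheap"],
    ["price", "cost", "expensive"],
    ["battery", "life", "charge"],
    ["battery", "life", "power"],
    ["price", "battery"],  # noisy topic mixing two aspects
]

def mine_must_links(topics, min_support=2):
    """Word pairs appearing together in >= min_support past topics
    become must-link candidates (both words belong in one topic)."""
    counts = Counter()
    for topic in topics:
        for pair in combinations(sorted(set(topic)), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_support}

links = mine_must_links(past_topics)
# Pairs seen only once (e.g. the noisy price/battery pair) are dropped.
```

Requiring support across multiple past domains is what filters out spurious pairs like the price/battery co-occurrence above.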
... We argue that placing minimal requirements on user-provided supervision is critical in work that seeks to summarize the content of the quickly typed opinions that characterize mobile and social online reviews. (Titov and McDonald 2008) and (Mcauliffe and Blei 2008) incorporate some supervision in the form of scores for each aspect of the entity being reviewed, and (Branavan et al. 2009) jointly model review text and user-defined keyphrases. ...
Article
Mobile and location-based social media applications provide platforms for users to share brief opinions about products, venues, and services. These quickly typed opinions, or microreviews, are a valuable source of current sentiment on a wide variety of subjects. However, there is currently little research on how to mine this information to present it back to users in an easily consumable way. In this paper, we introduce the task of microsummarization, which combines sentiment analysis, summarization, and entity recognition in order to surface key content to users. We explore unsupervised and supervised methods for this task, and find we can reliably extract relevant entities and the sentiment targeted towards them using crowdsourced labels as supervision. In an end-to-end evaluation, we find our best-performing system is vastly preferred by judges over a traditional extractive summarization approach. This work motivates an entirely new approach to summarization, incorporating both sentiment analysis and item extraction for modernized, at-a-glance presentation of public opinion.
... Topic models have been extended in a variety of ways, leading to models such as supervised models, hybrid models, transfer learning models, semi-supervised models, knowledge-based models and lifelong learning models [19,24,25]. In supervised models, the documents in the training data are tagged with a manually provided set of topic labels [26,27]. ...
Article
Full-text available
Topic models extract latent concepts from texts in the form of topics. Lifelong topic models extend topic models by learning topics continuously, based on accumulated knowledge from the past that is updated as new information becomes available. Hierarchical topic modeling extends topic modeling by extracting topics and organizing them into a hierarchical structure. In this study, we combine the two and introduce hierarchical lifelong topic models. Hierarchical lifelong topic models not only allow examining the topics at different levels of granularity but also allow continuously adjusting the granularity of the topics as more information becomes available. A fundamental issue in hierarchical lifelong topic modeling is the extraction of rules that preserve the hierarchical structural information and are continuously updated based on new information. To address this issue, we introduce a network-communities-based rule mining approach for hierarchical lifelong topic models (NHLTM). The proposed approach extracts hierarchical structural information among the rules by representing textual documents as graphs and analyzing the underlying communities in the graph. Experimental results indicate improvement of the hierarchical topic structures in terms of topic coherence, which increases from general to specific topics.
... For example, in the task of finding product features or aspects from online reviews for opinion mining, most products have fewer than 100 comments (documents) on the review site [4]. As we will see in the experimental section, given 100 reviews, the classic topic model LDA produces very poor results [5]. ...
Article
Full-text available
Topic modeling has been widely used to mine topics from documents. However, a key disadvantage of topic modeling is that it requires large amounts of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. In practice, many document collections do not have that many documents. Given a small number of documents, the topics generated by the classic topic model LDA are very poor. Even with large amounts of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed which require human users to provide some prior domain knowledge to guide the model to generate better topics. Our research takes a completely different approach. We propose to learn as humans do, i.e., retain the results learned in the past and use them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from past learning/modeling results, and then use it to guide model inference to produce more coherent topics. This approach is possible because big data is readily available on the Web. The algorithm mines two forms of knowledge: must-links (meaning two words should be in the same topic) and cannot-links (meaning two words should not be in the same topic). Two issues of automatically mined knowledge are addressed, namely erroneous knowledge and knowledge transitivity. Experimental results using review documents from 100 product areas show that the proposed method yields a significant improvement over state-of-the-art baselines.
... • Using unsupervised methods based on detecting the topics present in a document according to Latent Dirichlet Allocation (LDA) [39] and other proposals derived from it [40]-[42]. ...
Article
Full-text available
Aspect extraction in textual opinions is a very important task within sentiment analysis or opinion mining, allowing greater accuracy when analyzing information and thus contributing to decision making. Deep learning includes several algorithms and strategies that have obtained relevant results in various natural language processing tasks. There are several review papers on sentiment analysis that address deep learning as one of the existing techniques for extracting aspects; however, there are no review papers focused exclusively on the use of deep learning in sentiment analysis. The main objective of this review paper is to offer a critical and comparative analysis of the main proposals and review works that employ deep learning strategies for aspect extraction, focusing on the representation approaches, principal models, obtained results and data sets used in experiments. We analyze 53 papers published from 2011 to 2018, highlighting their main successes, gaps, and research challenges. Finally, we propose some future research directions.
... Several topic models that are extensions of LDA have been proposed to perform aspect extraction. Branavan et al. (2009) proposed a method that makes use of the aspect descriptions found in the pros and cons fields of the reviews. The model consists of two parts: the first clusters the pros and cons keyphrases into aspect categories based on distributional similarity, and the second builds a topic model that indicates the aspects in the review text. ...
Article
This work presents a comparative analysis of the main approaches used in the task of Aspect Extraction from comments about products and services on web sites. Adaptations of four aspect extraction methods were implemented and evaluated using two distinct corpora: one in Portuguese and one in English. In the experiments, the approach using supervised learning (convolutional neural networks) obtained better results than the others.
... Branavan et al. [11] use the Pros and Cons keyphrases that reviewers attach to their reviews, e.g., Pros: "innovative story" or Cons: "cheap CGI", to find aspects in the detailed review text. ...
Thesis
Online reviewing websites help users decide what to buy or places to go. These platforms allow users to express their opinions using numerical ratings as well as textual comments. The numerical ratings give a coarse idea of the service. On the other hand, textual comments give full details, which is tedious for users to read. In this dissertation, we develop novel methods and algorithms to generate personalized, aspect-based summaries of movie reviews for a given user. The first problem we tackle is extracting a set of words related to an aspect from movie reviews. Our evaluation shows that our method is able to extract even unpopular terms that represent an aspect, such as compound terms or abbreviations, as opposed to the methods from the related work. We then study the problem of annotating sentences with aspects, and propose a new method that annotates sentences based on the similarity between the aspect signature and the terms in the sentence. The third problem we tackle is the generation of personalized, aspect-based summaries. We propose an optimization algorithm to maximize the coverage of the aspects the user is interested in and the representativeness of sentences in the summary, subject to length and similarity constraints. Finally, we perform three user studies that show that the approach we propose outperforms the state-of-the-art method for generating summaries.
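The coverage-maximization step described above admits a simple greedy sketch. The sentences, aspect labels, and gain function below are invented for illustration; the dissertation's algorithm also enforces representativeness and similarity constraints, which this sketch omits.

```python
def greedy_summary(sentences, user_aspects, max_len):
    """Greedily pick sentences that cover new user aspects within a
    character budget. sentences: list of (text, aspect_set) pairs.
    Invented gain: number of still-uncovered user aspects added."""
    covered, summary, used = set(), [], 0
    remaining = list(sentences)
    while remaining:
        best = max(remaining,
                   key=lambda s: len((s[1] & user_aspects) - covered))
        gain = len((best[1] & user_aspects) - covered)
        if gain == 0 or used + len(best[0]) > max_len:
            break
        summary.append(best[0])
        covered |= best[1] & user_aspects
        used += len(best[0])
        remaining.remove(best)
    return summary

sents = [
    ("Great acting and score.", {"acting", "music"}),
    ("The plot drags.", {"plot"}),
    ("Stunning visuals.", {"visuals"}),
]
summary = greedy_summary(sents, user_aspects={"acting", "plot"},
                         max_len=60)
# Only sentences covering the user's aspects make it into the summary.
```

Greedy selection is a standard baseline for such coverage objectives because the marginal gain of a sentence shrinks as aspects get covered, making the objective submodular.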
... In (Pang and Lee, 2005), polarity was changed from binary (positive and negative) to a multi-point scale (one to five stars). Data sources used for models are generally collected from tweets, blogs, and reviews about movies and products, as in (Pang and Lee, 2005) and (Branavan et al., 2009). Sentiment analysis can also be useful for politics. ...
Preprint
Due to the increased availability of online reviews, sentiment analysis has witnessed booming interest from researchers. Sentiment analysis is the computational treatment of sentiment, used to extract and understand the opinions of authors. While many systems were built to predict the sentiment of a document or a sentence, many others provide the necessary detail on various aspects of the entity (i.e., aspect-based sentiment analysis). Most of the available data resources are tailored to English and the other popular European languages. Although Persian is a language with more than 110 million speakers, to the best of our knowledge there is no public dataset on aspect-based sentiment analysis in Persian. This paper provides a manually annotated Persian dataset, Pars-ABSA, which is verified by 3 native Persian speakers. The dataset consists of 5114 positive, 3061 negative and 1827 neutral data samples from 5602 unique reviews. Moreover, as a baseline, this paper reports the performance of some state-of-the-art aspect-based sentiment analysis methods, with a focus on deep learning, on Pars-ABSA. The obtained results are impressive compared to similar state-of-the-art results on English.
... The aspect-sentiment joint model is based on the pLSA model; it extracts topics and distributes them further among aspect, positive-sentiment and negative-sentiment models [15]. MGLDA traverses both global and local topics at the document and sliding-window level [16]. The global topics focus on entities and brands while the local topics capture aspects of them. ...
Conference Paper
Full-text available
Aspect-based Sentiment Analysis (ABSA) explores the strong and weak aspects of a product. Many online platforms allow users to review commercial products, while others aggregate those opinions across millions of reviews at the aspect level. Such analysis is of high value to potential customers and manufacturers for making profitable decisions. However, the existing ABSA models do not highlight the reasons behind the strengths and weaknesses of the aspects. Moving a step forward, opinion reason mining explores the reasons for the aspects being appreciated or criticized. We propose an opinion reason mining framework, ORMFW, that uses a topic model to generate aspects as groups of aspect terms, which are refined using paradigmatic word associations. Polarity is evaluated for each aspect using a dictionary-based approach. Furthermore, the framework incorporates syntagmatic word associations to map the aspects to their respective reason terms for a given sentiment polarity. Results on a Twitter dataset reveal that the proposed ORMFW framework efficiently and effectively identifies the prominent opinion reasons in relation to their aspects.
... [2,3,4,5] break the assumption that both documents and terms are exchangeable; Sammut, C. and G.I. Webb [6] developed a different kind of Dirichlet process so that LDA could be carried out with a nonparametric Bayesian method; Yang Bao et al. [7] proposed a sent-LDA model and used it to discover and quantify risk types simultaneously. As for its applications, LDA performs well in the sentiment analysis field [8,9,10], topic clustering [11,12] and sequential text mining [13], among others. ...
Article
Full-text available
The financial system plays a crucial role in the development of countries, which makes its stability a hotly debated topic around the world. However, traditional indices that help measure financial stability, such as quantiles, leverage ratio and liquidity, have intrinsic shortcomings. For example, these numerical indices are often unavailable and one-sided. Therefore, finding a new approach to quantifying and visualizing financial stability is necessary and desirable. Unlike numerical data, textual data is more readily available and usually carries more abundant information and intuitive meaning. Since textual data is not inherently visual, it is vital to find out what the texts talk about. Latent Dirichlet Allocation (LDA) is one of the most effective approaches to this goal. In order to apply LDA to measure financial stability, the China Financial Stability Report is selected for an empirical analysis. The results are as follows. First, it is reasonable to apply the LDA model to analyze the China Financial Stability Report. Second, by dividing the core terms of every topic into basic terms and particular terms, we can draw pictures of every branch of finance. We can also analyze the rank of topics over 5 years or within every single year, yielding a design matrix with which financial stability tendencies can be studied. Finally, the macro-environment in finance can be depicted easily using a word cloud.
... 16 Traditional unstructured (free-text) reports can be more detailed, explicit, and representative of real-world findings, but they can be incomplete, unclear, and not easily converted into a computable format. 5 Therefore, in healthcare, to increase message clarity, standardize healthcare practice, and increase data interoperability across different systems, clinicians and pathologists are increasingly using structured reporting formats. 9,17,18,23 In 2005, the World Small Animal Veterinary Association (WSAVA) Gastrointestinal (GI) International Standardization Group took the responsibility of standardizing the histologic evaluation of the gastrointestinal tract of cats and dogs (https://goo.gl/zidmVH). ...
Article
The histologic evaluation of gastrointestinal (GI) biopsies is the standard for diagnosis of a variety of GI diseases (e.g., inflammatory bowel disease [IBD] and alimentary lymphoma [ALA]). The World Small Animal Veterinary Association (WSAVA) Gastrointestinal International Standardization Group proposed a reporting standard for GI biopsies consisting of a defined set of microscopic features. We compared the machine classification accuracy of free-text microscopic findings with those represented in the WSAVA format with a diagnosis of IBD and ALA. Unstructured free-text duodenal biopsy pathology reports from cats (n = 60) with a diagnosis of IBD (n = 20), ALA (n = 20), or normal (n = 20) were identified. Biopsy samples from these cases were then scored following the WSAVA guidelines to create a set of structured reports. Three supervised machine-learning algorithms were trained using the structured and then the unstructured reports. Diagnosis classification accuracy for the 3 algorithms was compared using the structured and unstructured reports. Using naive Bayes and neural networks, unstructured information-based models achieved higher diagnostic accuracy (0.90 and 0.88, respectively) compared to the structured information-based models (0.74 and 0.72, respectively). Results suggest that discriminating diagnostic information was lost using current WSAVA microscopic guideline features. Addition of free-text features (number of plasma cells) increased WSAVA auto-classification performance. The methodologies reported in our study represent a way of identifying candidate microscopic features for use in structured histopathology reports.
... It has been applied to many kinds of documents, including scientific abstracts [4,20] and newspaper archives [45]. Topic models have also served as a powerful technique to discover patterns of words and semantic structure in otherwise unstructured collections in different fields, such as natural language processing [5,21,[42][43][44], opinion mining [6,[29][30][31]33], information retrieval [12,40,45,46], topic segmentation [7,25] and collaborative filtering [13,14,23,27]. ...
Conference Paper
Full-text available
Determining appropriate statistical distributions for modeling text corpora is important for accurate estimation of numerical characteristics. Based on the validity of a test of the claim that the data conform to a Poisson distribution, we propose the Poisson decomposition model (PDM), a statistical model for count data of text corpora, which can directly capture each document's multidimensional numerical characteristics on topics. In PDM, each topic is represented as a parameter vector with a multidimensional Poisson distribution, which can be easily normalized to multinomial term probabilities, and each document is represented as measurements on topics and thereby reduced to a measurement vector on topics. We use gradient descent methods and a sampling algorithm for parameter estimation. We carry out extensive experiments on the topics produced by our models. The results demonstrate that our approach can extract more coherent topics and is competitive in document clustering using the PDM-based features, compared to PLSI and LDA.
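The normalization step mentioned in the abstract (a topic's Poisson parameter vector reduced to multinomial term probabilities) is a one-liner. The rates below are invented for illustration:

```python
# In PDM, a topic is a vector of Poisson rates, one per term.
# Dividing each rate by the total yields multinomial term
# probabilities (rates here are invented for illustration).

topic_rates = {"market": 4.0, "risk": 2.0, "bank": 2.0}

def to_multinomial(rates):
    """Normalize Poisson rates into a probability distribution."""
    total = sum(rates.values())
    return {term: r / total for term, r in rates.items()}

probs = to_multinomial(topic_rates)
```

This works because independent Poisson counts, conditioned on their sum, follow a multinomial whose probabilities are exactly the normalized rates.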
... Although unstructured (free-text) reports can be more fluid and explicit of findings, they are not easily converted into a structured, computable data format. 7 Recording standards will help ensure consistent reporting of patients' information (such as physicians' observations, diagnoses, and treatment) to improve data integration and uses. Standards have to be followed when recording patients' information in order to automate computer-based technology tools that can assist in a variety of hospital processes and secondary uses, such as reminders, procedures, and decision-making activities. ...
Article
Much effort has been invested in standardizing medical terminology for representation of medical knowledge, storage in electronic medical records, retrieval, reuse for evidence-based decision making, and for efficient messaging between users. We only focus on those efforts related to the representation of clinical medical knowledge required for capturing diagnoses and findings from a wide range of general to specialty clinical perspectives (e.g., internists to pathologists). Standardized medical terminology and the usage of structured reporting have been shown to improve the usage of medical information in secondary activities, such as research, public health, and case studies. The impact of standardization and structured reporting is not limited to secondary activities; standardization has been shown to have a direct impact on patient healthcare.
... For our experiments we use the dataset built by Branavan et al. [5] that contains camera reviews. This is a publicly available dataset that contains 12,586 camera reviews extracted from Epinions.com 1 . ...
Conference Paper
Social media have become a popular platform for people to share their opinions and emotions. Analyzing opinions that are posted on the web is very important, since they influence future decisions of organizations and people. Comparative opinion mining is a subfield of opinion mining that deals with identifying and extracting information that is expressed in a comparative form. Because a huge number of opinions are posted online every day, analyzing comparative opinions from a temporal perspective is an important application that needs to be explored. This study introduces the idea of integrating temporal elements into comparative opinion mining. Different types of results can be obtained from the temporal analysis, including trend analysis, competitive analysis, and burst detection. In our study we show that temporal analysis of comparative opinion mining provides more current and relevant information to users compared to standard opinion mining.
... To date, most of the relevant research work on aspect recognition has concentrated on the first sub-task. Many types of methods have been proposed for the extraction of aspects, including rule-based [14,20-22], supervised [23-25], and topic-model-based [8,26,27] methods. However, only a few studies have been performed on aspect clustering. ...
Article
Full-text available
Product aspect recognition is a key task in fine-grained opinion mining. Current methods primarily focus on the extraction of aspects from the product reviews. However, it is also important to cluster synonymous extracted aspects into the same category. In this paper, we focus on the problem of product aspect clustering. The primary challenge is to properly cluster and generalize aspects that have similar meanings but different representations. To address this problem, we learn two types of background knowledge for each extracted aspect based on two types of effective aspect relations: relevant aspect relations and irrelevant aspect relations, which describe two different types of relationships between two aspects. Based on these two types of relationships, we can assign many relevant and irrelevant aspects into two different sets as the background knowledge to describe each product aspect. To obtain abundant background knowledge for each product aspect, we can enrich the available information with background knowledge from the Web. Then, we design a hierarchical clustering algorithm to cluster these aspects into different groups, in which aspect similarity is computed using the relevant and irrelevant aspect sets for each product aspect. Experimental results obtained in both camera and mobile phone domains demonstrate that the proposed product aspect clustering method based on two types of background knowledge performs better than the baseline approach without the use of background knowledge. Moreover, the experimental results also indicate that expanding the available background knowledge using the Web is feasible.
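The similarity computation at the heart of the clustering above can be sketched with invented background-knowledge sets. The scoring function below is an assumption for illustration (overlap of relevant sets rewards similarity, overlap between one aspect's relevant set and the other's irrelevant set penalizes it); the paper's exact measure may differ.

```python
def aspect_similarity(rel_a, irr_a, rel_b, irr_b):
    """Similarity of two aspects from their relevant/irrelevant
    background-knowledge sets (invented scoring for illustration)."""
    support = len(rel_a & rel_b)
    conflict = len(rel_a & irr_b) + len(rel_b & irr_a)
    total = len(rel_a | rel_b)
    return (support - conflict) / total if total else 0.0

# Toy background knowledge: (relevant set, irrelevant set) per aspect.
photo = ({"picture", "image", "resolution"}, {"battery", "price"})
image_q = ({"picture", "image", "quality"}, {"battery"})
battery = ({"battery", "charge", "life"}, {"picture"})

sim_close = aspect_similarity(*photo, *image_q)  # synonymous aspects
sim_far = aspect_similarity(*photo, *battery)    # unrelated aspects
```

A hierarchical clustering using such a measure would merge "photo" and "image quality" long before it ever considered merging either with "battery", which is the behavior the paper reports.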
... (Jiménez-Zafra et al., 2015) proposed a syntactic approach for identifying the words that modify each aspect. (Branavan et al., 2009;He et al., 2012;Mei et al., 2007) used topic or category information. (Saias, 2015) used a 3-class classifier and some handcrafted features to perform ABSA. ...
... In recent years, it has drawn a lot of attention. For example, (Branavan et al., 2009; He et al., 2012; Mei et al., 2007) used topic or category information. (Lin and He, 2009; Jo and Oh, 2011) presented LDA-based models, which incorporate aspect and sentiment analysis together to model sentiments towards different aspects. ...
... Titov and McDonald (2008) proposed the multi-grain topic models to discover global and local aspects. Branavan et al. (2008) proposed a method which first clustered the key-phrases in Pros and Cons into some aspect categories based on distributional similarity, then built a topic model modeling the topics or aspects. Zhao et al. (2010) proposed the MaxEnt-LDA (a Maximum Entropy and LDA combination) hybrid model to jointly discover both aspect words and aspectspecific opinion words, which can leverage syntactic features to separate aspects and sentiment words. ...
... In Korea, a text and data mining technique has been proposed to read and classify large volumes of online petitions [22]. Text classification is one of the most important text mining techniques; it automatically assigns unseen digital documents to suitable predefined categories [2,16]. By using machine learning approaches, user-friendly results can be delivered to meet users' requirements. ...
Article
Full-text available
In this paper, a novel dimensionality reduction algorithm named locality alignment discriminant analysis (LADA) for visualizing regional English is proposed. In the LADA algorithm, the proposed intrinsic graph and penalty graph measure the similarities between each pair of textual slices, which better characterizes intra-class compactness and inter-class separability; the projection matrix obtained by the proposed method is orthogonal, which eliminates the redundancy between different projection directions and is more effective for preserving the intrinsic geometry and improving the discriminating ability. To evaluate the performance of the algorithm, a regional written English corpus is designed and collected. Articles are split into slices and then transformed into 140-dimensional data points using 140 text style markers. Finally, we attempt to recognize the variations present in regional written English with the proposed LADA. The similarity among different types of English can be observed in the data plots. The results of visualization and numerical comparison indicate that LADA outperforms other existing algorithms in handling regional English data, as it better preserves the local discriminative information embedded in the data, making it suitable for pattern classification.
... Some existing works employing these models include the extraction of global aspects (such as the brand of a product) and local aspects (such as the properties of a product) [99], the extraction of key phrases [10], the rating of multiple aspects [106], and the summarization of aspects and sentiments [61]. [113] employed maximum entropy to train a switch variable based on the POS tags of words and used it to separate aspect and sentiment words. ...
Conference Paper
Full-text available
Hitherto, sentiment analysis has been mainly based on algorithms relying on the textual representation of online reviews and microblogging posts. Such algorithms are very good at retrieving texts, splitting them into parts, checking the spelling, and counting their words. But when it comes to interpreting sentences and extracting opinionated information, their capabilities are known to be very limited. Current approaches to sentiment analysis are mainly based on supervised techniques relying on manually labeled samples, such as movie or product reviews, where the overall positive or negative attitude was explicitly indicated. However, opinions do not occur only at document level, nor are they limited to a single valence or target. Contrary or complementary attitudes toward the same topic or multiple topics can be present across the span of a review. In order to overcome this and many other issues related to sentiment analysis, we propose a novel framework, termed the concept-level sentiment analysis (CLSA) model, which takes into account all the natural-language-processing tasks necessary for extracting opinionated information from text, namely: microtext analysis, semantic parsing, subjectivity detection, anaphora resolution, sarcasm detection, topic spotting, aspect extraction, and polarity detection.
... Product aspect categorization is a very challenging task for aspect-oriented opinion mining in real applications, since people often use different words to describe the same aspect in reviews. Some lexicon-based analysis methods [38, 41–43] have been employed in product-feature (PF) categorization, such as synonym and association rules [42], lexical-similarity statistics [38], and lexicon-based clustering [41]. Product-feature categorization based on lexical comparison is usually not comprehensive enough to capture the underlying semantic distribution of the various product-feature terms. ...
Article
Full-text available
When providing customers with a personalized shopping experience, there is tremendous value in understanding and applying social data shared by those consumers. Understanding this data and how best to generate business value from it is the core challenge of many businesses today. Friends, family, and experts alike influence consumers in their shopping preferences and purchase decisions. Yet, the ability of a business to analyze data on such influence, and recommend products and services that best respond to its customers' needs or aspirations, is typically limited by fragmented capabilities; a business relies heavily on the use of spreadsheets, manual market analysis, isolated software, or reactive messaging. This paper offers a solution to this fragmentary approach by introducing a social analytics platform for smarter commerce. This platform provides a holistic understanding of the customer by making use of social and enterprise data to present recommendations and related opinions, and to isolate influencers so as to ultimately provide customers with a personalized shopping experience. The functionality described in this paper is in the context of the retail industry but can be applied to other industries. The paper describes the architecture of the social analytics platform and the various analytics components currently implemented as part of the platform.
... One of the basic and most widely used models is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). LDA can learn a predefined number of topics and has been widely applied in its extended forms in sentiment analysis and many other tasks (Mei et al., 2007; Branavan et al., 2008; Lin and He, 2009; Zhao et al., 2010; Wang et al., 2010; Brody and Elhadad, 2010; Jo and Oh, 2011; Moghaddam and Ester, 2011; Sauper et al., 2011; Mukherjee and Liu, 2012; He et al., 2012). The Dirichlet Processes Mixture (DPM) model is a non-parametric extension of LDA (Teh et al., 2006), which can estimate the number of topics inherent in the data itself. ...
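As a concrete reference point for the LDA family discussed above, here is a toy collapsed Gibbs sampler for vanilla LDA in pure Python. `lda_gibbs` and its hyperparameter defaults are illustrative only; a real system would use an optimized library, and the DPM extension mentioned in the snippet would additionally let the number of topics grow with the data.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of
    token lists; returns doc-topic and topic-word count tables."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndk = [[0] * K for _ in docs]                  # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]     # topic-word counts
    nk = [0] * K                                   # topic totals
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                # Remove the token, then resample its topic from the
                # collapsed conditional distribution.
                k = z[di][wi]
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[di][t] + alpha) *
                           (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[di][wi] = k
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```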
Conference Paper
Full-text available
This paper proposes a technique to leverage topic-based sentiments from Twitter to help predict the stock market. We first utilize a continuous Dirichlet Process Mixture model to learn the daily topic set. Then, for each topic we derive its sentiment according to its opinion-word distribution to build a sentiment time series. We then regress the stock index against the Twitter sentiment time series to predict the market. Experiments on the real-life S&P 100 Index show that our approach is effective and performs better than existing state-of-the-art non-topic-based methods.
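The pipeline this abstract describes, build a daily sentiment series from opinion words and then regress the index on it, can be illustrated with a minimal stand-in. `daily_sentiment` and `fit_line` are hypothetical helpers: the former scores each day by the balance of positive and negative opinion words, the latter fits ordinary least squares in closed form. The paper's actual topic and regression models are far richer.

```python
def daily_sentiment(tweets_by_day, pos, neg):
    """Per-day sentiment score: (pos - neg) / total opinion words."""
    series = []
    for tweets in tweets_by_day:
        p = sum(w in pos for t in tweets for w in t)
        q = sum(w in neg for t in tweets for w in t)
        series.append((p - q) / max(p + q, 1))
    return series

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
         sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b
```

In use, one would regress the index change on day t against the sentiment score of day t-1 and inspect the fitted slope.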
Chapter
This chapter introduces the main topics and objectives discussed in subsequent sections, covering various aspects of opinion mining and sentiment analysis. It addresses different challenges and proposes novel methods. Section 3.1 highlights the need to filter out irrelevant information and personal attacks in online discussions to focus on evaluative opinion sentences. It proposes an unsupervised method utilizing natural language processing techniques and machine learning algorithms to automatically filter and classify sentences with evaluative opinions. This section calls for further research to explore more precise and efficient methods for identifying evaluative opinions and their application in sentiment polarity analysis. Section 3.3 presents a novel subproblem in opinion mining, focusing on grouping feature expressions in product reviews. It argues for the necessity of user supervision in practical applications and proposes an EM formulation enhanced with soft constraints to achieve accurate opinion summaries. This section showcases the competence and generality of the proposed method through experimental results from various domains and languages. Section 3.1 introduces the use of topic modeling, specifically the LDA method, for sentiment mining. It extends the LDA method to handle large-scale constraints and proposes two methods for automatically extracting constraints to guide the topic modeling process. The constrained-LDA model and extracted constraints are then applied to group product features, demonstrating superior performance compared to other methods. Section 3.3 addresses the challenge of grouping synonyms in opinion mining, proposing an efficient method based on similarity measurement. Experimental results from different domains validate the effectiveness of the method. Section 3.3 focuses on feature extraction for sentiment classification and compares the impact of different types of features through experimental analysis. 
This section provides an in-depth study of all feature types and discusses key problems associated with feature extraction algorithms. Section 3.2 explores the use of unsupervised learning methods for sentiment classification, emphasizing their advantages in classifying opinionated texts at different levels and for feature-based opinion mining. This section presents an empirical investigation of unsupervised sentiment classification of Chinese reviews and proposes an algorithm to remove domain-specific sentiment noise words. Section 3.6 introduces the use of substring-group features for sentiment classification through a transductive learning-based algorithm. Experimental results in multiple languages demonstrate the effectiveness of the algorithm and highlight the superiority of the “tfidf-c” approach for term weighting. Therefore, this chapter provides a comprehensive overview of various aspects of opinion mining and sentiment analysis, and proposes several innovative methods to address different challenges. From filtering and classifying evaluative opinion sentences to grouping product features, and utilizing topic modeling and similarity measurement for sentiment mining, this chapter covers a wide range of topics and techniques. The practicality and superiority of the proposed methods are demonstrated through empirical experiments. These studies lay a foundation for further advancements in opinion mining and sentiment analysis, and hold significant value in practical applications.
Article
User-generated content, particularly online product reviews by customers, provide marketers with rich data of customer evaluations of product attributes. This study proposes, benchmarks, and validates a new approach for inferring attribute-level evaluations from user-generated content. Moreover, little is known about whether and when insights from product reviews gained in such a way are consistent with traditional research methods, such as conjoint analysis and satisfaction driver analysis. To provide first insights into this question, the authors apply their approach to a dataset with almost one million product reviews from 52 product categories and run conjoint and satisfaction driver analyses for these categories. Results indicate that the consistency between methods largely varies across product categories. Initial exploratory analyses suggest that consistency might be higher for categories characterized by low experience qualities, high hedonic value, and high customer willingness to post online reviews—but further work will be necessary to validate these findings.
Chapter
Economic research examining how business interventions may empower managers and affect their ability to make decisions more effectively focuses largely on test scores, although the interventions may be considered alongside several other outcomes. This paper examines how a business intervention based on a natural language user interface (NLUI) may affect business decisions and routines, and therefore managers' economic behaviour, pointing to the rule-governed character of natural language. The NLUI is designed to implement a real-effort lab experiment in which subjects play the role of managers defining appropriate business research actions, while the experimenter provides task performance measurement, including heuristics and test scores. By 'naturally' exploiting the NLUI to discover patterns in business data and identify appropriate business actions for finding target markets, we measure the effects of the intervention on subjects' decision-making and their approaches to collecting information from market research. We aim to measure the NLUI's effects on a dual managerial outcome, namely the ability to make predictions and to categorise. We find that the business intervention positively affects subjects' ability to categorise, as well as their ability to learn novel categories and recognise new instances within them. We thereby find statistically significant effects of the NLUI on subjects' performance, which allows us to continue this experimental activity by extending its scope and applications and diversifying business operations.
Article
The real-time and dissemination characteristics of network information make net-mediated public opinion an increasingly important resource for food safety early warning, but data growing at petabyte (PB) scale also brings great difficulties to the research and judgment of network public opinion, especially in extracting the event roles of network public opinion from these data and analyzing the sentiment tendency of public opinion comments. First, this article takes public opinion on food safety networks as its research point, and a BLSTM-CRF model for automatically labeling event roles is proposed by organically combining a BLSTM with a conditional random field. Second, an Attention mechanism based on vocabulary from the food safety domain is introduced; distance-related sequence semantic features are extracted by the BLSTM, and sentiment classification of the sequence semantic features is realized using a CNN. An Att-BLSTM-CNN model for analyzing public opinion and sentiment tendency in the food safety domain is proposed. Finally, based on the time series, this article combines the role extraction of food safety events with the analysis of sentiment tendency, and constructs a net-mediated public opinion early warning model for the food safety domain according to the heat of each event and the intensity of public sentiment toward food safety public opinion events.
Article
Nowadays, the Internet has penetrated into all aspects of people's lives. A large number of online customer reviews have accumulated on product forums, and these are valuable resources to be analyzed. However, customer reviews are unstructured textual data containing many ambiguities, so analyzing them is a challenging task. At present, effective deep semantic or fine-grained analysis of customer reviews is rare in the existing literature, and the analysis quality of most studies is also low. Therefore, in this paper a fine-grained opinion mining method is introduced to extract detailed semantic information about opinions, from multiple perspectives and aspects, from Chinese automobile reviews. The conditional random field (CRF) model is used in this method, in which semantic roles are divided into two groups. One group relates to the objects being reviewed, which includes the roles of the manufacturer, the brand, the type, and the aspects of cars. The other group concerns the opinions about those objects, which includes the sentiment description, the aspect value, the conditions of the opinions, and the sentiment tendency. The overall framework of the method includes three major steps. The first step distinguishes the relevant sentences in the reviews from the irrelevant ones. In the second step, the relevant sentences are further classified into different aspects. In the third step, fine-grained semantic roles are extracted from the sentences of each aspect. The data used in the training process are manually annotated with fine-grained semantic roles. The features used in this CRF model include basic word features, part-of-speech (POS) features, position features and dependency syntactic features. Different combinations of these features are investigated. Experimental results are analyzed and future directions are discussed.
Conference Paper
Aspect-based Sentiment Analysis (ABSA) aggregates user opinions at the aspect level. It therefore offers a detailed analysis of a product by highlighting its strong and weak aspects. Potential customers and manufacturers highly value such analysis when making profitable future decisions. However, existing models do not provide reasons for an aspect being praised or criticized. Such information may help users assess whether the reasons mentioned by reviewers for or against an aspect of a product align with their priorities. We propose an approach that weighs implicit aspect terms, going beyond implying aspects and suggesting their polarity. The proposed approach makes use of linguistic associations to identify prominent implicit aspect terms for an aspect. These are presented as possible reasons for an aspect attaining its polarity score. The results are evaluated on online Twitter data and indicate effective exploration of opinion reasons.
Article
We present a novel statistical analysis of legislative rhetoric in the US Senate that sheds a light on hidden patterns in the behaviour of Senators as a function of their time in office. Using natural language processing, we create a novel comprehensive data set based on the speeches of all Senators who served on the US Senate Committee on Energy and Natural Resources in 2001–2011. We develop a new measure of congressional speech, based on Senators’ attitudes towards the dominant energy interests. To evaluate intrinsically dynamic formation of groups among Senators, we adopt a model‐free unsupervised space–time data mining algorithm that has been proposed in the context of tracking dynamic clusters in environmental georeferenced data streams. Our approach based on a two‐stage hybrid supervised–unsupervised learning methodology is innovative and data driven and transcends conventional disciplinary borders. We discover that legislators become much more alike after the first few years of their term, regardless of their partisanship and campaign promises.
Article
Full-text available
With the increasing popularity of online e-commerce services, a large volume of online reviews have been constantly generated by users. In this paper, we propose to study the problem of inferring functional labels using online review text. Functional labels summarize and highlight the main characteristics of a business, which can serve as bridges between the consumption needs and the service functions. We consider two kinds of semantic similarities: lexical similarity and embedding similarity, which characterize the relatedness in two different perspectives. To measure the lexical similarity, we use the classic probabilistic ranking formula, i.e., BM25; to measure the embedding similarity, we propose an extended embedding model which can incorporate weak supervised information derived from review text. These two kinds of similarities compensate each other and capture the semantic relatedness in a more comprehensive way. We construct a test collection consisting of four different domains based on a Yelp dataset and consider multiple baseline methods for comparison. Extensive experiments have shown that the proposed methods are very effective.
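The lexical-similarity side of this method is the classic BM25 formula, which can be stated compactly. The sketch below scores tokenized documents against a query; the `+1` inside the log is the common Lucene-style smoothing that keeps IDF nonnegative, and the defaults k1=1.5, b=0.75 are conventional choices rather than the paper's settings.

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 ranking over tokenized docs (lists of words)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {}                                   # document frequencies
    for d in docs:
        for w in set(d):
            df[w] = df.get(w, 0) + 1
    scores = []
    for d in docs:
        s = 0.0
        for w in query:
            tf = d.count(w)
            if tf == 0:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += (idf * tf * (k1 + 1) /
                  (tf + k1 * (1 - b + b * len(d) / avgdl)))
        scores.append(s)
    return scores
```

The embedding-similarity component would then be combined with these scores to capture relatedness that pure term overlap misses.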
Conference Paper
Aspect-level sentiment analysis or opinion mining consists of several core sub-tasks: aspect extraction, opinion identification, polarity classification, and separation of general and aspect-specific opinions. Various topic models have been proposed by researchers to address some of these sub-tasks. However, there is little work on modeling all of them together. In this paper, we first propose a holistic fine-grained topic model, called the JAST (Joint Aspect-based Sentiment Topic) model, that can simultaneously model all of above problems under a unified framework. To further improve it, we incorporate the idea of lifelong machine learning and propose a more advanced model, called the LAST (Lifelong Aspect-based Sentiment Topic) model. LAST automatically mines the prior knowledge of aspect, opinion, and their correspondence from other products or domains. Such knowledge is automatically extracted and incorporated into the proposed LAST model without any human involvement. Our experiments using reviews of a large number of product domains show major improvements of the proposed models over state-of-the-art baselines.
Article
Full-text available
This paper applies the stereotype change theory to help bridge a major literature gap on co-branding partner selection: why both identical and highly different brand pairs often fail. We argue that, given that a primary goal of establishing a co-branding alliance is to positively revise consumers’ beliefs about important attributes of the allying brands, the case of no belief-revision can lead to a failure of the alliance. We show that both an identical and a highly incongruent partnership in terms of attribute-level difference can fail due to the lack of belief-revision. We report that a moderately incongruent brand pair is a promising decision on co-branding partner selection. In doing so, our research contributes to the explanation of why the two “extreme” types of co-branding alliances may fail from the perspective of consumer evaluation. For brand managers, we offer a normative guideline for co-branding partner selection.
Conference Paper
Feedback evaluation is a necessary part of any institute's efforts to maintain and monitor the academic quality of its system. Traditionally, a questionnaire-based system is used to evaluate the performance of an institute's teachers. Here, we propose an automatic evaluation system based on sentiment analysis, which is more versatile and meaningful than the existing system. In our proposed system, feedback is collected in the form of running text, and sentiment analysis is performed to identify important aspects along with their orientations using supervised and semi-supervised machine learning techniques.
Chapter
Opinion mining or sentiment analysis is the computational study of people’s opinions, appraisals, attitudes, and emotions toward entities such as products, services, organizations, individuals, events, and their different aspects. It has been an active research area in natural language processing and Web mining in recent years. Researchers have studied opinion mining at the document, sentence and aspect levels. Aspect-level (called aspect-based opinion mining) is often desired in practical applications as it provides the detailed opinions or sentiments about different aspects of entities and entities themselves, which are usually required for action. Aspect extraction and entity extraction are thus two core tasks of aspect-based opinion mining. In this chapter, we provide a broad overview of the tasks and the current state-of-the-art extraction techniques.
Article
Full-text available
Traditional machine learning techniques follow a single-shot learning approach. This includes all supervised, semi-supervised, transfer learning, hybrid, and unsupervised techniques having a single target domain known prior to analysis. Learning from one task is not carried over to the next task; therefore, these techniques cannot scale up to big data with many unknown domains. Lifelong learning models are tailored for big data, with a knowledge module that is maintained automatically. The knowledge base grows with experience, and knowledge from previous tasks helps in the current task. This paper surveys topic models, leading the discussion to knowledge-based topic models and lifelong learning models. The issues and challenges in learning knowledge, and in its abstraction, retention, and transfer, are elaborated. The state-of-the-art models store word pairs as knowledge having positive or negative correlations, called must-links and cannot-links. The need for innovative ideas from other research fields is stressed, in order to learn more varieties of knowledge, improve accuracy, and reveal more semantic structures within the data.
Book
Full-text available
It has now been 50 years since the publication of Luhn's seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain and genre specific summarization and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and deeper semantic understanding of language that would be necessary for future advances in the field.
Article
There have been many assistant applications on mobile devices, which could help people obtain rich Web content such as user-generated data (e.g., reviews, posts, blogs, and tweets). However, online communities and social networks are expanding rapidly and it is impossible for people to browse and digest all the information via simple search interface. To help users obtain information more efficiently, both the interface for data access and the information representation need to be improved. An intuitive and personalized interface, such as a dialogue system, could be an ideal assistant, which engages a user in a continuous dialogue to garner the user's interest and capture the user's intent, and assists the user via speech-navigated interactions. In addition, there is a great need for a type of application that can harvest data from the Web, summarize the information in a concise manner, and present it in an aggregated yet natural way such as direct human dialogue. This thesis, therefore, aims to conduct research on a universal framework for developing speech-based interface that can aggregate user-generated Web content and present the summarized information via speech-based human-computer interaction. To accomplish this goal, several challenges must be met. Firstly, how to interpret users' intention from their spoken input correctly? Secondly, how to interpret the semantics and sentiment of user-generated data and aggregate them into structured yet concise summaries? Lastly, how to develop a dialogue modeling mechanism to handle discourse and present the highlighted information via natural language? This thesis explores plausible approaches to tackle these challenges. We will explore a lexicon modeling approach for semantic tagging to improve spoken language understanding and query interpretation. We will investigate a parse-and-paraphrase paradigm and a sentiment scoring mechanism for information extraction from unstructured user-generated data. 
We will also explore sentiment-involved dialogue modeling and corpus-based language generation approaches for dialogue and discourse. Multilingual prototype systems in multiple domains have been implemented for demonstration.
Chapter
Aspect recognition and clustering are important for many sentiment analysis tasks. To date, many algorithms for recognizing product aspects have been explored; however, limited work has been done on clustering the recognized aspects. In this paper, we focus on the problem of product aspect clustering. Two effective aspect relations, the relevant aspect relation and the irrelevant aspect relation, are proposed to describe the relationships between two aspects. Using these two relations, we can collect many relevant and irrelevant aspects into two different sets that serve as background knowledge describing each product aspect. Then, a hierarchical clustering algorithm is designed to cluster these aspects into different groups, in which aspect similarity is computed from the relevant aspect set and irrelevant aspect set of each product aspect. Experimental results in the camera domain demonstrate that the proposed method performs better than a baseline that does not use the two aspect relations, and also show that the two aspect relations are effective.
Conference Paper
In Chap. 9, we studied the extraction of structured data from Web pages. The Web also contains a huge amount of information in unstructured texts. Analyzing these texts is of great importance as well and perhaps even more important than extracting structured data because of the sheer volume of valuable information of almost any imaginable type contained in text. In this chapter, we only focus on mining opinions which indicate positive or negative sentiments. The task is technically challenging and practically very useful. For example, businesses always want to find public or consumer opinions on their products and services. Potential customers also want to know the opinions of existing users before they use a service or purchase a product.
Conference Paper
Full-text available
Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds or even thousands. This makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. Our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques.
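The three-step pipeline (mine commented-on features, decide each opinion sentence's polarity, summarize the results) can be compressed into a minimal stand-in. `summarize_reviews` below assumes the product features and opinion lexicons are already given, and assigns each sentence's polarity by a simple majority of lexicon hits, a deliberate simplification of the paper's techniques.

```python
def summarize_reviews(sentences, features, pos, neg):
    """Count positive and negative opinion sentences per product
    feature. sentences: list of tokenized sentences; pos/neg: sets
    of opinion words."""
    summary = {f: {"pos": 0, "neg": 0} for f in features}
    for words in sentences:
        for f in features:
            if f in words:
                p = sum(w in pos for w in words)
                q = sum(w in neg for w in words)
                if p > q:
                    summary[f]["pos"] += 1
                elif q > p:
                    summary[f]["neg"] += 1
    return summary
```

The output is exactly the feature-based summary format the paper targets: per-feature counts of positive and negative opinion sentences rather than extracted or rewritten sentences.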
Conference Paper
Full-text available
We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words' distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset.
Conference Paper
Full-text available
This paper describes the PASCAL Network of Excellence Recognising Textual Entailment (RTE) Challenge benchmark. The RTE task is defined as recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other. This application-independent task is suggested as capturing major inferences about the variability of semantic expression which are commonly needed across multiple applications. The Challenge has raised noticeable attention in the research community, attracting 17 submissions from diverse groups, suggesting the generic relevance of the task.
Conference Paper
Full-text available
In this paper, we present a system that automatically extracts the pros and cons from online reviews. Although many approaches have been developed for extracting opinions from text, our focus here is on extracting the reasons for the opinions, which may themselves be in the form of either fact or opinion. Leveraging online review sites with author-generated pros and cons, we propose a system for aligning the pros and cons to their sentences in review texts. A maximum entropy model is then trained on the resulting labeled set to subsequently extract pros and cons from online review sites that do not explicitly provide them. Our experimental results show that our resulting system identifies pros and cons with 66% precision and 76% recall.
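The alignment step described above can be sketched very simply: match each author-written pro/con phrase to the review sentence with the highest word overlap, yielding labeled sentences for training. A real system would use richer features than raw overlap; the phrase and review text below are invented.

```python
# Toy alignment of author-written pros/cons to review sentences by
# word overlap; the example phrase and sentences are illustrative.
def align(phrases, sentences):
    pairs = {}
    for ph in phrases:
        pw = set(ph.lower().split())
        best = max(sentences,
                   key=lambda s: len(pw & set(s.lower().split())))
        pairs[ph] = best
    return pairs

pros = ["great battery life"]
sents = ["The screen is dim.", "Battery life is great for travel."]
print(align(pros, sents))
# {'great battery life': 'Battery life is great for travel.'}
```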
Conference Paper
Full-text available
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
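The annealed-Gibbs idea above can be sketched on a toy model: binary labels with per-position (unary) costs plus a non-local potential penalizing any pair of positions that disagree. Resampling one position at a time while lowering the temperature drives the sampler toward the minimum-energy labeling. The potentials and cooling schedule below are invented for illustration, not the paper's.

```python
# Toy annealed Gibbs sampling for MAP inference with a non-local
# label-consistency potential; potentials and schedule are made up.
import math, random

random.seed(0)

def energy(labels, unary, pair_weight):
    """Lower is better: unary cost plus a penalty for every pair of
    positions that disagree (a stand-in for non-local consistency)."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += pair_weight * sum(labels[i] != labels[j]
                           for i in range(len(labels))
                           for j in range(i + 1, len(labels)))
    return e

def annealed_gibbs(unary, pair_weight, sweeps=200):
    n = len(unary)
    labels = [random.randrange(2) for _ in range(n)]
    for s in range(sweeps):
        temp = max(0.05, 2.0 * (1 - s / sweeps))  # cooling schedule
        for i in range(n):
            weights = []
            for l in (0, 1):
                labels[i] = l
                weights.append(math.exp(-energy(labels, unary, pair_weight) / temp))
            labels[i] = 0 if random.random() < weights[0] / sum(weights) else 1
    return labels

# Unary costs weakly favor label 1 only at the last position; the
# consistency potential pulls all positions toward agreement.
unary = [[0.0, 0.4], [0.0, 0.4], [0.5, 0.0]]
print(annealed_gibbs(unary, pair_weight=1.0))
```

At high temperature the sampler explores; near the floor it behaves like greedy decoding, so the final sweep almost always returns a fully consistent labeling.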
Article
We present a multi-document summarizer, MEAD, which generates summaries using cluster centroids produced by a topic detection and tracking system. We describe two new techniques, a centroid-based summarizer, and an evaluation scheme based on sentence utility and subsumption. We have applied this evaluation to both single and multiple document summaries. Finally, we describe two user studies that test our models of multi-document summarization.
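The centroid-based scoring described above can be sketched in a few lines: build a term-frequency centroid over the document cluster, then rank sentences by how much centroid weight their words carry. MEAD itself also uses positional and overlap features and TF-IDF weighting; this toy version, with invented sentences, keeps only a raw-count centroid score.

```python
# Minimal centroid-based extractive ranking; the sentences are
# illustrative and the scoring is a simplification of MEAD's.
from collections import Counter

def centroid(sentences):
    """Summed term-frequency vector over all sentences."""
    c = Counter()
    for s in sentences:
        c.update(s.lower().split())
    return c

def rank(sentences):
    c = centroid(sentences)
    scored = [(sum(c[w] for w in s.lower().split()), s) for s in sentences]
    return [s for score, s in sorted(scored, reverse=True)]

docs = [
    "the camera has a sharp lens",
    "battery life is short",
    "a sharp lens and a sharp screen",
]
print(rank(docs)[0])  # a sharp lens and a sharp screen
```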
Article
Many intuitively appealing methods have been suggested for clustering data, however, interpretation of their results has been hindered by the lack of objective criteria. This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data. These criteria depend on a measure of similarity between two different clusterings of the same set of data; the measure essentially considers how each pair of data points is assigned in each clustering.
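The pair-based similarity measure described above can be illustrated with its best-known instantiation, the Rand index: over all pairs of points, the fraction on which two clusterings agree (both place the pair together, or both place it apart). The label vectors below are invented examples.

```python
# Pair-counting agreement between two clusterings (Rand index).
from itertools import combinations

def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

a = [0, 0, 1, 1]
b = [1, 1, 2, 2]  # same partition, different label names
c = [0, 1, 0, 1]
print(rand_index(a, b))  # 1.0
print(rand_index(a, c))
```

Note the measure is invariant to renaming clusters, which is exactly why it compares pair assignments rather than labels directly.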
Conference Paper
Topic models, such as latent Dirichlet allocation (LDA), have been an effective tool for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about sports is more likely to also be about health than international finance. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets.
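The logistic normal draw that distinguishes the CTM from LDA can be sketched directly: sample eta ~ N(mu, Sigma) and push it through the softmax, so an off-diagonal entry of Sigma induces correlated topic proportions. The mu, Sigma, and sample count below are illustrative choices, not values from the paper.

```python
# Logistic-normal topic proportions: Gaussian draw, then softmax.
import numpy as np

rng = np.random.default_rng(0)

def logistic_normal(mu, Sigma, n):
    """Draw n topic-proportion vectors on the simplex."""
    eta = rng.multivariate_normal(mu, Sigma, size=n)
    e = np.exp(eta - eta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])  # topics 0 and 1 tend to co-occur
theta = logistic_normal(mu, Sigma, 5000)
# The off-diagonal covariance shows up as positive correlation
# between the log proportions of topics 0 and 1.
print(np.corrcoef(np.log(theta[:, 0]), np.log(theta[:, 1]))[0, 1])
```

A Dirichlet, by contrast, has no free covariance parameters, which is the limitation the abstract describes.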
Conference Paper
Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on suboptimal search procedures.
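The unigram segmentation objective can be sketched with a small dynamic program: given word log-probabilities (which the paper learns with Bayesian inference; here they are simply fixed by hand), choose the split of an unspaced string that maximizes the total log-probability. The lexicon and string below are invented.

```python
# Viterbi segmentation under a hand-fixed unigram word model.
import math

def segment(text, logp, unk=-20.0):
    """DP over split points; logp maps known words to log-probs,
    anything else falls back to a flat unknown-word penalty."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - 8), j):  # cap word length at 8
            score = best[i] + logp.get(text[i:j], unk)
            if score > best[j]:
                best[j], back[j] = score, i
    words, j = [], n
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]

logp = {"the": -2.0, "dog": -3.0, "do": -4.0,
        "gs": -9.0, "dogs": -3.5, "s": -6.0}
print(segment("thedogs", logp))  # ['the', 'dogs']
```

The paper's point is precisely that the probabilities should be inferred jointly with the segmentation (e.g. by Gibbs sampling) rather than fixed as here.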
Conference Paper
Consumers have to often wade through a large number of on-line reviews in order to make an informed product choice. We introduce OPINE, an unsupervised, high-precision information extraction system which mines product reviews in order to build a model of product features and their evaluation by reviewers.
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
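LDA's generative story as described above is short enough to write out: draw per-document topic proportions from a Dirichlet, then for each word draw a topic from those proportions and a word from that topic's distribution. The tiny vocabulary, topic-word table, and hyperparameter below are illustrative.

```python
# The LDA generative process on a toy vocabulary.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["lens", "zoom", "battery", "charge"]
K, alpha = 2, 0.5
# Topic-word distributions beta[k]: topic 0 ~ optics, topic 1 ~ power.
beta = np.array([[0.45, 0.45, 0.05, 0.05],
                 [0.05, 0.05, 0.45, 0.45]])

def generate_doc(n_words):
    theta = rng.dirichlet([alpha] * K)          # document topic mix
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)              # topic assignment
        w = rng.choice(len(vocab), p=beta[z])   # word from topic z
        words.append(vocab[w])
    return theta, words

theta, doc = generate_doc(8)
print(theta, doc)
```

Inference (variational EM in the paper) runs this story in reverse, recovering theta and beta from observed documents.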
Article
Similarity is an important and widely used concept. Previous definitions of similarity are tied to a particular application or a form of knowledge representation. We present an information-theoretic definition of similarity that is applicable as long as there is a probabilistic model. We demonstrate how our definition can be used to measure the similarity in a number of different domains.
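One instantiation of this information-theoretic similarity, for objects described by feature sets with independent feature probabilities, is sim(A, B) = 2 log P(common) / (log P(A) + log P(B)): the information shared, normalized by the information needed to describe each object. The feature probabilities below are made up for illustration.

```python
# Information-theoretic (Lin-style) similarity over feature sets.
import math

def lin_similarity(feats_a, feats_b, p):
    """p maps each feature to its probability; features are assumed
    independent, so log P(S) = sum of log p(f) over f in S."""
    def log_p(feats):
        return sum(math.log(p[f]) for f in feats)
    common = feats_a & feats_b
    if not common:
        return 0.0
    return 2 * log_p(common) / (log_p(feats_a) + log_p(feats_b))

p = {"animal": 0.5, "barks": 0.1, "meows": 0.1}
print(lin_similarity({"animal", "barks"}, {"animal", "meows"}, p))
print(lin_similarity({"animal", "barks"}, {"animal", "barks"}, p))  # 1.0
```

Identical descriptions score 1, and sharing only a common (high-probability, low-information) feature yields a low score, as the definition intends.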
Article
Statistical significance testing of differences in values of metrics like recall, precision and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally-intensive randomization tests.
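The computationally-intensive randomization test mentioned above can be sketched as an approximate paired test: repeatedly swap the two systems' per-item outputs at random and count how often the shuffled difference in accuracy is at least as large as the observed one. The per-item correctness vectors below are invented illustrations.

```python
# Approximate (stratified shuffling) paired randomization test.
import random

random.seed(0)

def randomization_test(a, b, trials=10000):
    """Two-sided p-value for the difference in means of paired a, b."""
    observed = abs(sum(a) - sum(b)) / len(a)
    hits = 0
    for _ in range(trials):
        diff = 0
        for x, y in zip(a, b):
            if random.random() < 0.5:
                x, y = y, x             # swap this item's pair
            diff += x - y
        if abs(diff) / len(a) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)    # add-one smoothing

sys_a = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]  # per-item correctness
sys_b = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
print(randomization_test(sys_a, sys_b))
```

Because pairs are shuffled together, any between-item correlation is preserved, which is exactly the independence assumption this test avoids.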
Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis. Texts in Statistical Science. Chapman & Hall/CRC, 2nd edition.
Graham Vickery and Sacha Wunsch-Vincent. 2007. Participative Web and User-Created Content: Web 2.0, Wikis and Social Networking. OECD Publishing.
Bruce Sterling. 2005. Order out of chaos: What is the best way to tag, bag, and sort data? Give it to the unorganized masses. http://www.wired.com/wired/archive/13.04/view.html?pg=4.