Publications

  • Aditya Joshi, Pushpak Bhattacharyya, Abhijit Mishra
    ABSTRACT: (To be published)
    Association For Computational Linguistics Conference 2014; 01/2014
  • Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 08/2013
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
    ABSTRACT: In this paper, we propose a weakly supervised system, YouCat, for categorizing YouTube videos into different genres like Comedy, Horror, Romance, Sports and Technology. The system takes a YouTube video URL as input and gives it a belongingness score for each genre. The key aspects of this work can be summarized as: (1) Unlike other genre identification works, which are mostly supervised, this system is mostly unsupervised, requiring no labeled data for training. (2) The system can easily incorporate new genres without requiring labeled data for the genres. (3) YouCat extracts information from the video title, meta description and user comments (which together form the video descriptor). (4) It uses Wikipedia and WordNet for concept expansion. (5) The proposed algorithm, with a time complexity of O(|W|) (where |W| is the number of words in the video descriptor), is efficient enough to be deployed on the web for real-time video categorization. Experiments were performed on real-world YouTube videos, where YouCat achieves an F-score of 80.9% without using any labeled training set, compared to the supervised multiclass SVM F-score of 84.36% for single-genre prediction. YouCat performs better for multi-genre prediction, with an F-score of 90.48%. Weak supervision in the system arises from the use of the manually constructed WordNet and the description of genres by a few root words.
    COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India; 06/2013
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
    ABSTRACT: In this paper, we present a novel approach to identify feature-specific expressions of opinion in product reviews with different features and mixed emotions. The objective is realized by identifying a set of potential features in the review and extracting opinion expressions about those features by exploiting their associations. Capitalizing on the view that more closely associated words come together to express an opinion about a certain feature, dependency parsing is used to identify relations between the opinion expressions. The system learns the set of significant relations to be used by dependency parsing and a threshold parameter which allows us to merge closely associated opinion expressions. The data requirement is minimal, as this is a one-time learning of the domain-independent parameters. The associations are represented in the form of a graph which is partitioned to finally retrieve the opinion expression describing the user-specified feature. We show that the system achieves a high accuracy across all domains and performs on par with state-of-the-art systems despite its data limitations.
    Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I; 06/2013
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
    ABSTRACT: We propose a lightweight method for using discourse relations for polarity detection of tweets. This method is targeted at web-based applications that deal with noisy, unstructured text, such as tweets, and cannot afford to use heavy linguistic resources like parsing, because parsers frequently fail on noisy data. Most work on micro-blogs, like Twitter, uses a bag-of-words model that ignores discourse particles like but, since, although, etc. In this work, we show how discourse relations like connectives and conditionals can be used to incorporate discourse information in any bag-of-words model to improve sentiment classification accuracy. We also probe the influence of semantic operators like modals and negations on the discourse relations that affect the sentiment of a sentence. Discourse relations and corresponding rules are identified with minimal processing: just a list lookup. We first give a linguistic description of the various discourse relations, which leads to conditions in rules and features in SVM. We show that our discourse-based bag-of-words model performs well in a noisy medium (Twitter), where it performs better than an existing Twitter-based application. Furthermore, we show that our approach is beneficial to structured reviews as well, where we achieve a better accuracy than a state-of-the-art system in the travel review domain. Our system compares favorably with the state-of-the-art systems and has the additional attractiveness of being less resource-intensive.
    COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India; 06/2013
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
    ABSTRACT: This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based on those sentences bearing opinion on the movie alone, leaving out other irrelevant text. Wikipedia incorporates the world knowledge of movie-specific features in the system, which is used to obtain an extractive summary of the review, consisting of the reviewer’s opinions about the specific aspects of the movie. This filters out the concepts which are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. It achieves accuracy better than or comparable to that of the existing semi-supervised and unsupervised systems in the domain, on the same dataset. We also perform a general movie review trend analysis using WikiSent.
    Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I; 06/2013
  • ABSTRACT: In this paper, we introduce a new WordNet-based similarity metric, SenSim, which incorporates sentiment content (i.e., degree of positive or negative sentiment) of the words being compared to measure the similarity between them. The proposed metric is based on the hypothesis that knowing the sentiment is beneficial in measuring the similarity. To verify this hypothesis, we measure and compare the annotator agreement for 2 annotation strategies: 1) sentiment information of a pair of words is considered while annotating and 2) sentiment information of a pair of words is not considered while annotating. Inter-annotator correlation scores show that the agreement is better when the two annotators consider sentiment information while assigning a similarity score to a pair of words. We use this hypothesis to measure the similarity between a pair of words. Specifically, we represent each word as a vector containing sentiment scores of all the content words in the WordNet gloss of the sense of that word. These sentiment scores are derived from a sentiment lexicon. We then measure the cosine similarity between the two vectors. We perform both intrinsic and extrinsic evaluation of SenSim and compare the performance with other widely used WordNet similarity metrics.
    In Proceedings of the International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012; 06/2013
  • ABSTRACT: In this paper, we present TwiSent, a sentiment analysis system for Twitter. Based on the topic searched, TwiSent collects tweets pertaining to it and categorizes them into the different polarity classes: positive, negative and objective. However, analyzing micro-blog posts has many inherent challenges compared to other text genres. Through TwiSent, we address the problems of 1) spam pertaining to sentiment analysis in Twitter, 2) structural anomalies in the text in the form of incorrect spellings, nonstandard abbreviations, slang, etc., 3) entity specificity in the context of the topic searched and 4) pragmatics embedded in text. The system performance is evaluated on manually annotated gold standard data and on an automatically annotated tweet set based on hashtags. It is a common practice to show the efficacy of a supervised system on an automatically annotated dataset. However, we show that such a system achieves lower classification accuracy when tested on a generic Twitter dataset. We also show that our system performs much better than an existing system.
    Proceedings of the 21st ACM international conference on Information and knowledge management; 06/2013
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
    ABSTRACT: Our day-to-day life has always been influenced by what people think. Ideas and opinions of others have always affected our own opinions. The explosion of Web 2.0 has led to increased activity in Podcasting, Blogging, Tagging, Contributing to RSS, Social Bookmarking, and Social Networking. As a result, there has been a surge of interest in mining these vast resources of data for opinions. Sentiment Analysis or Opinion Mining is the computational treatment of opinions, sentiments and subjectivity of text. In this report, we take a look at the various challenges and applications of Sentiment Analysis. We will discuss in detail various approaches to perform a computational treatment of sentiments and opinions. Various supervised or data-driven techniques for SA like Naïve Bayes, Maximum Entropy, SVM, and Voted Perceptrons will be discussed and their strengths and drawbacks will be touched upon. We will also see a new dimension of analyzing sentiments by Cognitive Psychology, mainly through the work of Janyce Wiebe, where we will see ways to detect subjectivity and perspective in narrative and to understand the discourse structure. We will also study some specific topics in Sentiment Analysis and the contemporary works in those areas.
    04/2013;
  • A. R. Balamurali, Mitesh M. Khapra, Pushpak Bhattacharyya
    ABSTRACT: Recently there has been a lot of interest in Cross Language Sentiment Analysis (CLSA) using Machine Translation (MT) to facilitate Sentiment Analysis in resource-deprived languages. The idea is to use the annotated resources of one language (say, L1) for performing Sentiment Analysis in another language (say, L2) which does not have annotated resources. The success of such a scheme crucially depends on the availability of an MT system between L1 and L2. We argue that such a strategy ignores the fact that a Machine Translation system is much more demanding in terms of resources than a Sentiment Analysis engine. Moreover, these approaches fail to take into account the divergence in the expression of sentiments across languages. We provide strong experimental evidence to prove that even the best of such systems do not outperform a system trained using only a few polarity-annotated documents in the target language. Having a very large number of documents in L1 also does not help because most Machine Learning approaches converge (or reach a plateau) after a certain training size (as demonstrated by our results). Based on our study, we take the stand that languages which have a genuine need for a Sentiment Analysis engine should focus on collecting a few polarity-annotated documents in their language instead of relying on CLSA.
    Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2; 03/2013
  • Aditya Joshi, Kashyap Popat, Shubham Gautam, Pushpak Bhattacharyya
    ABSTRACT: News headlines exhibit stylistic peculiarities. The goal of our translation engine 'Making Headlines in Hindi' is to achieve automatic translation of English news headlines to Hindi while retaining the Hindi news headline styles. There are two central modules of our engine: the modified translation unit based on Moses and a co-occurrence-based post-processing unit. The modified translation unit provides two machine translation (MT) models: phrase-based and factor-based (both using in-domain data). In addition, a co-occurrence-based post-processing option may be turned on by a user. Our evaluation shows that this engine handles some linguistic phenomena observed in Hindi news headlines.
  • ABSTRACT: In this paper, we present TwiSent, a sentiment analysis system for Twitter. Based on the topic searched, TwiSent collects tweets pertaining to it and categorizes them into the different polarity classes: positive, negative and objective. However, analyzing micro-blog posts has many inherent challenges compared to other text genres. Through TwiSent, we address the problems of 1) spam pertaining to sentiment analysis in Twitter, 2) structural anomalies in the text in the form of incorrect spellings, nonstandard abbreviations, slang, etc., 3) entity specificity in the context of the topic searched and 4) pragmatics embedded in text. The system performance is evaluated on manually annotated gold standard data and on an automatically annotated tweet set based on hashtags. It is a common practice to show the efficacy of a supervised system on an automatically annotated dataset. However, we show that such a system achieves lower classification accuracy when tested on a generic Twitter dataset. We also show that our system performs much better than an existing system.
    09/2012;
  • Subhabrata Mukherjee, Pushpak Bhattacharyya
    ABSTRACT: In this paper, we present a novel approach to identify feature-specific expressions of opinion in product reviews with different features and mixed emotions. The objective is realized by identifying a set of potential features in the review and extracting opinion expressions about those features by exploiting their associations. Capitalizing on the view that more closely associated words come together to express an opinion about a certain feature, dependency parsing is used to identify relations between the opinion expressions. The system learns the set of significant relations to be used by dependency parsing and a threshold parameter which allows us to merge closely associated opinion expressions. The data requirement is minimal, as this is a one-time learning of the domain-independent parameters. The associations are represented in the form of a graph which is partitioned to finally retrieve the opinion expression describing the user-specified feature. We show that the system achieves a high accuracy across all domains and performs on par with state-of-the-art systems despite its data limitations.
    09/2012;
  • ABSTRACT: In this paper, we introduce a new WordNet-based similarity metric, SenSim, which incorporates sentiment content (i.e., degree of positive or negative sentiment) of the words being compared to measure the similarity between them. The proposed metric is based on the hypothesis that knowing the sentiment is beneficial in measuring the similarity. To verify this hypothesis, we measure and compare the annotator agreement for 2 annotation strategies: 1) sentiment information of a pair of words is considered while annotating and 2) sentiment information of a pair of words is not considered while annotating. Inter-annotator correlation scores show that the agreement is better when the two annotators consider sentiment information while assigning a similarity score to a pair of words. We use this hypothesis to measure the similarity between a pair of words. Specifically, we represent each word as a vector containing sentiment scores of all the content words in the WordNet gloss of the sense of that word. These sentiment scores are derived from a sentiment lexicon. We then measure the cosine similarity between the two vectors. We perform both intrinsic and extrinsic evaluation of SenSim and compare the performance with other widely used WordNet similarity metrics.
    09/2012;
  • ABSTRACT: Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out-of-the-box, and achieve state-of-the-art accuracy with further domain customization. However, it is generally recognized that manually building and customizing rules is a complex and labor-intensive process. In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy that are interpretable and thus can be easily refined by a human developer. We present an efficient rule induction process, modeled on a four-stage manual rule development process, and present initial promising results with our system. We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization of basic features on the accuracy and complexity of induced rules. We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure.
    Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning; 07/2012
