Publications

  • Abhijit Mishra · Diptesh Kanojia · Pushpak Bhattacharyya
    ABSTRACT: Sarcasm understandability or the ability to understand textual sarcasm depends upon readers' language proficiency, social knowledge, mental state and attentiveness. We introduce a novel method to predict the sarcasm understandability of a reader. Presence of incongruity in textual sarcasm often elicits distinctive eye-movement behavior by human readers. By recording and analyzing the eye-gaze data, we show that eye-movement patterns vary when sarcasm is understood vis-à-vis when it is not. Motivated by our observations, we propose a system for sarcasm understandability prediction using supervised machine learning. Our system relies on readers' eye-movement parameters and a few textual features, thence, is able to predict sarcasm understandability with an F-score of 93%, which demonstrates its efficacy. The availability of inexpensive embedded-eye-trackers on mobile devices creates avenues for applying such research which benefits web-content creators, review writers and social media analysts alike.
    Full-text · Conference Paper · Feb 2016
  • Diptesh Kanojia · Raj Dabre · Pushpak Bhattacharyya
    ABSTRACT: India is a country with 22 officially recognized languages and 17 of these have WordNets, a crucial resource. Web browser based interfaces are available for these WordNets, but are not suited for mobile devices which deters people from effectively using this resource. We present our initial work on developing mobile applications and browser extensions to access WordNets for Indian Languages. Our contribution is two fold: (1) We develop mobile applications for the Android, iOS and Windows Phone OS platforms for Hindi, Marathi and Sanskrit WordNets which allow users to search for words and obtain more information along with their translations in English and other Indian languages. (2) We also develop browser extensions for English, Hindi, Marathi, and Sanskrit WordNets, for both Mozilla Firefox, and Google Chrome. We believe that such applications can be quite helpful in a classroom scenario, where students would be able to access the WordNets as dictionaries as well as lexical knowledge bases. This can help in overcoming the language barrier along with furthering language understanding.
    Full-text · Conference Paper · Jan 2016
  • ABSTRACT: This paper reports the work of creating bilingual mappings in English for certain synsets of Hindi wordnet, the need for doing this, the methods adopted and the tools created for the task. Hindi wordnet, which forms the foundation for other Indian language wordnets, has been linked to the English WordNet. To maximize linkages, an important strategy of using direct and hypernymy linkages has been followed. However, the hypernymy linkages were found to be inadequate in certain cases and posed a challenge due to sense granularity of language. Thus, the idea of creating bilingual mappings was adopted as a solution. A bilingual mapping means a linkage between a concept in two different languages, with the help of translation and/or transliteration. Such mappings retain meaningful representations, while capturing semantic similarity at the same time. This has also proven to be a great enhancement of Hindi wordnet and can be a crucial resource for multilingual applications in natural language processing, including machine translation and cross language information retrieval.
    Full-text · Conference Paper · Jan 2016
  • Diptesh Kanojia · Shehzaad Dhuliawala · Pushpak Bhattacharyya
    ABSTRACT: WordNet has proved to be immensely useful for Word Sense Disambiguation, and thence Machine translation, Information Retrieval and Question Answering. It can also be used as a dictionary for educational purposes. The semantic nature of concepts in a WordNet motivates one to try to express this meaning in a more visual way. In this paper, we describe our work of enriching IndoWordNet with image acquisitions from the OpenClipArt library. We describe an approach used to enrich WordNets for eighteen Indian languages. Our contribution is three fold: (1) We develop a system, which, given a synset in English, finds an appropriate image for the synset. The system uses the OpenClipArt library (OCAL) to retrieve images and ranks them. (2) After retrieving the images, we map the results along with the linkages between Princeton WordNet and Hindi WordNet, to link several synsets to corresponding images. We choose and sort top three images based on our ranking heuristic per synset. (3) We develop a tool that allows a lexicographer to manually evaluate these images. The top images are shown to a lexicographer by the evaluation tool for the task of choosing the best image representation. The lexicographer also selects the number of relevant images. Using our system, we obtain an Average Precision (P@3) score of 0.30.
    Full-text · Conference Paper · Jan 2016
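
The P@3 figure reported in the abstract above can be illustrated with a minimal sketch. It assumes, since the abstract does not spell this out, that the score is the fraction of the top three retrieved images a lexicographer marks as relevant, averaged over synsets. The function names and the toy judgments are hypothetical and are not taken from the paper.

```python
def precision_at_k(relevance, k=3):
    """Fraction of the top-k ranked items judged relevant.

    `relevance` is a ranked list of booleans (True = judged relevant).
    Missing positions (fewer than k results) count as not relevant.
    """
    top = relevance[:k]
    return sum(1 for judged in top if judged) / k


def mean_precision_at_k(judgments, k=3):
    """Average precision@k over all synsets in `judgments`.

    `judgments` maps a synset id to its ranked relevance list.
    """
    if not judgments:
        return 0.0
    total = sum(precision_at_k(ranked, k) for ranked in judgments.values())
    return total / len(judgments)


# Hypothetical lexicographer judgments for three synsets (toy data).
TOY_JUDGMENTS = {
    "dog": [True, False, False],
    "tree": [True, True, False],
    "idea": [False, False, False],
}

if __name__ == "__main__":
    print(mean_precision_at_k(TOY_JUDGMENTS))
```

Under this reading, a score of 0.30 would mean that roughly one of the three images shown per synset is judged relevant on average.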
  • ABSTRACT: We present TransChat, an open source, cross platform, Indian language Instant Messaging (IM) application that facilitates cross lingual textual communication over English and multiple Indian Languages. The application is a client-server IM architecture based chat system with multiple Statistical Machine Translation (SMT) engines working towards efficient translation and transmission of messages. TransChat allows users to select their preferred language and internally, selects appropriate translation engine based on the input configuration. For translation quality enhancement, necessary pre- and post-processing steps are applied on the input and output chat-texts. We demonstrate the efficacy of TransChat through a series of qualitative evaluations that test: (a) the usability of the system, and (b) the quality of the translation output. In a multilingual country like India, such applications can help overcome language barrier in domains like tourism, agriculture and health.
    Full-text · Conference Paper · Dec 2015
  • Naman Gupta · Aditya Joshi · Pushpak Bhattacharyya
    ABSTRACT: A bottleneck for medical domain Temporal Expression Recognition (TER) is the availability of data. An open-domain TER system may not be able to capture domain-specific expressions, while domain-specific TER may be cumbersome to implement. We present a novel neural network based medical TER system that uses corpora from news and medical domains. Thus, it serves as a middle ground between an open-domain and a domain-specific TER. We show that our system outperforms state-of-art open-domain baselines, and gets close to domain-specific skylines. Thus, our system proves to be a promising alternative for domain specific TER for domains where data may be limited.
    Full-text · Conference Paper · Dec 2015
  • Diptesh Kanojia · Aditya Joshi · Pushpak Bhattacharyya · Mark James Carman
    ABSTRACT: Parallel corpora are often injected with bilingual dictionaries for improved Indian language machine translation (MT). In absence of such dictionaries, a coarse dictionary may be required. This paper demonstrates the use of a multilingual topic model for creating coarse dictionaries for English-Hindi MT. We compare our approaches with: (a) a baseline with no additional dictionary injection, and (b) a corpus with a good quality dictionary. Our results show that the existing Cartesian product approach which is used to create the pseudo-parallel data results in a degradation on tourism and health datasets, for English-Hindi MT. Our paper points to the fact that existing Cartesian approach using multilingual topics (devised for European languages) may be detrimental for Indian language MT. On the other hand, we present an alternate 'sentential' approach that leads to a slight improvement. However, our sentential approach (using a parallel corpus injected with a coarse dictionary) outperforms a system trained using parallel corpus and a good quality dictionary.
    Full-text · Conference Paper · Dec 2015
  • Anupam Khattri · Aditya Joshi · Pushpak Bhattacharyya · Mark James Carman
    ABSTRACT: Sarcasm understanding may require information beyond the text itself, as in the case of 'I absolutely love this restaurant!' which may be sarcastic, depending on the contextual situation. We present the first quantitative evidence to show that historical tweets by an author can provide additional context for sarcasm detection. Our sarcasm detection approach uses two components: a contrast-based predictor (that identifies if there is a sentiment contrast within a target tweet), and a historical tweet-based predictor (that identifies if the sentiment expressed towards an entity in the target tweet agrees with sentiment expressed by the author towards that entity in the past).
    Full-text · Conference Paper · Sep 2015
  • Aditya Joshi · Anoop Kunchukuttan · Pushpak Bhattacharyya · Mark James Carman
    ABSTRACT: Sarcasm detection is a recent innovation in sentiment analysis research. However, there has been no attention to sarcasm generation. We present a sarcasm-generation module for chatbots. The uniqueness of 'SarcasmBot' is that it generates a sarcastic response for a user input. SarcasmBot is a sarcasm generation module that implements eight rule-based sarcasm generators, each of which generates a certain type of sarcastic expression. One of these sarcasm generators is selected at run-time, based on properties of user input such as question type, number of entities, etc. We evaluate our sarcasm-generation module in two ways: (a) a qualitative evaluation on three parameters: coherence, grammatical correctness and sarcastic nature, where all scores are above 0.69 out of 1, and (b) a comparative evaluation between SarcasmBot and ALICE, where a majority of our human evaluators are able to identify the output of SarcasmBot among two outputs, in 70.97% of test examples.
    Full-text · Conference Paper · Aug 2015
  • Aditya Joshi · Abhijit Mishra · Balamurali Ar · Pushpak Bhattacharyya · Mark James Carman
    ABSTRACT: Alcohol abuse may lead to unsociable behavior such as crime, drunk driving, or privacy leaks. We introduce automatic drunk-texting prediction as the task of identifying whether a text was written when under the influence of alcohol. We experiment with tweets labeled using hashtags as distant supervision. Our classifiers use a set of N-gram and stylistic features to detect drunk tweets. Our observations present the first quantitative evidence that text contains signals that can be exploited to detect drunk-texting.
    Full-text · Conference Paper · Jul 2015
  • Aditya Joshi · Vinita Sharma · Pushpak Bhattacharyya
    ABSTRACT: The relationship between context incongruity and sarcasm has been studied in linguistics. We present a computational system that harnesses context incongruity as a basis for sarcasm detection. Our statistical sarcasm classifiers incorporate two kinds of incongruity features: explicit and implicit. We show the benefit of our incongruity features for two text forms: tweets and discussion forum posts. Our system also outperforms two past works (with F-score improvement of 10-20%). We also show how our features can capture inter-sentential incongruity.
    Full-text · Conference Paper · Jul 2015
  • ABSTRACT: An acid test for any new Word Sense Disambiguation (WSD) algorithm is its performance against the Most Frequent Sense (MFS). The field of WSD has found the MFS baseline very hard to beat. Clearly, if WSD researchers had access to MFS values, their striving to better this heuristic will push the WSD frontier. However, getting MFS values requires sense annotated corpus in enormous amounts, which is out of bounds for most languages, even if their WordNets are available. In this paper, we propose an unsupervised method for MFS detection from the untagged corpora, which exploits word embeddings. We compare the word embedding of a word with all its sense embeddings and obtain the predominant sense with the highest similarity. We observe significant performance gain for Hindi WSD over the WordNet First Sense (WFS) baseline. As for English, the SemCor baseline is bettered for those words whose frequency is greater than 2. Our approach is language and domain independent.
    Full-text · Conference Paper · Jun 2015
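
The comparison step described in the abstract above (score each sense embedding against the word embedding and keep the most similar sense) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how sense embeddings are built, so the sketch assumes each one is the mean of the vectors of a few words associated with the sense, and it assumes cosine similarity as the similarity measure. The two-dimensional vectors, sense ids and function names are all hypothetical toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sense_embedding(sense_words, embeddings):
    """ASSUMPTION: a sense vector is the mean of its associated word vectors."""
    known = [embeddings[w] for w in sense_words if w in embeddings]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def predominant_sense(word, senses, embeddings):
    """Return the sense id whose embedding is most similar to the word's."""
    word_vec = embeddings[word]
    best_sense, best_score = None, -math.inf
    for sense_id, sense_words in senses.items():
        vec = sense_embedding(sense_words, embeddings)
        if vec is None:
            continue  # no known words for this sense, so it cannot be scored
        score = cosine(word_vec, vec)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense


# Hypothetical 2-d embeddings; real ones would be trained on an untagged corpus.
TOY_EMBEDDINGS = {
    "bank": [0.9, 0.1],
    "money": [1.0, 0.0],
    "deposit": [0.8, 0.2],
    "river": [0.0, 1.0],
    "shore": [0.1, 0.9],
}
TOY_SENSES = {
    "bank.finance": ["money", "deposit"],
    "bank.river": ["river", "shore"],
}

if __name__ == "__main__":
    print(predominant_sense("bank", TOY_SENSES, TOY_EMBEDDINGS))
```

Because the method needs only embeddings and a sense inventory, no sense-annotated corpus is involved, which matches the unsupervised setting the abstract describes.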
  • Hanumant Redkar · Sudha Bhingardive · Diptesh Kanojia · Pushpak Bhattacharyya
    ABSTRACT: WordNet is an online lexical resource which expresses unique concepts in a language. English WordNet is the first WordNet which was developed at Princeton University. Over a period of time, many language WordNets were developed by various organizations all over the world. It has always been a challenge to store the WordNet data. Some WordNets are stored using file system and some WordNets are stored using different database models. In this paper, we present the World WordNet Database Structure which can be used to efficiently store the WordNet information of all languages of the World. This design can be adapted by most language WordNets to store information such as synset data, semantic and lexical relations, ontology details, language specific features, linguistic information, etc. An attempt is made to develop Application Programming Interfaces to manipulate the data from these databases. This database structure can help in various Natural Language Processing applications like Multilingual Information Retrieval, Word Sense Disambiguation, Machine Translation, etc.
    Full-text · Conference Paper · Jan 2015
  • Diptesh Kanojia · Manish Srivastava · Raj Dabre · Pushpak Bhattacharyya
    ABSTRACT: We present a Parallel Corpora Management tool that aids parallel corpora generation for the task of Machine Translation (MT). It takes source and target text of a corpus for any language pair in text file format, or zip archives containing multiple corresponding text files. Then, it provides lexicographers with a helpful interface for manual translation/validation, and gives out the corrected text files as output. It provides various dictionary references as help within the interface which increase the productivity and efficiency of a lexicographer. It also provides automatic translation of the source sentence using an integrated MT system. The tool interface includes a corpora management system which facilitates maintenance of parallel corpora by assigning roles such as manager, lexicographer etc. We have designed a novel tool that provides aids like references to various dictionary sources such as Wordnets, Shabdkosh, Wiktionary etc. We also provide manual word alignment correction which is visualized in the tool and can lead to its gamification in the future, thus providing a valuable source of word/phrase alignments.
    Full-text · Conference Paper · Dec 2014
  • ABSTRACT: We present our work on developing fifteen Hierarchical Phrase Based Statistical Machine Translation (HPB-SMT) systems for five Indian language pairs namely Bengali-Hindi, English-Hindi, Marathi-Hindi, Tamil-Hindi, and Telugu-Hindi, in three domains each, HEALTH, TOURISM and GENERAL. We named them PanchBhoota, as these systems are elemental in nature. We used a very simple approach to train, tune, and test them using the cdec toolkit. We hope that this work will motivate Indian Language Machine Translation researchers to look deeper into the field of HPB-SMT which is known to perform better than Phrase Based Statistical Machine Translation.
    Full-text · Conference Paper · Dec 2014
  • ABSTRACT: WordNet is a large lexical resource expressing distinct concepts in a language. Synset is a basic building block of the WordNet. In this paper, we introduce a web based lexicographer's interface 'Synskarta' which is developed to create synsets from source language to target language with special reference to Sanskrit WordNet. We focus on introduction and implementation of Synskarta and how it can help to overcome the limitations of the existing system. Further, we highlight the features, advantages, limitations and user evaluations of the same. Finally, we mention the scope and enhancements to the Synskarta and its usefulness in the entire IndoWordNet community.
    Full-text · Conference Paper · Dec 2014
  • Aditya Joshi · Abhijit Mishra · Pushpak Bhattacharyya

    No preview · Conference Paper · Jun 2014
  • Shubham Gautam · Pushpak Bhattacharyya

    No preview · Conference Paper · Jun 2014

  • No preview · Conference Paper · Jun 2014
  • Diptesh Kanojia · Pushpak Bhattacharyya · Raj Dabre · Siddhartha Gunti · Manish Shrivastava
    ABSTRACT: The task of Word Sense Disambiguation (WSD) incorporates in its definition the role of 'context'. We present our work on the development of a tool which allows for automatic acquisition and ranking of 'context clues' for WSD. These clue words are extracted from the contexts of words appearing in a large monolingual corpus. This mined collection of contextual clues forms a discrimination net in the sense that for targeted WSD, navigation of the net leads to the correct sense of a word given its context. Utilizing this resource we intend to develop efficient and light weight WSD based on look up and navigation of memory-resident knowledge base, thereby avoiding heavy computation which often prevents incorporation of any serious WSD in MT and search. The need for large quantities of sense marked data too can be reduced.
    Full-text · Conference Paper · Jan 2014
