About
96
Publications
15,391
Reads
1,241
Citations
Introduction
I am a postdoctoral researcher at Saarland University, Germany, where I currently work on Multi-modal Post-editing (MMPE), an EU-DFG project. I completed my PhD at Saarland University in November 2017. My previous project investigated the ideal translation workflow in a hybrid machine translation framework. I worked as an Early Stage Researcher on the EU project EXPloiting Empirical appRoaches to Translation (EXPERT), sponsored by Marie Curie.
Before my PhD, I completed a Master's in Computer Technology at Jadavpur University and also served as a Research Engineer at Jadavpur University on the project "English to Indian Language Machine Translation (EILMT) Phase I & Phase II", Ministry of Communications & Information Technology, Government of India.
Current institution
Additional affiliations
December 2017 - April 2019
November 2008 - September 2013
November 2013 - present
Education
November 2013 - September 2016
July 2010 - June 2013
Publications (96)
Sarcasm detection in unimodal or multimodal setting is a very complex task. Sarcasm, emotion, and sentiment are related to each other, and hence any multitask model could be an effective way to leverage the interdependence among these tasks. In order to better represent these clandestine associations, we avoid solely relying on traditional machine...
In document-level neural machine translation (DocNMT), multi-encoder approaches are common in encoding context and source sentences. Recent studies (Li et al., 2020) have shown that the context encoder generates noise and makes the model robust to the choice of context. This paper further investigates this observation by explicitly...
Recognizing humor in meme data is a challenging task in natural language processing (NLP) and computer vision (CV) due to the complexity and variability of humor. With the explosive growth of Internet memes on social media platforms such as Facebook, Twitter, and Instagram, this task has become more important. However, there have been few studies t...
Language identification (LID) is a crucial preliminary process in the field of Automatic Speech Recognition (ASR) that involves the identification of a spoken language from audio samples. Contemporary systems that can process speech in multiple languages require users to expressly designate one or more languages prior to utilisation. The LID task a...
Recent studies have shown that multi-encoder models are agnostic to the choice of context, and that the context encoder generates noise which helps improve the models in terms of BLEU score. In this paper, we further explore this idea by evaluating on a context-aware pronoun translation test set, using multi-encoder models trained on three diff...
In this paper, we hypothesize that sarcasm detection is closely associated with the emotion present in memes. Thereafter, we propose a deep multitask model to perform these two tasks in parallel, where sarcasm detection is treated as the primary task, and emotion recognition is considered an auxiliary task. We create a large-scale dataset consistin...
Hate speech is now a frequent occurrence on social media. Recently, most studies have been devoted to identifying hate speech in languages with abundant resources (e.g., English). However, relatively few works address languages with limited resources (e.g., Hindi, the third most widely used language on earth). In this study, Hindi Hate...
Image captioning is a task that has seen major advances over time. Recent methods leverage visual-linguistic grounding of the image-text pair. This includes either generating a textual description of the objects and entities present within the image in a constrained manner, or generating a detailed description of these entities as a paragraph....
Language Identification (LID), a recommended initial step to Automatic Speech Recognition (ASR), is used to detect a spoken language from audio specimens. In state-of-the-art systems capable of multilingual speech processing, however, users have to explicitly set one or more languages before using them. LID, therefore, plays a very important role i...
In this paper, we analyze a wide range of physiological, behavioral, performance, and subjective measures to estimate cognitive load (CL) during post-editing (PE) of machine translated (MT) text. To the best of our knowledge, the analyzed feature set comprises the most diverse set of features from a variety of modalities that has been investigated...
More and more professional translators are switching to the use of post-editing (PE) to increase productivity and reduce errors. Even though PE requires significantly less text production, current computer-aided translation (CAT) interfaces still heavily focus on traditional mouse and keyboard input and ignore other interaction modalities to suppor...
Current advances in machine translation (MT) increase the need for translators to switch from traditional translation to post-editing (PE) of machine-translated text, a process that saves time and reduces errors. This affects the design of translation interfaces, as the task changes from mainly generating text to correcting errors within otherwise...
The shift from traditional translation to post-editing (PE) of machine-translated (MT) text can save time and reduce errors, but it also affects the design of translation interfaces, as the task changes from mainly generating text to correcting errors within otherwise helpful translation proposals. Since this paradigm shift offers potential for mod...
In this paper we present the UDS-DFKI system submitted to the Similar Language Translation shared task at WMT 2019. The first edition of this shared task featured data from three pairs of similar languages: Czech and Polish, Hindi and Nepali, and Portuguese and Spanish. Participants could choose to participate in any of these three tracks and submi...
This paper describes strategies to improve an existing web-based computer-aided translation (CAT) tool entitled CATaLog Online. CATaLog Online provides a post-editing environment with simple yet helpful project management tools. It offers translation suggestions from translation memories (TM), machine translation (MT), and automatic post-editing (A...
In automatic post-editing (APE) it makes sense to condition post-editing (pe) decisions on both the source (src) and the machine translated text (mt) as input. This has led to multi-source encoder based APE approaches. A research challenge now is the search for architectures that best support the capture, preparation and provision of src and mt inf...
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task wa...
In this paper, we develop a model that uses a wide range of physiological and behavioral sensor data to estimate perceived cognitive load (CL) during post-editing (PE) of machine translated (MT) text. By predicting the subjectively reported perceived CL, we aim to quantify the extent of demands placed on the mental resources available during PE. Th...
Current advances in machine translation increase the need for translators to switch from traditional translation to post-editing (PE) of machine-translated text, a process that saves time and improves quality. This affects the design of translation interfaces, as the task changes from mainly generating text to correcting errors within otherwise hel...
Current advances in machine translation increase the need for translators to switch from traditional translation to post-editing of machine-translated text, a process that saves time and improves quality. Human and artificial intelligence need to be integrated in an efficient way to leverage the advantages of both for the translation task. This pap...
This paper presents our English-German Automatic Post-Editing (APE) system submitted to the APE Task organized at WMT 2018 (Chatterjee et al., 2018). The proposed model is an extension of the transformer architecture: two separate self-attention-based encoders encode the machine translation output (mt) and the source (src), followed by a joint en-c...
In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. We investigate the performance of individual features and combine the output of single classifiers to maximize performance. The system com...
In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language. We take the pair Brazilian - European Portuguese as an example and compare the performance of this method to a phrase-based statistical machine translation system. We report a performance improvem...
In this paper we combine two strands of machine translation (MT) research: automatic post-editing (APE) and multi-engine (system combination) MT. APE systems learn a target-language-side second stage MT system from the data produced by human corrected output of a first stage MT system, to improve the output of the first stage MT in what is essentia...
We present a free web-based CAT tool called CATaLog Online which provides a novel and user-friendly online CAT environment for post-editors/translators. The goal is to support distributed translation where teams of translators work simultaneously on different sections of the same text, reduce post-editing time and effort, improve the post-editing e...
We present a neural network based automatic post-editing (APE) system to improve raw machine translation (MT) output. Our neural model of APE (NNAPE) is based on a bidirectional recurrent neural network (RNN) model and consists of an encoder that encodes an MT output into a fixed-length vector from which a decoder provides a post-edited (PE) tran...
This paper proposes a hybrid word alignment model for Phrase-Based Statistical Machine Translation (PB-SMT). The proposed hybrid word alignment model provides the most informative alignment links, which are offered by both unsupervised and semi-supervised word alignment models. Two unsupervised word alignment models, namely GIZA++ and Berkeley aligner,...
This paper reports a classification-based approach to translation memory (TM) cleaning as part of our participation in the shared task on cleaning translation memories organized in NLP4TM-2016. The classification task is based on the extent to which the target TM segment is a proper translation of the source TM segment. Among the three subtasks proposed in the...
This paper explores how translations of unmatched parts of an input sentence can be discovered and inserted into Translation Memory (TM) suggestions generated by a Computer Aided Translation (CAT) tool using a parse tree and part of speech (POS) tags to form a new translation which is more suitable for post-editing. CATaLog (Nayek et al., 2015) is...
This paper presents CATaLog online, a new web-based MT and TM post-editing tool. CATaLog online is freeware that can be used through a web browser and requires only a simple registration. The tool features a number of editing and log functions similar to the desktop version of CATaLog, enhanced with several new features that we describ...
Forest to String Based Statistical Machine Translation (FSBSMT) is a forest-based tree sequence to string translation model for syntax based statistical machine translation. The model automatically learns tree sequence to string translation rules from a given word alignment estimated on a source-side-parsed bilingual parallel corpus. This paper pre...
This paper presents the JU-USAAR English‐German domain adaptive machine translation (MT) system submitted to the IT domain translation task organized in WMT-2016. Our system brings improvements over the in-domain baseline system by incorporating out-of-domain knowledge. We applied two methodologies to improve the performance of our in-domain MT sy...
We describe the USAAR-SAPE English–Spanish Automatic Post-Editing (APE) system submitted to the APE Task organized in the Workshop on Statistical Machine Translation (WMT) in 2015. Our system was able to improve upon the baseline MT system output by incorporating Phrase-Based Statistical MT (PBSMT) technique into the monolingual Statistical APE ta...
This paper describes the UdS-Sant English–German Hybrid Machine Translation (MT) system submitted to the Translation Task organized in the Workshop on Statistical Machine Translation (WMT) 2015. Our proposed hybrid system brings improvements over the baseline system by incorporating additional knowledge such as extracted bilingual named entities an...
The major goal of Automatic post-editing (APE) is to reduce the human post-editing efforts and increase human post-editing productivity by improving the quality of machine translation (MT) output in terms of fluency and adequacy, i.e., the translations produced should be as close as possible to manually post-edited translations. In this paper, we a...
Good performance of Statistical Machine Translation (SMT) is usually achieved with huge parallel bilingual training corpora, because the translations of words or phrases are computed based on bilingual data. However, in the case of low-resource language pairs such as English-Bengali, the performance is affected by an insufficient amount of bilingual trai...
This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from...
We describe the participation of Dublin City University (DCU) in the FIRE-2014 shared task on transliteration search, hereafter referred to as the TST (Transliteration Search Task). The TST involves an ad-hoc search over a collection of Hindi film song lyrics. The Hindi language content of each document of the collection is either written in the nativ...
State-of-the-art Machine Translation (MT) does not perform well when translating sentiment components from source to target language. Components such as the sentiment holders, sentiment expressions and their corresponding objects and relations are not maintained during translation. In this paper, we describe how sentiment analysis ca...
In this paper, we describe the USAAR-DCU machine translation system submitted to the NLP Tools Contest of the International Conference on Natural Language Processing (ICON 2014). The shared task on statistical machine translation in Indian languages encompassed translating from five languages into Hindi in three different domains. Our best syste...
We describe the Manawi (mAnEv) system submitted to the 2014 WMT translation shared task. We participated in the English-Hindi (EN-HI) and Hindi-English (HI-EN) language pairs and achieved 0.792 for the Translation Error Rate (TER) score for EN-HI, the lowest among the competing systems. Our main innovations are (i) the usage of outputs from NLP...
Reordering poses a big challenge in statistical machine translation between distant language pairs. The paper presents how reordering between distant language pairs can be handled efficiently in phrase-based statistical machine translation. The problem of reordering between distant languages has been approached with prior reordering of the source t...
Building parallel resources for corpus-based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in the field of machine translation research. In this paper, we propose an automatic approach for extraction of parallel fragments from comparable corpora. The comparable corp...
Statistical Machine Translation (SMT) delivers a convenient format for representing how the translation process is modeled. The translations of words or phrases are generally computed based on their occurrence in some bilingual training corpus. However, SMT still suffers from out-of-vocabulary (OOV) words and less frequent words, especially when only lim...
Multi Lingual Snippet Generation (MLSG) systems provide the users with snippets in multiple languages. But collecting and managing documents in multiple languages in an efficient way is a difficult task and thereby makes this process more complicated. Fortunately, this requirement can be fulfilled in another way by translating the snippets from one...
Phrase-based statistical machine translation (PB-SMT) provides the state-of-the-art in machine translation (MT) today. However, unlike syntax-augmented MT systems, it has proven difficult to integrate syntactic knowledge in order to improve translation quality in PB-SMT. This paper describes the effects of linguistically motivated shallow phrases...
Multiword Expressions (MWEs) contribute to major lexical ambiguity problems for any language and pose a big challenge in statistical machine translation. This paper presents the role of MWEs in improving the performance of a phrase-based Statistical Machine Translation (PB-SMT) system. We preprocess the parallel corpus by single tokenizing the MWEs...
In this article, we present an automated approach of extracting English-Bengali parallel fragments of text from comparable corpora created using Wikipedia documents. Our approach exploits the multilingualism of Wikipedia. The most important fact is that this approach does not need any domain specific corpus. We have been able to improve the BLEU s...
This paper proposes a hybrid word alignment model for a Phrase-Based Statistical Machine Translation (PB-SMT) system. Word alignment is the backbone of a PB-SMT system or of any data-driven, corpus-based Machine Translation (MT) system. Here, we present the most informative alignment links that are offered under both unsupervised and semi-supervised word...
This paper studies the impact of event and event actor alignment in an English-Bengali phrase-based Statistical Machine Translation (PB-SMT) system. Initially, events and event actors are identified from an English and Bengali parallel corpus. For event and event actor identification in English, we proposed a hybrid technique and it was carried out...
The paper describes an SMS-based FAQ retrieval system. The goal of this task is to find a question Q* from corpora of FAQs (Frequently Asked Questions) that best answers or matches the SMS query S. The test corpus used in this paper contained FAQs in three languages: English, Hindi and Malayalam. The FAQs were from several domains, including railwa...
This paper reports on our work in the HOO 2012 shared task. The task is to automatically detect, recognize and correct the errors in the use of prepositions and determiners in a set of given test documents in English. For that, we have developed a hybrid system of an n-gram statistical model along with some rule-based techniques. The system has bee...
The processing of the parallel corpus plays a crucial role in improving the overall performance of Phrase-Based Statistical Machine Translation (PB-SMT) systems. In this paper, we study the automatic alignment of different kinds of chunks, which boosts word alignment as well as machine translation quality. Single-tokenization of No...
Data-preprocessing plays a crucial role in Phrase based Statistical Machine Translation (PB-SMT). The present work reports how improved word alignment can boost the performance of a PB-SMT system. A hybrid word alignment system has been developed consisting of a rule based and a statistical word mapping system that automatically aligns words betwee...
This paper reports on our work in the HOO shared task 2011. The task is to automatically correct the English of a given document. For that, we have developed a hybrid system that combines a statistical CRF-based model with a rule-based technique. The system has been trained on the HOO shared task training datasets and run on the test se...
The measurement of relative compositionality of bigrams is crucial to identify Multi-word Expressions (MWEs) in Natural Language Processing (NLP) tasks. The article presents the experiments carried out as part of the participation in the shared task 'Distributional Semantics and Compositionality (DiSCo)' organized as part of the DiSCo workshop in A...
Preprocessing of the parallel corpus plays an important role in improving the performance of a phrase-based statistical machine translation (PB-SMT) system. In this paper, we propose a framework in which predefined information of Multiword Expressions (MWEs) can boost the performance of PB-SMT. We preprocess the parallel corpus to identify Noun-noun MWEs...
This paper reports on the development of a cross-language text re-use detection system as part of the cross-language text re-use detection task in FIRE 2011. Here, cross-language text re-use detection is treated as a problem of Information Retrieval and is solved with the help of Nutch, an open source Information Retrieval (IR) system. O...
The note describes the Recognizing Textual Entailment (RTE) system developed at the Computer Science and Engineering Department, Jadavpur University, India. In this competition, we participated and submitted the results in the RTE-6 Main Task (3 runs), Novelty Task (3 runs) and RTE-6 KBP task (3 runs for generic task and 3 runs for tailored task)....
We present an Answer Validation System (AV) based on Textual Entailment and Question Answering. The important features used to develop the AV system are Named Entity Recognition, Textual Entailment, Question-Answer type Analysis and Chunk Boundary and Dependency relations. Separate AV modules have been developed for each of these features. We first...
The article presents the experiments carried out as part of the participation in the Paragraph Selection (PS) Task and Answer Selection (AS) Task of QA@CLEF 2010 – ResPubliQA. Our system uses Apache Lucene for document retrieval. All test documents are indexed using Apache Lucene. Stop words are removed from each question and query...
This paper presents the automatic extraction of Complex Predicates (CPs) in Bengali with a special focus on compound verbs (Verb + Verb) and conjunct verbs (Noun/Adjective + Verb). The lexical patterns of compound and conjunct verbs are extracted based on the information of shallow morphology and available seed lists of verbs. Lexical scopes of...
Data preprocessing plays a crucial role in phrase-based statistical machine translation (PB-SMT). In this paper, we show how single-tokenization of two types of multi-word expressions (MWE), namely named entities (NE) and compound verbs, as well as their prior alignment can boost the performance of PB-SMT. Single-tokenization of compound verbs and...
This article presents the experiments carried out at Jadavpur University as part of the participation in Multi-Way Classification of Semantic Relations between Pairs of Nominals in the SemEval 2010 exercise. Separate rules for each type of the relations are identified in the baseline model based on the verbs and prepositions present in the segme...
The article presents the experiments carried out as part of the participation in the Task-B of Question Generation Challenge, 2010. In the present task, generating questions from sentences requires preprocessing of the corpus with the additional knowledge of parsing, Semantic Role Labeling (SRL), Named Entities (NE) and causal relaters. The cla...
Questions (2)
but I also need some human editing logs or some kind of cognitive process
Is there any textbook on neural networks available online? Which is the best book to start with for neural networks in NLP applications?