Preprint

Joint Multitask Learning for Community Question Answering Using Task-Specific Embeddings

Authors:

Abstract

We address jointly two important tasks for Question Answering in community forums: given a new question, (i) find related existing questions, and (ii) find relevant answers to this new question. We further use an auxiliary task to complement the previous two, i.e., (iii) find good answers with respect to the thread question in a question-comment thread. We use deep neural networks (DNNs) to learn meaningful task-specific embeddings, which we then incorporate into a conditional random field (CRF) model for the multitask setting, performing joint learning over a complex graph structure. While DNNs alone achieve competitive results when trained to produce the embeddings, the CRF, which makes use of the embeddings and the dependencies between the tasks, improves the results significantly and consistently across a variety of evaluation metrics, thus showing the complementarity of DNNs and structured learning.
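As a rough, hypothetical illustration of how these pieces fit together (this is not the paper's actual model, and all scores are invented), the sketch below treats DNN outputs as unary log-potentials for the related-question and answer-relevance decisions, couples them with a pairwise compatibility table, and finds the best joint assignment over a toy graph by enumeration:

```python
import itertools
import numpy as np

# Toy joint model: two related-question candidates r1, r2, each with one
# answer candidate. Binary labels: 1 = relevant/good, 0 = not.
# Unary scores stand in for DNN outputs (log-potentials from task-specific
# embeddings); the pairwise table couples task (i) with task (ii).
unary = {                          # hypothetical DNN log-scores for labels 0/1
    "r1": np.array([0.2, 1.1]),    # related-question task (i)
    "a1": np.array([0.5, 0.9]),    # answer-relevance task (ii)
    "r2": np.array([1.0, 0.1]),
    "a2": np.array([0.4, 0.6]),
}
# Assumed pairwise log-potential: a relevant answer is more compatible
# with a related question than with an unrelated one.
pairwise = np.array([[0.5, -0.5],
                     [-0.5, 1.0]])     # rows: r label, cols: a label
edges = [("r1", "a1"), ("r2", "a2")]

def joint_score(assign):
    s = sum(unary[v][assign[v]] for v in unary)
    s += sum(pairwise[assign[r], assign[a]] for r, a in edges)
    return s

# Exact MAP by enumeration (fine for a toy graph; real threads need
# approximate joint inference over the full CRF structure).
best = max(
    (dict(zip(unary, labs)) for labs in itertools.product([0, 1], repeat=4)),
    key=joint_score,
)
print(best, joint_score(best))
```

Real threads make enumeration infeasible, which is where approximate joint inference over the CRF graph comes in.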




References
Article
This survey presents an overview of information retrieval, natural language processing and machine learning research that makes use of forum data, including both discussion forums and community question-answering (cQA) archives. The focus is on automated analysis, with the goal of gaining a better understanding of the data and its users. We discuss the different strategies used for both retrieval tasks (post retrieval, question retrieval, and answer retrieval) and classification tasks (post type classification, question classification, post quality assessment, subjectivity, and viewpoint classification) at the post level, as well as at the thread level (thread retrieval, solvedness and task orientation, discourse structure recovery and dialogue act tagging, QA-pair extraction, and thread summarisation). We also review work on forum users, including user satisfaction, expert finding, question recommendation and routing, and community analysis. The survey includes a brief history of forums, an overview of the different kinds of forums, a summary of publicly available datasets for forum research, and a short discussion on the evaluation of retrieval tasks using forum data. The aim is to give a broad overview of the different kinds of forum research, a summary of the methods that have been applied, some insights into successful strategies, and potential areas for future research.
Conference Paper
We study how to find relevant questions in community forums when the language of the new questions is different from that of the existing questions in the forum. In particular, we explore the Arabic-English language pair. We compare a kernel-based system with a feed-forward neural network in a scenario where a large parallel corpus is available for training a machine translation system, bilingual dictionaries, and cross-language word embeddings. We observe that both approaches degrade in performance when working on the translated text, especially the kernel-based system, which depends heavily on a syntactic kernel. We address this issue using a cross-language tree kernel, which compares the original Arabic tree to the English trees of the related questions. We show that this kernel almost closes the performance gap with respect to the monolingual system. On the neural network side, we use the parallel corpus to train cross-language embeddings, which we then use to represent the Arabic input and the English related questions in the same space. The results improve to nearly match those of the monolingual neural network. Overall, the kernel-based system performs better than the neural network in all cases.
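One standard recipe for building such a shared space, shown here purely as a hedged sketch (the embeddings in the paper may well be trained differently), learns a least-squares linear map from source-language vectors to target-language vectors over a seed dictionary of translation pairs:

```python
import numpy as np

# Cross-language embedding mapping via least squares (illustrative only):
# X holds source-language vectors for seed translation pairs, Y their
# target-language counterparts. Random data stands in for real embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))               # source vectors (seed pairs)
Y = rng.normal(size=(100, 50))               # their target translations
W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # minimizes ||XW - Y||

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Map a source vector into the target space and compare by cosine.
print(cosine(X[0] @ W, Y[0]))
```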
Conference Paper
We transfer a key idea from the field of sentiment analysis to a new domain: community question answering (cQA). The cQA task we are interested in is the following: given a question and a thread of comments, we want to re-rank the comments, so that the ones that are good answers to the question would be ranked higher than the bad ones. We notice that good vs. bad comments use specific vocabulary and that one can often predict the goodness/badness of a comment even ignoring the question, based on the comment contents only. This leads us to the idea of building a good/bad polarity lexicon as an analogy to the positive/negative sentiment polarity lexicons commonly used in sentiment analysis. In particular, we use pointwise mutual information to build large-scale goodness polarity lexicons in a semi-supervised manner, starting from a small number of initial seeds. The evaluation results show an improvement of 0.7 MAP points absolute over a very strong baseline, and state-of-the-art performance on SemEval-2016 Task 3.
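The lexicon-building step itself is compact: count word/label co-occurrences and score each word by the difference between its PMI with the good class and its PMI with the bad class. The toy comments and the zero-count handling below are assumptions made for illustration:

```python
import math
from collections import Counter

# Toy corpus of (comment_tokens, label) pairs; in the semi-supervised
# setting described above the labels would come from a small seed set.
comments = [
    (["thanks", "this", "works"], "good"),
    (["check", "this", "link"], "good"),
    (["lol", "spam", "link"], "bad"),
    (["spam", "offer"], "bad"),
]

word_label, word_count, label_count = Counter(), Counter(), Counter()
total = 0
for tokens, label in comments:
    for w in tokens:
        word_label[(w, label)] += 1
        word_count[w] += 1
        label_count[label] += 1
        total += 1

def goodness(word):
    """PMI(word, good) - PMI(word, bad), as in sentiment polarity lexicons."""
    def pmi(w, lab):
        c = word_label[(w, lab)]
        if c == 0:
            return 0.0   # simple smoothing choice for this sketch
        # PMI = log( c(w,l) * N / (c(w) * c(l)) )
        return math.log(c * total / (word_count[w] * label_count[lab]))
    return pmi(word, "good") - pmi(word, "bad")

for w in ["thanks", "spam", "this"]:
    print(w, round(goodness(w), 3))
```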
Article
We present a framework for machine translation evaluation using neural networks in a pairwise setting, where the goal is to select the better translation from a pair of hypotheses, given the reference translation. In this framework, lexical, syntactic and semantic information from the reference and the two hypotheses is embedded into compact distributed vector representations, and fed into a multi-layer neural network that models nonlinear interactions between each of the hypotheses and the reference, as well as between the two hypotheses. We experiment with the benchmark datasets from the WMT Metrics shared task, on which we obtain the best results published so far, with the basic network configuration. We also perform a series of experiments to analyze and understand the contribution of the different components of the network. We evaluate variants and extensions, including fine-tuning of the semantic embeddings, and sentence-based representations modeled with convolutional and recurrent neural networks. In summary, the proposed framework is flexible and generalizable, allows for efficient learning and scoring, and provides an MT evaluation metric that correlates with human judgments, and is on par with the state of the art.
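As a hedged sketch of this pairwise setup (dimensions, weights, and the single hidden layer are placeholders rather than the paper's configuration), an untrained toy scorer concatenates the reference and both hypothesis embeddings and lets the sign of an MLP output pick the preferred hypothesis:

```python
import numpy as np

# Minimal shape of the pairwise setting: embed the reference and the two
# hypotheses, feed them jointly to a small network, read off a preference.
rng = np.random.default_rng(0)
d, h = 8, 16
W1, b1 = rng.normal(size=(h, 3 * d)) * 0.1, np.zeros(h)
w2 = rng.normal(size=h) * 0.1

def prefer_first(ref, hyp1, hyp2):
    x = np.concatenate([ref, hyp1, hyp2])
    score = w2 @ np.tanh(W1 @ x + b1)   # > 0 means hyp1 preferred
    return score > 0

ref, h1, h2 = (rng.normal(size=d) for _ in range(3))
print(prefer_first(ref, h1, h2))   # a real system would train W1, b1, w2
```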
Conference Paper
State-of-the-art sequence labeling systems traditionally require large amounts of task-specific knowledge in the form of hand-crafted features and data pre-processing. In this paper, we introduce a novel neural network architecture that benefits from both word- and character-level representations automatically, by using a combination of bidirectional LSTM, CNN, and CRF components. Our system is truly end-to-end, requiring no feature engineering or data pre-processing, thus making it applicable to a wide range of sequence labeling tasks in different languages. We evaluate our system on two datasets for two sequence labeling tasks: the Penn Treebank WSJ corpus for part-of-speech (POS) tagging and the CoNLL 2003 corpus for named entity recognition (NER). We obtain state-of-the-art performance on both: 97.55% accuracy for POS tagging and 91.21% F1 for NER.
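A minimal sketch of the word-plus-character representation (not the paper's exact architecture or sizes) builds a character-level CNN with max-over-time pooling and concatenates its output with a word embedding; the resulting token vector is what would feed the BiLSTM and then the CRF layer:

```python
import numpy as np

# Character CNN + word embedding per token. All sizes and the random
# embeddings are placeholders for learned parameters.
rng = np.random.default_rng(0)
char_emb = rng.normal(size=(128, 16))    # ASCII chars -> 16-dim vectors
filters = rng.normal(size=(30, 3, 16))   # 30 filters of width 3

def char_cnn(word):
    chars = char_emb[[ord(c) % 128 for c in word]]          # (len, 16)
    if len(word) < 3:                                       # pad short words
        chars = np.pad(chars, ((0, 3 - len(word)), (0, 0)))
    windows = np.stack([chars[i:i + 3] for i in range(len(chars) - 2)])
    feats = np.einsum("nwk,fwk->nf", windows, filters)      # convolution
    return feats.max(axis=0)                                # max pooling

word_emb = {"the": rng.normal(size=50), "cat": rng.normal(size=50)}
token_repr = np.concatenate([word_emb["cat"], char_cnn("cat")])
print(token_repr.shape)   # (80,) -> input to the BiLSTM, then the CRF
```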
Conference Paper
Community question answering, a recent evolution of question answering in the Web context, allows a user to quickly consult the opinion of a number of people on a particular topic, thus taking advantage of the wisdom of the crowd. Here we try to help the user by deciding automatically which answers are good and which are bad for a given question. In particular, we focus on exploiting the output structure at the thread level in order to make more consistent global decisions. More specifically, we exploit the relations between pairs of comments at any distance in the thread, which we incorporate into graph-cut and ILP frameworks. We evaluated our approach on the benchmark dataset of SemEval-2015 Task 3. Results improved over the state of the art, confirming the importance of using thread-level information.
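A toy version of the ILP side of this idea, written with the PuLP modeling library under assumed scores and similarities, might look as follows; the linearized agreement term only rewards similar comment pairs that are both labeled good, which simplifies the paper's actual formulation:

```python
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

# Toy ILP over a 3-comment thread: y_i = 1 marks comment i as good;
# pairwise terms reward agreement between similar comments. All numbers
# are invented for illustration.
good_score = [0.8, -0.2, 0.3]                   # per-comment classifier margins
sim = {(0, 1): 0.6, (1, 2): 0.1, (0, 2): 0.4}   # pairwise similarities

prob = LpProblem("thread_labels", LpMaximize)
y = [LpVariable(f"y{i}", cat=LpBinary) for i in range(3)]
# agreement variable a_ij = 1 only when y_i = y_j = 1 (linearized product)
a = {(i, j): LpVariable(f"a_{i}_{j}", cat=LpBinary) for (i, j) in sim}
for (i, j), var in a.items():
    prob += var <= y[i]
    prob += var <= y[j]
    prob += var >= y[i] + y[j] - 1

prob += lpSum(good_score[i] * y[i] for i in range(3)) + \
        lpSum(sim[e] * a[e] for e in sim)
prob.solve()
print([int(v.value()) for v in y])
```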
Conference Paper
The emergence of user forums in electronic news media has given rise to the proliferation of opinion manipulation trolls. Finding such trolls automatically is a hard task, as there is no easy way to recognize or even to define what they are; this also makes it hard to get training and testing data. We solve this issue pragmatically: we assume that a user who is called a troll by several people is likely to be one. We experiment with different variations of this definition, and in each case we show that we can train a classifier to distinguish a likely troll from a non-troll with very high accuracy, 82–95%, thanks to our rich feature set.
Article
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Conference Paper
We apply Stochastic Meta-Descent (SMD), a stochastic gradient optimization method with gain vector adaptation, to the training of Conditional Random Fields (CRFs). On several large data sets, the resulting optimizer converges to the same quality of solution over an order of magnitude faster than limited-memory BFGS, the leading method reported to date. We report results for both exact and inexact inference techniques.
Article
Purpose
The purpose of this paper is to explore the dark side of news community forums: the proliferation of opinion manipulation trolls. In particular, it explores the idea that a user who is called a troll by several people is likely to be one. It further demonstrates the utility of this idea for detecting accused and paid opinion manipulation trolls and their comments, as well as for predicting the credibility of comments in news community forums.
Design/methodology/approach
The authors aim to build a classifier to distinguish trolls from regular users. Unfortunately, it is not easy to get reliable training data. The authors solve this issue pragmatically: they assume that a user who is called a troll by several people is likely to be one; such users are called accused trolls. Based on this assumption and on leaked reports about actual paid opinion manipulation trolls, the authors build a classifier to distinguish trolls from regular users.
Findings
The authors compare the profiles of paid trolls vs accused trolls vs non-trolls, and show that a classifier trained to distinguish accused trolls from non-trolls does quite well also at telling apart paid trolls from non-trolls.
Research limitations/implications
The troll detection works even for users with about 10 comments, but it achieves the best performance for users with a sizable number of comments in the forum, e.g. 100 or more. There is no such limitation for troll comment detection.
Practical implications
The approach would help forum moderators in their work by pointing them to the most suspicious users and comments. It would also be useful to investigative journalists who want to find paid opinion manipulation trolls.
Social implications
The approach can offer a better experience to online users by filtering out opinion manipulation trolls and their comments.
Originality/value
The authors propose a novel approach for finding paid opinion manipulation trolls and their posts.
Article
Question retrieval, which aims to find similar versions of a given question, is playing a pivotal role in various question answering (QA) systems. This task is quite challenging, mainly in regard to five aspects: synonymy, polysemy, word order, question length, and data sparsity. In this article, we propose a unified framework to simultaneously handle these five problems. We use words combined with corresponding concept information to handle the synonymy and polysemy problems. Concept embedding and word embedding are learned at the same time from both the context-dependent and context-independent views. To handle the word-order problem, we propose a high-level feature-embedded convolutional semantic model to learn question embedding by inputting concept embedding and word embedding. Because some questions are long, we propose a value-based convolutional attentional method to enhance the proposed high-level feature-embedded convolutional semantic model in learning the key parts of the question and the answer. The proposed high-level feature-embedded convolutional semantic model nicely represents the hierarchical structures of word information and concept information in sentences with their layer-by-layer convolution and pooling. Finally, to resolve data sparsity, we propose using the multi-view learning method to train the attention-based convolutional semantic model on question–answer pairs. To the best of our knowledge, we are the first to propose simultaneously handling the above five problems in question retrieval using one framework. Experiments on three real question-answering datasets show that the proposed framework significantly outperforms the state-of-the-art solutions.
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
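The mechanism fits in a few lines. The sketch below uses the "inverted" formulation, which rescales the kept units during training so that test time is a plain forward pass; it is equivalent to the weight-scaling scheme described in the abstract:

```python
import numpy as np

def dropout(x, p_drop, rng, train=True):
    """Inverted dropout: scale at train time so test time needs no scaling."""
    if not train or p_drop == 0.0:
        return x                           # single unthinned network at test time
    mask = rng.random(x.shape) >= p_drop   # sample one "thinned" network
    return x * mask / (1.0 - p_drop)       # rescale the kept units

rng = np.random.default_rng(42)
h = np.ones((2, 8))
print(dropout(h, 0.5, rng))                # ~half the units zeroed, rest at 2.0
print(dropout(h, 0.5, rng, train=False))   # unchanged at test time
```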
Article
In this paper, we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to its bidirectional LSTM component. It can also use sentence-level tag information thanks to the CRF layer. The BI-LSTM-CRF model can produce state-of-the-art (or close to it) accuracy on POS, chunking and NER data sets. In addition, it is robust and depends less on word embeddings than previously reported models.
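The decoding contributed by the CRF layer is standard Viterbi over per-position emission scores (which the BiLSTM would produce) plus a tag-transition matrix. A self-contained sketch with random stand-in scores:

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-position tag scores; transitions: (K, K)."""
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions   # (prev tag, cur tag) scores
        backptr[t] = cand.argmax(axis=0)      # best previous tag per current tag
        score = cand.max(axis=0) + emissions[t]
    tags = [int(score.argmax())]              # backtrack from the best final tag
    for t in range(T - 1, 0, -1):
        tags.append(int(backptr[t][tags[-1]]))
    return tags[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```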
Conference Paper
Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic or semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem, at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train, and, implemented as an efficient reranker, it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information, such as PP attachments.
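The core composition step of such recursive models is short to state: a parent phrase vector is a nonlinearity applied to a learned linear map of its two children. The CVG additionally unties the matrix by the children's syntactic categories, which this placeholder sketch omits:

```python
import numpy as np

# Recursive composition with a single shared matrix (the CVG would select
# W by the children's syntactic categories). Parameters are placeholders.
rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, 2 * d)) * 0.1
b = np.zeros(d)

def compose(left, right):
    """Parent vector p = tanh(W [left; right] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

the, cat = rng.normal(size=d), rng.normal(size=d)
np_vec = compose(the, cat)   # vector for the phrase "the cat"
print(np_vec)
```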
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has modest memory requirements, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when compared experimentally to other stochastic optimization methods.
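The update rule is compact enough to transcribe directly. The sketch below is a plain NumPy version of the Adam step (exponential moment estimates plus bias correction), run on a toy quadratic; the moment hyper-parameters are the paper's defaults, while the learning rate here is chosen for the example:

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: first/second moment estimates with bias correction."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad        # 1st moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad**2     # 2nd moment
    m_hat = state["m"] / (1 - b1 ** state["t"])           # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Minimize f(x) = x^2 starting from x = 5; grad f = 2x.
theta = np.array([5.0])
state = {"t": 0, "m": np.zeros(1), "v": np.zeros(1)}
for _ in range(2000):
    theta = adam_step(theta, 2 * theta, state, lr=0.05)
print(theta)   # close to 0
```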
Article
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
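For contrast with Adam above, the ADADELTA step keeps decaying averages of squared gradients and squared updates and requires no learning rate at all; a NumPy sketch on the same toy quadratic, with the paper's rho and eps values:

```python
import numpy as np

def adadelta_step(theta, grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA update; only the decay rho and eps, no learning rate."""
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad**2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta**2
    return theta + delta

theta = np.array([5.0])
state = {"Eg2": np.zeros(1), "Edx2": np.zeros(1)}
for _ in range(5000):
    theta = adadelta_step(theta, 2 * theta, state)
print(theta)   # approaches 0, with per-dimension effective step sizes
```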
Conference Paper
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
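For the linear-chain case, the central computation is the partition function, which the forward algorithm obtains in O(T K^2) time instead of summing over all K^T label sequences. The sketch below uses random stand-in emission and transition scores:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(emissions, transitions):
    """Forward algorithm. emissions: (T, K) log-potentials; transitions: (K, K)."""
    alpha = emissions[0].copy()
    for t in range(1, emissions.shape[0]):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return logsumexp(alpha)

def sequence_logprob(tags, emissions, transitions):
    """log p(y|x) = score(y) - log Z(x)."""
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score - log_partition(emissions, transitions)

rng = np.random.default_rng(1)
E, A = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
print(np.exp(sequence_logprob([0, 2, 1, 1], E, A)))
```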
Article
The Meteor Automatic Metric for Machine Translation evaluation, originally developed and released in 2004, was designed with the explicit goal of producing sentence-level scores which correlate well with human judgments of translation quality. Several key design decisions were incorporated into Meteor in support of this goal. In contrast with IBM’s Bleu, which uses only precision-based features, Meteor uses and emphasizes recall in addition to precision, a property that has been confirmed by several metrics as being critical for high correlation with human judgments. Meteor also addresses the problem of reference translation variability by utilizing flexible word matching, allowing for morphological variants and synonyms to be taken into account as legitimate correspondences. Furthermore, the feature ingredients within Meteor are parameterized, allowing for the tuning of the metric’s free parameters in search of values that result in optimal correlation with human judgments. Optimal parameters can be separately tuned for different types of human judgments and for different languages. We discuss the initial design of the Meteor metric, subsequent improvements, and performance in several independent evaluations in recent years.
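The overall scoring shape can be sketched as follows, with the commonly cited defaults for the free parameters; real Meteor also matches stems, synonyms, and paraphrases before computing these counts, and retunes the parameters per language and judgment type:

```python
# Simplified Meteor-style sentence score (illustrative parameter values).
def meteor_like(matches, hyp_len, ref_len, chunks,
                alpha=0.9, beta=3.0, gamma=0.5):
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # recall-weighted harmonic mean (alpha near 1 emphasizes recall)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / matches) ** beta   # fragmentation penalty
    return f_mean * (1.0 - penalty)

print(meteor_like(matches=6, hyp_len=8, ref_len=7, chunks=2))
```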
Article
re the eliefs") while in the physics setting mean eld approximations may be used to predict the macroscopic behavior of the system (e.g. critical temperature, correlation lengths etc.). I will therefore start this chapter reformulating both methods in a common language. I will then show simulation results on some toy problems. I tried to set up the toy problems such that they are small enough to make exact inference possible yet capture some of the qualities of the MRFs that occur in vision applications. I hope more head to head comparisons will be done in dierent graphs with dierent parameter regimes. Figure 1: An example of a MRF with pairwise potentials. The hidden nodes x i are denoted with open circles and observed nodes y i are denoted with lled circles. 1 The setting: undirected graphs with pairwise potentials I will assume we are dealing with pairwise Markov Random Fields (MRFs) of the form shown
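The sum-product message updates for such pairwise MRFs fit in a short sketch; on the toy chain below the resulting beliefs are exact marginals, while on loopy graphs the same updates yield the approximations the chapter compares against mean field. All potentials are invented:

```python
import numpy as np

# Sum-product belief propagation on a pairwise MRF: unary potentials
# psi_i(x_i) and pairwise potentials psi_ij(x_i, x_j). Toy 3-node chain,
# binary states; on a tree these beliefs are the exact marginals.
psi = {0: np.array([0.7, 0.3]), 1: np.array([0.5, 0.5]), 2: np.array([0.2, 0.8])}
psi_pair = np.array([[0.9, 0.1], [0.1, 0.9]])   # smoothness-style coupling
edges = [(0, 1), (1, 2)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

msgs = {(i, j): np.ones(2) for a, b in edges for i, j in [(a, b), (b, a)]}
for _ in range(10):                        # iterate to (or toward) a fixed point
    new = {}
    for i, j in msgs:
        prod = psi[i].copy()
        for k in neighbors[i]:
            if k != j:
                prod *= msgs[(k, i)]
        m = psi_pair.T @ prod              # sum over x_i
        new[(i, j)] = m / m.sum()          # normalize for numerical stability
    msgs = new

for i in psi:                              # beliefs approximate the marginals
    b = psi[i].copy()
    for k in neighbors[i]:
        b *= msgs[(k, i)]
    print(i, b / b.sum())
```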
An exploration of data augmentation and RNN architectures for question ranking in community question answering
  • Charles Chen
  • Razvan Bunescu
Charles Chen and Razvan Bunescu. 2017. An exploration of data augmentation and RNN architectures for question ranking in community question answering. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP '17, pages 442-447, Taipei, Taiwan.
Automatic evaluation of machine translation quality using n-gram cooccurrence statistics
  • George Doddington
George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram cooccurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, pages 138-145, San Francisco, California, USA.
A review on deep learning techniques applied to answer selection
  • Tuan Manh Lai
  • Trung Bui
  • Sheng Li
Tuan Manh Lai, Trung Bui, and Sheng Li. 2018. A review on deep learning techniques applied to answer selection. In Proceedings of the 27th International Conference on Computational Linguistics, COLING '18, pages 2132-2144, Santa Fe, New Mexico, USA.
Fact checking in community forums
  • Tsvetomila Mihaylova
  • Preslav Nakov
  • Lluís Màrquez
  • Alberto Barrón-Cedeño
  • Mitra Mohtarami
  • Georgi Karadjov
  • James Glass
Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, Mitra Mohtarami, Georgi Karadjov, and James Glass. 2018. Fact checking in community forums. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI '18, pages 879-886, New Orleans, Louisiana, USA.
Linguistic regularities in continuous space word representations
  • Tomas Mikolov
  • Wen-tau Yih
  • Geoffrey Zweig
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '13, pages 746-751, Atlanta, Georgia, USA.
Loopy belief propagation for approximate inference: An empirical study
  • Kevin P Murphy
  • Yair Weiss
  • Michael I Jordan
Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, pages 467-475, Stockholm, Sweden.
SemEval-2016 task 3: Community question answering
  • Preslav Nakov
  • Lluís Màrquez
  • Alessandro Moschitti
  • Walid Magdy
  • Hamdy Mubarak
  • Abed Alhakim Freihat
  • James Glass
  • Bilal Randeree
Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, James Glass, and Bilal Randeree. 2016b. SemEval-2016 task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval '16, pages 525-545, San Diego, California, USA.
BLEU: a method for automatic evaluation of machine translation
  • Kishore Papineni
  • Salim Roukos
  • Todd Ward
  • Wei-Jing Zhu
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 311-318, Philadelphia, Pennsylvania, USA.
Convolutional neural tensor network architecture for community-based question answering
  • Xipeng Qiu
  • Xuanjing Huang
Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI '15, pages 1305-1311, Buenos Aires, Argentina.
Learning hybrid representations to retrieve semantically equivalent questions
  • Cicero dos Santos
  • Luciano Barbosa
  • Dasha Bogdanova
  • Bianca Zadrozny
Cicero dos Santos, Luciano Barbosa, Dasha Bogdanova, and Bianca Zadrozny. 2015. Learning hybrid representations to retrieve semantically equivalent questions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL-IJCNLP '15, pages 694-699, Beijing, China.
Adversarial domain adaptation for duplicate question detection
  • Darsh J. Shah
  • Tao Lei
  • Alessandro Moschitti
  • Salvatore Romeo
  • Preslav Nakov
Darsh J Shah, Tao Lei, Alessandro Moschitti, Salvatore Romeo, and Preslav Nakov. 2018. Adversarial domain adaptation for duplicate question detection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '18, Brussels, Belgium.
A study of translation edit rate with targeted human annotation
  • Matthew Snover
  • Bonnie Dorr
  • Richard Schwartz
  • Linnea Micciulla
  • John Makhoul
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas, AMTA '06, pages 223-231, Cambridge, Massachusetts, USA.
LSTM-based deep learning models for non-factoid answer selection
  • Ming Tan
  • Bing Xiang
  • Bowen Zhou
Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108.
Answer sequence learning with neural networks for answer selection in community question answering
  • Xiaoqiang Zhou
  • Baotian Hu
  • Qingcai Chen
  • Buzhou Tang
  • Xiaolong Wang
Xiaoqiang Zhou, Baotian Hu, Qingcai Chen, Buzhou Tang, and Xiaolong Wang. 2015. Answer sequence learning with neural networks for answer selection in community question answering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, ACL-IJCNLP '15, pages 713-718, Beijing, China.