
Question retrieval using combined queries in community question answering

Authors: Saquib Khushhal, Abdul Majid, Syed Ali Abbas, Malik Sajjad Ahmed Nadeem, Saeed Arif Shah

Abstract and Figures

Community question answering (cQA) has emerged as a popular service on the web; users can use it to ask and answer questions and access historical question-answer (QA) pairs. cQA retrieval, as an alternative to general web search, has several advantages. First, users can register a query in the form of natural language sentences instead of a set of keywords; thus, they can present the required information more clearly and comprehensively. Second, the system returns several possible answers instead of a long list of ranked documents, thereby helping users locate the desired answers efficiently. Question retrieval from a cQA archive, an essential function of cQA retrieval services, aims to retrieve historical QA pairs relevant to the query question. In this study, combined queries (combined inverted and nextword indexes) are proposed for question retrieval in cQA. The method's performance is investigated in two scenarios: (a) when only the questions from QA pairs are used as documents, and (b) when whole QA pairs are used as documents. In the proposed method, combined indexes are first created for both queries and documents; then, different information retrieval (IR) models are used to retrieve relevant questions from the cQA archive. Evaluation is performed on a public Yahoo! Answers dataset; the results show that using combined queries with all three IR models (vector space model, Okapi model, and language model) improves retrieval precision and ranking effectiveness. Notably, when combined indexes are used and whole QA pairs serve as documents, the retrieval and ranking effectiveness of these cQA retrieval models increases significantly.
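To illustrate the idea behind combined queries, the sketch below builds a combined index whose terms are single words (inverted index) plus adjacent word pairs (nextword index), and then scores documents with Okapi BM25 over that combined vocabulary. This is a minimal, generic reconstruction rather than the authors' implementation; the tokenization, the pair-term encoding, and the BM25 parameters are assumptions.

```python
from collections import defaultdict, Counter
import math

def combined_terms(tokens):
    # terms for a combined index: single words (inverted index)
    # plus adjacent word pairs (nextword index)
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def build_combined_index(docs):
    # docs: dict of doc_id -> token list (question only, or question + answer text)
    index, doc_len = defaultdict(dict), {}
    for doc_id, tokens in docs.items():
        terms = combined_terms(tokens)
        doc_len[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            index[term][doc_id] = tf
    return index, doc_len

def bm25_search(query_tokens, index, doc_len, k1=1.2, b=0.75):
    # Okapi BM25 over the combined vocabulary; k1 and b are typical defaults
    N = len(doc_len)
    avgdl = sum(doc_len.values()) / N
    scores = defaultdict(float)
    for term in combined_terms(query_tokens):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + (N - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda s: -s[1])

# usage: index questions (or QA pairs), then query with a natural-language question
docs = {"q1": "how do i install python on windows".split(),
        "q2": "best way to learn python".split()}
index, doc_len = build_combined_index(docs)
print(bm25_search("install python windows".split(), index, doc_len))
```

The same combined-term representation can be plugged into a vector space model or a query-likelihood language model by treating word-pair terms as ordinary vocabulary entries.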
https://doi.org/10.1007/s10844-020-00612-x
Question retrieval using combined queries in community question answering
Saquib Khushhal · Abdul Majid · Syed Ali Abbas · Malik Sajjad Ahmed Nadeem · Saeed Arif Shah
Received: 25 September 2019 / Revised: 25 June 2020 / Accepted: 13 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Keywords: Question retrieval · Community question-answering services · Combined queries · Combined indexes · Inverted index
Corresponding author: Abdul Majid (majid@ajku.edu.pk)
Extended author information available on the last page of the article.
Published online: 24 July 2020
Journal of Intelligent Information Systems (2020) 55:307–327
... The main function of the Pinyin language model is to correct inaccuracies in the query sentences entered by the user, and the main function of the script is to correct incorrect sentences and other problems in the sentences input by the user [7]. The question analysis module is an integral part of the intelligent question answering system. ...
Article
Full-text available
The elliptic equation intelligent question answering system is an intelligent device that combines natural language processing and data retrieval technology. Knowledge domains used for different purposes are divided into open domains and restricted domains. In order to simplify the existing knowledge of English intelligent answering and elliptic algebra and improve the efficiency of the questionnaire, a similarity algorithm is designed and implemented in the English intelligent answering process, and the method of distance ellipse switching is adopted. The experimental results show that the accuracy of the questionnaire in answering the questions described in this paper is 17% higher than that of general research answers, which confirms the validity of the questionnaire algorithm mentioned in this paper. The system not only provides accurate answers to English questions for primary school students but also plays an important role in the cultivation of English question-and-answer skills in primary and secondary schools.
... Also, unlike conventional collections, which are created by requiring the user to invent a question for given information, this type of collection has the advantage of being created naturally by users under real questioning conditions. The main difference between Community QA data sets and conventional ones is that the answers in Community QA collections should be considered candidate answers (Bae & Ko, 2019; Khushhal et al., 2020). They are usually not validated by certified experts and carry a score or a ranked order derived from users' votes. ...
Article
Full-text available
Question Answering (QA) is a field of study that aims to develop automatic methods for answering questions expressed in natural language. Recently, the emergence of a new generation of intelligent assistants, such as Siri, Alexa, and Google Assistant, has intensified the need for effective and efficient QA systems able to handle questions of different complexities. Regarding the type of question to be answered, QA systems have been divided into two sub-areas: (i) factoid questions, which require a single fact, e.g., the name of a person or a date, and (ii) non-factoid questions, which need a more complex answer, e.g., descriptions, opinions, or explanations. While factoid QA systems have surpassed human performance on some benchmarks, automatic systems for answering non-factoid questions remain a challenge and an open research problem. This work provides an overview of recent research addressing non-factoid questions. It focuses on the methods applied to each task, the data sets available, challenges and limitations, and possible research directions. From a total of 455 recent studies, we selected 75 papers for in-depth analysis based on our quality control system and exclusion criteria. This systematic review helped to answer what tasks and methods are involved in non-factoid QA, what data sets are available, what the limitations are, and what the recommendations for future research are.
... One of the long-standing goals of natural language processing (NLP) is to build systems capable of reasoning about the information present in text. Tasks requiring reasoning include question answering (QA) (Chen et al. 2017a; Xiong et al. 2016; Khushhal et al. 2020), machine reading comprehension (MRC) (Cui et al. 2017; Huang et al. 2017), dialogue systems (Huang et al. 2020; Chen et al. 2017b), and sentiment analysis (Liu and Shen 2020). Reading comprehension and question answering (Dimitrakis et al. 2020), which aim to answer questions about a document, have recently become a major focus of NLP research. ...
Article
Full-text available
In recent years, question answering (QA) and reading comprehension (RC) have attracted much attention, and most research on QA has focused on the multi-hop QA task, which requires connecting multiple pieces of evidence scattered in a long context to answer a question. The key to the multi-hop QA task is semantic feature interaction between documents and questions, which is widely handled by Bi-directional Attention Flow (Bi-DAF). However, Bi-DAF generally captures only the surface semantics of words in complex questions and fails to capture the implied semantic features of intermediate answers; it also ignores the parts of the context related to the question and fails to extract the most important parts of multiple documents. In this paper, we propose a new model architecture for multi-hop question answering by applying two strategies: (1) a Coarse-Grained complex question Decomposition (CGDe) strategy is introduced to decompose complex questions into simple ones without any additional annotations; (2) a Fine-Grained Interaction (FGIn) strategy is introduced to explicitly represent each word in the documents and extract more comprehensive and accurate sentences related to the inference path. The two strategies are combined and tested on the SQuAD and HotpotQA datasets, and the experimental results show that our method outperforms state-of-the-art baselines.
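For readers unfamiliar with Bi-DAF, the sketch below shows the standard bi-directional attention flow computation this abstract builds on: a similarity matrix between context and question words, context-to-query and query-to-context attention, and the concatenated query-aware representation. It is a plain NumPy illustration of the published Bi-DAF formulation, not the CGDe/FGIn model itself; the weight vector is random for demonstration.

```python
import numpy as np

def bidaf_attention(H, U, w):
    # H: (T, d) context word vectors, U: (J, d) question word vectors,
    # w: (3*d,) weight vector of the similarity function (random here for illustration)
    T, J = H.shape[0], U.shape[0]
    # similarity matrix S[t, j] = w . [h_t ; u_j ; h_t * u_j]
    S = np.array([[w @ np.concatenate([H[t], U[j], H[t] * U[j]])
                   for j in range(J)] for t in range(T)])
    # context-to-query attention: each context word attends over question words
    a = np.exp(S - S.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    U_tilde = a @ U                                   # (T, d)
    # query-to-context attention: attend over context words, then tile over T
    m = S.max(axis=1)
    b = np.exp(m - m.max())
    b /= b.sum()
    H_tilde = np.tile(b @ H, (T, 1))                  # (T, d)
    # query-aware representation of each context word, (T, 4*d)
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)

# toy usage with random vectors
rng = np.random.default_rng(0)
H, U = rng.normal(size=(6, 8)), rng.normal(size=(4, 8))
print(bidaf_attention(H, U, rng.normal(size=24)).shape)  # (6, 32)
```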
Chapter
Full-text available
Community Question Answering (cQA) platforms allow users to post their questions, expecting other users to provide answers. We focus on the task of question retrieval in cQA, which aims to retrieve previous questions that are similar to new queries. The past answers associated with the similar questions can therefore be used to respond to the new queries. The major challenges in this task are the shortness of the questions and the word mismatch problem, as users can formulate the same query using different wording. Although question retrieval has been widely studied over the years, it has received less attention in Arabic and still requires a non-trivial endeavour. In this paper, we focus on this task in both Arabic and English. We propose to use word embeddings, which can capture semantic and syntactic information from contexts, to vectorize the questions. In order to obtain longer sequences, questions are expanded with words having close word vectors. The embedding vectors are fed into a Siamese LSTM model to consider the global context of the questions. The similarity between questions is measured using the Manhattan distance. Experiments on a real-world Yahoo! Answers dataset show the efficiency of the method in Arabic and English.
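A minimal sketch of a Siamese LSTM with Manhattan-distance similarity of the kind described above (often called MaLSTM) is given below, assuming PyTorch. The embedding and hidden dimensions are placeholders, and the word-embedding question expansion step is omitted; pretrained word vectors would normally be loaded into the embedding layer.

```python
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    # shared LSTM encoder for both questions; similarity = exp(-L1 distance)
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        # token_ids: (batch, seq_len) padded word indices
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                                            # (batch, hidden_dim)

    def forward(self, q1_ids, q2_ids):
        h1, h2 = self.encode(q1_ids), self.encode(q2_ids)
        return torch.exp(-torch.sum(torch.abs(h1 - h2), dim=1))  # similarity in (0, 1]

# toy usage: two padded batches of word indices
model = SiameseLSTM(vocab_size=10_000)
q1 = torch.randint(1, 10_000, (2, 12))
q2 = torch.randint(1, 10_000, (2, 12))
print(model(q1, q2))  # one similarity score per question pair
```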
Article
Full-text available
Community Question Answering (CQA) services have evolved into a popular way of online information seeking, where users can interact and exchange knowledge in the form of questions and answers. In this paper, we study the problem of finding historical questions that are semantically equivalent to the queried ones, assuming that the answers to similar questions should also answer the new ones. The major challenge of question retrieval is the word mismatch problem between questions, as users can formulate the same question using different wording. Most existing methods measure the similarity between questions based on bag-of-words (BOW) representations, which capture no semantics between words. Therefore, this study proposes to use word embeddings, which can capture semantic and syntactic information from contexts, to vectorize the questions. The questions are clustered using K-means to speed up the search and ranking tasks. The similarity between questions is measured using cosine similarity based on their weighted continuous-valued vectors. We run our experiments on real-world data from Yahoo! Answers in English and Arabic to show the efficiency and generality of the proposed method.
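The sketch below illustrates the pipeline that abstract describes: questions are turned into weighted averages of word embeddings, clustered with K-means to narrow the search, and ranked by cosine similarity within the query's cluster. It assumes scikit-learn and a pre-loaded dictionary of word vectors; the idf weighting and cluster count are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def question_vector(tokens, word_vectors, idf, dim=300):
    # idf-weighted average of word embeddings; zero vector if no word is known
    vecs = [idf.get(t, 1.0) * word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_clusters(question_vecs, n_clusters=100):
    # cluster the archive once, so each query is matched only within its cluster
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(question_vecs)

def retrieve(query_vec, km, question_vecs, top_k=10):
    # question_vecs: (num_questions, dim) array used to fit km
    cluster = km.predict(query_vec.reshape(1, -1))[0]
    candidates = np.where(km.labels_ == cluster)[0]
    sims = cosine_similarity(query_vec.reshape(1, -1), question_vecs[candidates])[0]
    return candidates[np.argsort(-sims)][:top_k]   # indices of the best-matching questions
```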
Article
Full-text available
Automatically classifying unlabeled questions into predefined categories (or topics) and effectively retrieving similar questions are crucial aspects of an effective cQA service. We first address the problems associated with estimating and utilizing the distribution of word weights in each category. We then apply an automatic expansion-word generation technique, based on our proposed weighting method and pseudo-relevance feedback, to question classification. Second, to address the lexical gap problem in question retrieval, the case frame of a sentence is first defined using the extracted components of the sentence, and a similarity measure based on the case frame and word embeddings is then derived to determine the similarity between two sentences. These similarities are then used to reorder the results of the first retrieval model. Consequently, the proposed methods significantly improve the performance of question classification and retrieval.
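Case-frame extraction is specific to that paper, but the final reranking step, reordering the first-stage retrieval results by an embedding-based sentence similarity, can be sketched generically as below. The interpolation weight and the mean-of-word-vectors sentence representation are assumptions; the paper's actual measure combines case-frame matching with word embeddings.

```python
import numpy as np

def sentence_vec(tokens, word_vectors, dim=300):
    # simple sentence representation: mean of known word vectors
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rerank(query_tokens, first_stage, docs, word_vectors, alpha=0.7):
    # first_stage: list of (doc_id, score) from the initial retrieval model
    qv = sentence_vec(query_tokens, word_vectors)
    max_score = max(score for _, score in first_stage) or 1.0
    reranked = []
    for doc_id, score in first_stage:
        dv = sentence_vec(docs[doc_id], word_vectors)
        denom = np.linalg.norm(qv) * np.linalg.norm(dv)
        cos = float(qv @ dv / denom) if denom else 0.0
        # interpolate the normalized first-stage score with the embedding similarity
        reranked.append((doc_id, alpha * score / max_score + (1 - alpha) * cos))
    return sorted(reranked, key=lambda item: -item[1])
```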
Conference Paper
We study the problem of question retrieval in community question answering (CQA). The biggest challenge in this task is the lexical gap between questions, since similar questions are usually expressed with different but semantically related words. To bridge the gap, state-of-the-art methods incorporate extra information, such as word-to-word translation and question categories, into traditional language models. We find that the existing language-model-based methods can be interpreted within a new framework: they represent words and question categories in a vector space and calculate question-question similarities with a linear combination of dot products of the vectors. The problem is that these methods are either heuristic in their data representation or difficult to scale up. We propose a principled and efficient approach to learning representations of data in CQA. In our method, we simultaneously learn vectors of words and vectors of question categories by optimizing an objective function naturally derived from the framework. In question retrieval, we incorporate the learnt representations into traditional language models in an effective and efficient way. We conduct experiments on large-scale data from Yahoo! Answers and Baidu Knows and compare our method with state-of-the-art methods on two public data sets. Experimental results show that our method can significantly improve on baseline methods in retrieval relevance. On 1 million training examples, our method takes less than 50 minutes to learn a model on a single multicore machine, while the translation-based language model needs more than 2 days to learn a translation table on the same machine.
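The framework described in that abstract, question-question similarity as a linear combination of dot products of word vectors and category vectors, can be written compactly as in the sketch below. The vector tables and the mixing weight are placeholders; in the paper they are learned by optimizing the stated objective rather than fixed by hand.

```python
import numpy as np

def question_similarity(q1_tokens, q2_tokens, cat1, cat2, word_vecs, cat_vecs, alpha=0.8):
    # word-level term: sum of dot products over all word pairs from the two questions
    word_term = sum(word_vecs[w1] @ word_vecs[w2]
                    for w1 in q1_tokens if w1 in word_vecs
                    for w2 in q2_tokens if w2 in word_vecs)
    # category-level term: dot product of the two questions' category vectors
    cat_term = cat_vecs[cat1] @ cat_vecs[cat2]
    return alpha * word_term + (1 - alpha) * cat_term
```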
Conference Paper
Community Question Answering (CQA) sites, such as Yahoo! Answers and Stack Overflow, are knowledge-sharing platforms that allow users to post questions and answer questions asked by other users. One characteristic of CQA is the time lag between questions and answers. To reduce this time lag, various approaches have been suggested, such as question retrieval and question routing. In this paper, we propose a weighted question retrieval model that uses question titles, question descriptions, and their relationship to calculate question similarity in large-scale CQA archives. In our experiments, the weighted question retrieval model outperforms a baseline that uses only question titles, and we found that exploiting the question descriptions increases the ranks of the relevant questions while reducing their recall compared with the baseline.
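A minimal sketch of a weighted title/description retrieval score is shown below, using plain cosine similarity over term counts. The weight value and the similarity function are illustrative assumptions, not the model actually proposed in that paper.

```python
from collections import Counter
import math

def cosine(a_tokens, b_tokens):
    # cosine similarity between two token lists based on raw term frequencies
    a, b = Counter(a_tokens), Counter(b_tokens)
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def weighted_question_score(query_tokens, title_tokens, desc_tokens, w=0.7):
    # weight the title similarity more heavily than the description similarity
    return w * cosine(query_tokens, title_tokens) + (1 - w) * cosine(query_tokens, desc_tokens)
```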
Conference Paper
This paper studies the problem of question retrieval in community question answering (CQA). To bridge lexical gaps in questions, which is regarded as the biggest challenge in retrieval, state-of-the-art methods learn translation models using answers under the assumption that they are parallel texts. In practice, however, questions and answers are far from parallel. Indeed, they are heterogeneous at both the literal level and the level of user behaviors. There is a particularly large number of low-quality answers, to which the performance of translation models is vulnerable. To address these problems, we propose a supervised question-answer topic modeling approach. The approach assumes that questions and answers share some common latent topics and are generated in a "question language" and an "answer language", respectively, following the topics. The topics also determine an answer quality signal. Compared with translation models, our approach not only comprehensively models user behaviors on CQA portals but also highlights the inherent heterogeneity of questions and answers. More importantly, it takes answer quality into account and performs robustly against noise in answers. With the topic modeling approach, we propose a topic-based language model, which matches questions not only on a term level but also on a topic level. We conducted experiments on large-scale data from Yahoo! Answers and Baidu Knows. Experimental results show that the proposed model can significantly outperform state-of-the-art retrieval models in CQA.
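As a rough illustration of matching "not only on a term level but also on a topic level", the sketch below mixes a maximum-likelihood term model with a topic-level model inside a smoothed query-likelihood score. The topic and collection probabilities are passed in as callables (assumed to return nonzero values for query words), and the mixture weights are placeholders; the paper's supervised question-answer topic model is not reproduced here.

```python
import math
from collections import Counter

def topic_lm_score(query_tokens, doc_tokens, p_topic, p_coll, lam=0.7, beta=0.8):
    # p_topic(w, doc_tokens): topic-level probability of w given the document's topics
    # p_coll(w): collection (background) probability of w
    tf, dl = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for w in query_tokens:
        p_term = tf[w] / dl if dl else 0.0                            # term-level (maximum likelihood)
        p_doc = beta * p_term + (1 - beta) * p_topic(w, doc_tokens)   # mix term and topic level
        score += math.log(lam * p_doc + (1 - lam) * p_coll(w))        # smooth with the collection model
    return score
```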