Article

Abstract

Question retrieval in current community-based question answering (CQA) services does not, in general, work well for long and complex queries. One of the main difficulties lies in the word mismatch between queries and candidate questions. Existing solutions try to expand the queries at the word level, but they usually fail to consider concept-level enrichment. In this paper, we explore a pivot-language translation-based approach to derive the paraphrases of key concepts. We further propose a unified question retrieval model which integrates the key concepts and their paraphrases for the query question. Experimental results demonstrate that the paraphrase-enhanced retrieval model significantly outperforms the state-of-the-art models in question retrieval.
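
The abstract does not spell out how paraphrases are scored. As a rough illustration only, one widely used formulation of pivot-language paraphrasing scores a candidate paraphrase e2 of a phrase e1 by summing P(f | e1) * P(e2 | f) over pivot phrases f. The sketch below assumes this formulation; the phrase tables, their probabilities, and the function name `pivot_paraphrases` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Toy phrase-translation tables (hypothetical numbers, for illustration only).
# TO_PIVOT[e][f] = P(f | e): source phrase e -> pivot-language phrase f
TO_PIVOT = {
    "laptop": {"ordinateur portable": 0.8, "portable": 0.2},
}
# FROM_PIVOT[f][e] = P(e | f): pivot-language phrase f -> source phrase e
FROM_PIVOT = {
    "ordinateur portable": {"laptop": 0.6, "notebook computer": 0.4},
    "portable": {"laptop": 0.3, "cell phone": 0.5, "portable": 0.2},
}


def pivot_paraphrases(phrase, to_pivot, from_pivot):
    """Score each paraphrase e2 of `phrase` as sum_f P(f | phrase) * P(e2 | f)."""
    scores = defaultdict(float)
    for pivot, p_forward in to_pivot.get(phrase, {}).items():
        for candidate, p_backward in from_pivot.get(pivot, {}).items():
            if candidate != phrase:  # translating back to the input is not a paraphrase
                scores[candidate] += p_forward * p_backward
    # Highest-scoring paraphrases first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for candidate, score in pivot_paraphrases("laptop", TO_PIVOT, FROM_PIVOT):
        print(f"{candidate}: {score:.2f}")
```

In a retrieval setting, the top-scoring paraphrases of each key concept would then be added to the query representation; how they are weighted against the original concepts is a modeling decision this sketch leaves open.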


... past decades [22,5]. Among various challenges in automatic QA systems, retrieval of similar questions [9,33,32,28,29] is attracting much attention for answering frequently asked questions from users. However, this task is quite challenging on three aspects: ...
... The second group is the topic-based model [10] and supervised question-answer topic model [28]. The third group is the paraphrasing based model [29]. ...
... Besides, some other studies model the semantic relations between questions and answers with topic model [10,28] or key concept paraphrasing based on language translation [29]. Recently, some work [31,34] exploited the category metadata within cQA pages to further improve the performance. ...
Conference Paper
Question retrieval, which aims to find questions similar to a given question, plays a pivotal role in various question answering (QA) systems. This task is challenging mainly in three respects: lexical gap, polysemy and word order. In this paper, we propose a unified framework to simultaneously handle these three problems. We use words combined with their corresponding concept information to handle the polysemy problem. The concept embedding and word embedding are learned at the same time from both context-dependent and context-independent views. The lexical gap problem is handled since the semantic information has been encoded into the embedding. Then, we propose to use a high-level feature embedded convolutional semantic model to learn the question embedding by inputting the concept embedding and word embedding without manually labeled training data. The proposed framework nicely represents the hierarchical structures of word information and concept information in sentences with their layer-by-layer composition and pooling. Finally, the framework is trained in a weakly-supervised manner on question-answer pairs, which can be directly obtained without manual labeling. Experiments on two real question answering datasets show that the proposed framework can significantly outperform the state-of-the-art solutions.
... However, most of the older research articles and documents on Web pages do not include keyphrases, and manually assigning keyphrases to each document has become impractical because it is too time-consuming. Besides information retrieval (Jones and Staveley 1999; Luo et al. 2015), keyphrases are very useful for many natural language processing (NLP) tasks, such as text summarization (Qazvinian et al. 2010), question answering (Zhang et al. 2015; Tang et al. 2017) and other information processing tasks (Nedjah et al. 2017; Din et al. 2018; Plageras et al. 2018; Gupta 2018). Thus, automatic keyphrase extraction techniques have attracted research attention. ...
Article
Unsupervised random-walk keyphrase extraction models mainly rely on global structural information of the word graph, with nodes representing candidate words and edges capturing the co-occurrence information between candidate words. However, using word embedding methods to integrate multiple kinds of useful information into the random-walk model to help better extract keyphrases is relatively unexplored. In this paper, we propose a random-walk-based ranking method to extract keyphrases from text documents using word embeddings. Specifically, we first design a heterogeneous text graph embedding model to integrate local context information of the word graph (i.e., the local word collocation patterns) with some crucial features of candidate words and edges of the word graph. Then, a novel random-walk-based ranking model is designed to score candidate words by leveraging such learned word embeddings. Finally, a new and generic similarity-based phrase scoring model using word embeddings is proposed to score phrases for selecting top-scoring phrases as keyphrases. Experimental results show that the proposed method consistently outperforms eight state-of-the-art unsupervised methods on three real datasets for keyphrase extraction.
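
To make the baseline idea concrete, the following is a minimal sketch of random-walk ranking over a word co-occurrence graph, in the style of plain PageRank. It is not the embedding-based model described in the abstract: the window size, damping factor, iteration count, and function names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=1):
    """Undirected graph linking each word to the next `window` tokens after it."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def random_walk_scores(graph, damping=0.85, iterations=50):
    """Plain PageRank: each node splits its score evenly among its neighbours."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            incoming = sum(scores[nbr] / len(graph[nbr]) for nbr in graph[node])
            updated[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


if __name__ == "__main__":
    tokens = "word graph ranking uses word graph structure for word ranking".split()
    scores = random_walk_scores(cooccurrence_graph(tokens))
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{word}: {score:.3f}")
```

Every node in this graph has at least one neighbour, so the scores remain a probability distribution across iterations. An embedding-aware variant would replace the uniform split among neighbours with edge weights derived from the learned vectors.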
... Such models are further improved by modeling domain-specific semantics of the words/phrases by discriminating named entities and noisy (unimportant) words present in the questions [21]. Recently translation models have been developed by employing efficient paraphrasing technique [29]. Translation-based approaches consistently prove their mettle yielding better performance than the traditional IR based approaches (such as, VSM, BM25 and LM) even if the problem of lexical gap persists. ...
Conference Paper
The current study presents a two-stage question retrieval approach which, in the first phase, retrieves similar questions for a given query using a deep learning based approach and, in the second phase, re-ranks the initially retrieved questions on the basis of inter-question similarities. The suggested deep learning based approach is trained using several surface features of texts, and the associated weights are pre-trained using a deep generative model for better initialization. The proposed retrieval model outperforms standard baseline question retrieval approaches. The proposed re-ranking approach performs inference over a similarity graph constructed with the initially retrieved questions and re-ranks the questions based on their similarity with other relevant questions. The suggested re-ranking approach significantly improves the precision for the retrieval task.
... The method of Zhang et al. [14] also uses a pivot language for paraphrasing. Like us, they translate one language to another, then re-translate from the translated language into the original language. ...
Conference Paper
Our legal question answering system combines legal information retrieval and textual entailment, and exploits paraphrasing and sentence-level analysis of queries and legal statutes. We have evaluated our system using the training data from the competition on legal information extraction/entailment (COLIEE)-2016. The competition focuses on the legal information processing required to answer yes/no questions from Japanese legal bar exams, and it consists of three phases: legal ad-hoc information retrieval (Phase 1), textual entailment (Phase 2), and a combination of information retrieval and textual entailment (Phase 3). Phase 1 requires the identification of Japanese civil law articles relevant to a legal bar exam query. For this phase, we have used an information retrieval approach using TF-IDF and a Ranking SVM. Phase 2 requires a decision on a yes/no answer for previously unseen queries, which we approach by comparing the approximate meanings of queries with relevant articles. Our meaning extraction process uses a selection of features based on a kind of paraphrase, coupled with a condition/conclusion/exception analysis of articles and queries. We also identify synonym relations using word embedding, and detect negation patterns from the articles. Our heuristic selection of attributes is used to build an SVM model, which provides the basis for ranking a decision on the yes/no questions. Experimental evaluation shows that our method outperforms previous methods. Our result ranked highest in Phase 3 of the COLIEE-2016 competition.
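
For readers unfamiliar with the TF-IDF component mentioned for Phase 1, here is a minimal ranking sketch. It covers only TF-IDF weighting with cosine similarity, not the Ranking SVM, and it assumes a smoothed IDF of the form log((1 + N) / (1 + df)) + 1; that smoothing choice, the toy documents, and the function names are illustrative assumptions rather than details from the system described.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per tokenized document, plus the IDF table."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption; other variants are equally common).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Document indices ordered by decreasing similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection carry no weight.
    q_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    articles = [
        ["contract", "termination", "notice"],
        ["property", "ownership", "transfer"],
        ["contract", "formation", "offer"],
    ]
    print(rank(["ownership", "transfer"], articles))
```

Ties are broken by document index so that the ordering is deterministic.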
Article
Question retrieval, which aims to find similar versions of a given question, is playing a pivotal role in various question answering (QA) systems. This task is quite challenging, mainly in regard to five aspects: synonymy, polysemy, word order, question length, and data sparsity. In this article, we propose a unified framework to simultaneously handle these five problems. We use the word combined with corresponding concept information to handle the synonymy and polysemy problems. Concept embedding and word embedding are learned at the same time from both the context-dependent and context-independent views. To handle the word-order problem, we propose a high-level feature-embedded convolutional semantic model to learn question embedding by inputting concept embedding and word embedding. Because some questions are long, we propose a value-based convolutional attentional method to enhance the proposed high-level feature-embedded convolutional semantic model in learning the key parts of the question and the answer. The proposed high-level feature-embedded convolutional semantic model nicely represents the hierarchical structures of word information and concept information in sentences with their layer-by-layer convolution and pooling. Finally, to resolve data sparsity, we propose using the multi-view learning method to train the attention-based convolutional semantic model on question–answer pairs. To the best of our knowledge, we are the first to propose simultaneously handling the above five problems in question retrieval using one framework. Experiments on three real question-answering datasets show that the proposed framework significantly outperforms the state-of-the-art solutions.
Conference Paper
Traditional supervised keyphrase extraction models depend on the features of labelled keyphrases, while prevailing unsupervised models mainly rely on the structure of the word graph, with candidate words as nodes and edges capturing the co-occurrence information between words. However, systematically integrating all this multidimensional heterogeneous information into a unified model is relatively unexplored. In this paper, we focus on how to effectively exploit multidimensional information to improve the keyphrase extraction performance (MIKE). Specifically, we propose a random-walk parametric model, MIKE, that learns the latent representation for a candidate keyphrase that captures the mutual influences among all information, and simultaneously optimizes the parameters and ranking scores of candidates in the word graph. We use the gradient-descent algorithm to optimize our model and report comprehensive experiments on two publicly available WWW and KDD datasets in Computer Science. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art graph-based keyphrase extraction approaches.
Conference Paper
Grammar question retrieval aims to find relevant grammar questions that have similar grammatical structure and usage as the input question query. Previous work on text and sentence retrieval, which is mainly based on statistical and syntactic analysis approaches, is not effective in finding relevant grammar questions with similar grammatical focus. In this paper, we propose a syntactic parse-key tree based approach for English grammar question retrieval which can find relevant grammar questions with similar grammatical focus effectively. In particular, we propose a syntactic parse-key tree to capture the grammatical focus of grammar questions according to the blank or answer position of the questions. Then we propose a novel method to compute the parse-key tree similarity between the parse-trees of the question query and the database questions for question retrieval. The performance results have shown that our proposed approach outperforms other classical text and sentence retrieval methods in accuracy.
Article
Community question answering (cQA) has become an important issue due to the popularity of cQA archives on the Web. This paper focuses on addressing the lexical gap problem in question retrieval. Question retrieval in cQA archives aims to find the existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap problem brings a new challenge for question retrieval in cQA. In this paper, we propose to model and learn continuous word embeddings with metadata of category information within cQA pages for question retrieval using two novel category-powered models. One is a basic category-powered model called MB-NET and the other is an enhanced category-powered model called ME-NET, which can better learn the word embeddings and alleviate the lexical gap problem. To deal with the variable number of word embedding vectors, we employ the Fisher kernel framework to aggregate them into fixed-length vectors. Experimental results on large-scale English and Chinese cQA data sets show that our proposed approaches can significantly outperform state-of-the-art translation models and topic-based models for question retrieval in cQA. Moreover, we further test our approaches in large-scale automatic evaluation experiments. The evaluation results show that promising and significant performance improvements can be achieved.
Article
Community question answering (CQA) has become an increasingly popular research topic. In this paper, we focus on the problem of question retrieval. Question retrieval in CQA can automatically find the most relevant and recent questions that have been solved by other users. However, the word ambiguity and word mismatch problems bring about new challenges for question retrieval in CQA. State-of-the-art approaches address these issues by implicitly expanding the queried questions with additional words or phrases using monolingual translation models. While useful, the effectiveness of these models is highly dependent on the availability of quality parallel monolingual corpora (e.g., question-answer pairs) in the absence of which they are troubled by noise issues. In this work, we propose an alternative way to address the word ambiguity and word mismatch problems by taking advantage of potentially rich semantic information drawn from other languages. Our proposed method employs statistical machine translation to improve question retrieval and enriches the question representation with the translated words from other languages via non-negative matrix factorization. Experiments conducted on real CQA data sets show that our proposed approach is promising.
Conference Paper
It has long been recognized that capturing term relationships is an important aspect of information retrieval. Even with large amounts of data, we usually only have significant evidence for a fraction of all potential term pairs. It is therefore important to consider whether multiple sources of evidence may be combined to predict term relations more accurately. This is particularly important when trying to predict the probability of relevance of a set of terms given a query, which may involve both lexical and semantic relations between the terms. We describe a Markov chain framework that combines multiple sources of knowledge on term associations. The stationary distribution of the model is used to obtain probability estimates that a potential expansion term reflects aspects of the original query. We use this model for query expansion and evaluate the effectiveness of the model by examining the accuracy and robustness of the expansion methods, and investigate the relative effectiveness of various sources of term evidence. Statistically significant differences in accuracy were observed depending on the weighting of evidence in the random walk. For example, using co-occurrence data later in the walk was generally better than using it early, suggesting further improvements in effectiveness may be possible by learning walk behaviors.
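
The general mechanism can be sketched as follows: several row-stochastic term-association matrices are mixed into one transition matrix, and a random walk starting from the query terms is iterated until it settles. This is a simplified illustration, not the framework from the paper. In particular, the fixed mixing weights, the restart step that keeps the walk anchored to the query, and all numbers below are assumptions.

```python
def mix_transitions(sources, weights):
    """Convex combination of row-stochastic matrices (weights should sum to 1)."""
    size = len(sources[0])
    return [
        [sum(w * src[i][j] for src, w in zip(sources, weights)) for j in range(size)]
        for i in range(size)
    ]


def walk_with_restart(transition, start, restart=0.2, steps=200):
    """Iterate pi <- restart * start + (1 - restart) * pi P until it stabilizes."""
    n = len(transition)
    pi = list(start)
    for _ in range(steps):
        stepped = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
        pi = [restart * start[j] + (1 - restart) * stepped[j] for j in range(n)]
    return pi


if __name__ == "__main__":
    terms = ["car", "auto", "bank"]
    # Hypothetical evidence sources: co-occurrence and synonymy.
    cooccurrence = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
    synonymy = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    mixed = mix_transitions([cooccurrence, synonymy], [0.5, 0.5])
    pi = walk_with_restart(mixed, start=[1.0, 0.0, 0.0])  # the query is "car"
    for term, prob in zip(terms, pi):
        print(f"{term}: {prob:.4f}")
```

With these toy numbers the walk gives "auto" more mass than "bank", reflecting the extra synonymy evidence. The observation in the abstract that evidence sources matter differently early and late in the walk would correspond to letting the mixing weights vary by step, which this sketch does not do.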
Conference Paper
Pseudo-relevance feedback assumes that most frequent terms in the pseudo-feedback documents are useful for the retrieval. In this study, we re-examine this assumption and show that it does not hold in reality - many expansion terms identified in traditional approaches are indeed unrelated to the query and harmful to the retrieval. We also show that good expansion terms cannot be distinguished from bad ones merely on their distributions in the feedback documents and in the whole collection. We then propose to integrate a term classification process to predict the usefulness of expansion terms. Multiple additional features can be integrated in this process. Our experiments on three TREC collections show that retrieval effectiveness can be much improved when term classification is used. In addition, we also demonstrate that good terms should be identified directly according to their possible impact on the retrieval effectiveness, i.e. using supervised learning, instead of unsupervised learning.
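
The traditional assumption being questioned can be illustrated with a minimal sketch of frequency-based pseudo-relevance feedback: take the top-ranked documents, count their terms, and add the most frequent ones to the query. The function name, parameters, and toy documents are hypothetical, and the term classification step proposed in the abstract is not implemented here.

```python
from collections import Counter


def prf_expansion_terms(query, ranked_docs, k=2, n_terms=3, stopwords=frozenset()):
    """Pick the most frequent non-query terms from the top-k ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:k]:
        counts.update(t for t in doc if t not in query and t not in stopwords)
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered[:n_terms]]


if __name__ == "__main__":
    query = ["jaguar", "speed"]
    ranked = [
        ["jaguar", "cat", "speed", "cat", "prey"],
        ["jaguar", "car", "speed", "engine", "car", "cat"],
        ["unrelated"],
    ]
    print(prf_expansion_terms(query, ranked))
```

In this toy case the expansion mixes animal-related and vehicle-related terms, which is the kind of noise that frequency alone cannot filter out and that motivates a supervised usefulness classifier.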
Conference Paper
We improve the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type. This is achieved by parsing the English side of a parallel corpus and altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs. In order to retain broad coverage of non-constituent phrases, complex syntactic labels are introduced. A manual evaluation indicates a 19% absolute improvement in paraphrase quality over the baseline method.
Conference Paper
We apply statistical machine translation (SMT) tools to generate novel paraphrases of input sentences in the same language. The system is trained on large volumes of sentence pairs automatically extracted from clustered news articles available on the World Wide Web. Alignment Error Rate (AER) is measured to gauge the quality of the resulting corpus. A monotone phrasal decoder generates contextual replacements. Human evaluation shows that this system outperforms baseline paraphrase generation techniques and, in a departure from previous work, offers better coverage and scalability than the current best-of-breed paraphrasing approaches.
Conference Paper
Monolingual translation probabilities have recently been introduced in retrieval models to solve the lexical gap problem. They can be obtained by training statistical translation models on parallel monolingual corpora, such as question-answer pairs, where answers act as the "source" language and questions as the "target" language. In this paper, we propose to use as a parallel training dataset the definitions and glosses provided for the same term by different lexical semantic resources. We compare monolingual translation models built from lexical semantic resources with two other kinds of datasets: manually-tagged question reformulations and question-answer pairs. We also show that the monolingual translation probabilities obtained (i) are comparable to traditional semantic relatedness measures and (ii) significantly improve the results over the query likelihood and the vector-space model for answer finding.
Conference Paper
This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data make it possible to capture both lexical relations between single words and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.
Article
In this paper, experiments on automatic extraction of keywords from abstracts using a supervised machine learning algorithm are discussed. The main point of this paper is that by adding linguistic knowledge to the representation (such as syntactic features), rather than relying only on statistics (such as term frequency and n-grams), a better result is obtained as measured by keywords previously assigned by professional indexers. In more detail, extracting NP-chunks gives a better precision than n-grams, and by adding the POS tag(s) assigned to the term as a feature, a dramatic improvement of the results is obtained, independent of the term selection approach applied.
Article
In this paper we present a proposal to extend WordNet-like lexical databases by adding phrasets, i.e. sets of free combinations of words which are recurrently used to express a concept (let's call them recurrent free phrases). Phrasets are a useful source of information for different NLP tasks, and particularly in a multilingual environment to manage lexical gaps.
Article
... this article the problem of finding the word alignment of a bilingual sentence-aligned corpus by using language-independent statistical methods. There is a vast literature on this topic, and many different systems have been suggested to solve this problem. Our work follows and extends the methods introduced by Brown, Della Pietra, Della Pietra, and Mercer (1993) by using refined statistical models for the translation process. The basic idea of this approach is to develop a model of the translation process with the word alignment as a hidden variable of this process, to apply statistical estimation theory to compute the "optimal" model parameters, and to perform alignment search to compute the best word alignment.
Article
Community question answering (cQA), which provides a platform for people with diverse backgrounds to share information and knowledge, has become an increasingly popular research topic. In this paper, we focus on the task of question retrieval. The key problem of question retrieval is to measure the similarity between the queried questions and the historical questions which have been solved by other users. The traditional methods measure the similarity based on the bag-of-words (BOW) representation. This representation neither captures dependencies between related words, nor handles synonyms or polysemous words. In this work, we first propose a way to build a concept thesaurus based on the semantic relations extracted from the world knowledge of Wikipedia. Then, we develop a unified framework to leverage these semantic relations in order to enhance the question similarity in the concept space. Experiments conducted on a real cQA data set show that with the help of the Wikipedia thesaurus, the performance of question retrieval is improved as compared to the traditional methods.
Article
Community Question Answering (CQA) is a popular type of service where users ask questions and where answers are obtained from other users or from historical question-answer pairs. CQA archives contain large volumes of questions organized into a hierarchy of categories. As an essential function of CQA services, question retrieval in a CQA archive aims to retrieve historical question-answer pairs that are relevant to a query question. This article presents several new approaches to exploiting the category information of questions for improving the performance of question retrieval, and it applies these approaches to existing question retrieval models, including a state-of-the-art question retrieval model. Experiments conducted on real CQA data demonstrate that the proposed techniques are effective and efficient and are capable of outperforming a variety of baseline methods significantly.
Article
We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
Conference Paper
Modeling query concepts through term dependencies has been shown to have a significant positive effect on retrieval performance, especially for tasks such as web search, where relevance at high ranks is particularly critical. Most previous work, however, treats all concepts as equally important, an assumption that often does not hold, especially for longer, more complex queries. In this paper, we show that one of the most effective existing term dependence models can be naturally extended by assigning weights to concepts. We demonstrate that the weighted dependence model can be trained using existing learning-to-rank techniques, even with a relatively small number of training queries. Our study compares the effectiveness of both endogenous (collection-based) and exogenous (based on external sources) features for determining concept importance. To test the weighted dependence model, we perform experiments on both publicly available TREC corpora and a proprietary web corpus. Our experimental results indicate that our model consistently and significantly outperforms both the standard bag-of-words model and the unweighted term dependence model, and that combining endogenous and exogenous features generally results in the best retrieval effectiveness.
Conference Paper
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 60s and has recently produced good results in the language model framework. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval is mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
Conference Paper
Five independently generated Boolean query formulations for ten different TREC topics were produced by ten different expert online searchers. These different formulations were grouped, and the groups, and combinations of them, were used as searches against the TREC test collection, using the INQUERY probabilistic inference network retrieval engine. Results show that progressive combination of query formulations leads to progressively improving retrieval performance. Results were compared against the performance of INQUERY natural language based queries, and in combination with them. The issue of recall as a performance measure in large databases was raised, since overlap between the searches conducted in this study, and the TREC-1 searches, was smaller than expected.
Conference Paper
While traditional question answering (QA) systems tailored to the TREC QA task work relatively well for simple questions, they do not suffice to answer real world questions. The community-based QA systems offer this service well, as they contain large archives of such questions where manually crafted answers are directly available. However, finding similar questions in the QA archive is not trivial. In this paper, we propose a new retrieval framework based on syntactic tree structure to tackle the similar question matching problem. We build a ground-truth set from Yahoo! Answers, and experimental results show that our method outperforms traditional bag-of-word or tree kernel based methods by 8.3% in mean average precision. It further achieves up to 50% improvement by incorporating semantic features as well as matching of potential answers. Our model does not rely on training, and it is demonstrated to be robust against grammatical errors as well.
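
Since the comparison above is reported in mean average precision (MAP), a short sketch of how that metric is typically computed may help. The question identifiers and relevance judgments below are made up for illustration and have no connection to the Yahoo! Answers ground-truth set mentioned in the abstract.

```python
def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average the per-query AP over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["q3", "q1", "q7", "q2"], {"q1", "q2"}),  # relevant at ranks 2 and 4
        (["q5", "q9"], {"q5"}),                    # relevant at rank 1
    ]
    print(mean_average_precision(runs))
```

Relevant items that never appear in the ranked list still count in the denominator, so missing them lowers the score.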
Conference Paper
Current search engines do not, in general, perform well with longer, more verbose queries. One of the main issues in processing these queries is identifying the key concepts that will have the most impact on effectiveness. In this paper, we develop and evaluate a technique that uses query-dependent, corpus-dependent, and corpus-independent features for automatic extraction of key concepts from verbose queries. We show that our method achieves higher accuracy in the identification of key concepts than standard weighting methods such as inverse document frequency. Finally, we propose a probabilistic model for integrating the weighted key concepts identified by our method into a query, and demonstrate that this integration significantly improves retrieval effectiveness for a large set of natural language description queries derived from TREC topics on several newswire and web collections.
Conference Paper
Retrieval in a question and answer archive involves finding good answers for a user's question. In contrast to typical document retrieval, a retrieval model for this task can exploit question similarity as well as ranking the associated answers. In this paper, we propose a retrieval model that combines a translation-based language model for the question part with a query likelihood approach for the answer part. The proposed model incorporates word-to-word translation probabilities learned through exploiting different sources of information. Experiments show that the proposed translation based language model for the question part outperforms baseline methods significantly. By combining with the query likelihood language model for the answer part, substantial additional effectiveness improvements are obtained.
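
The abstract does not give the scoring formula, so the following is only a sketch of how a translation-based language model for the question part is commonly written: the probability of a query word given an archived question mixes the maximum-likelihood estimate with a translation term, and the result is smoothed with a collection model. The mixing weights `beta` and `lam`, the translation table, and the toy questions are hypothetical, and the answer-part query likelihood component is omitted.

```python
import math
from collections import Counter


def ml_prob(word, tokens):
    """Maximum-likelihood estimate P_ml(word | tokens)."""
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


def trans_lm_prob(word, question, collection, trans, beta=0.3, lam=0.2):
    """P(word | question) with word-to-word translation and collection smoothing.

    trans[(w, t)] is read as the probability that question word t "translates" to w.
    """
    p_ml = ml_prob(word, question)
    p_tr = sum(trans.get((word, t), 0.0) * ml_prob(t, question) for t in set(question))
    p_mix = (1 - beta) * p_ml + beta * p_tr
    return (1 - lam) * p_mix + lam * ml_prob(word, collection)


def score(query, question, collection, trans, beta=0.3, lam=0.2):
    """Log-likelihood of the query; assumes every query word occurs in the collection."""
    return sum(
        math.log(trans_lm_prob(w, question, collection, trans, beta, lam)) for w in query
    )


if __name__ == "__main__":
    q1 = ["repair", "notebook"]
    q2 = ["bake", "bread"]
    collection = q1 + q2 + ["fix", "laptop"]
    trans = {("fix", "repair"): 0.5, ("laptop", "notebook"): 0.6}
    query = ["fix", "laptop"]
    for name, question in (("q1", q1), ("q2", q2)):
        print(name, round(score(query, question, collection, trans), 3))
```

Although neither archived question shares a word with the query, the translation term lets the first one score higher, which is how such models address the lexical gap.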
Article
This paper presents an in-depth analysis of a state-of-the-art Question Answering system. Several scenarios are examined: (1) the performance of each module in a serial baseline system, (2) the impact of feedbacks and the insertion of a logic prover, and (3) the impact of various lexical resources. The main conclusion is that the overall performance depends on the depth of natural language processing resources and the tools used for answer finding.
Article
We explore probabilistic lexico-syntactic pattern matching, also known as soft pattern matching, in a definitional question answering system. Most current systems use regular expression-based hard matching patterns to identify definition sentences. Such rigid surface matching often fares poorly when faced with language variations. We propose two soft matching models to address this problem: one based on bigrams and the other on the Profile Hidden Markov Model (PHMM). Both models provide a theoretically sound method to model pattern matching as a probabilistic process that generates token sequences. We demonstrate the effectiveness of the models on definition sentence retrieval for definitional question answering. We show that both models significantly outperform the state-of-the-art manually constructed hard matching patterns on recent TREC data. A critical difference between the two models is that the PHMM has a more complex topology. We experimentally show that the PHMM can handle language variations more effectively but requires more training data to converge. While we evaluate soft pattern models only on definitional question answering, we believe that both models are generic and can be extended to other areas where lexico-syntactic pattern matching can be applied.
Article
The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as "who was the first American in space?" or "what is the second tallest mountain in the world?" Yet today's most advanced web search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate answers to such questions. In this paper, we extend question-answering techniques, first studied in the information retrieval literature, to the web and experimentally evaluate their performance. First we introduce Mulder, which we believe to be the first general-purpose, fully-automated question-answering system available on the web. Second, we describe Mulder's architecture, which relies on multiple search-engine queries, natural-language parsing, and a novel voting procedure to yield reliable answers coupled with high recall. Finally, we compare Mulder's performance to that of Google and AskJeeves on questions drawn from the TREC-8 question answering track. We find that Mulder's recall is more than a factor of three higher than that of AskJeeves. In addition, we find that Google requires 6.6 times as much user effort to achieve the same level of recall as Mulder.
Article
This article presents two probabilistic models for answer ranking in the multilingual question-answering (QA) task, which finds exact answers to a natural language question written in different languages. Although some probabilistic methods have been utilized in traditional monolingual answer-ranking, limited prior research has been conducted for answer-ranking in multilingual question-answering with formal methods. This article first describes a probabilistic model that predicts the probabilities of correctness for individual answers in an independent way. It then proposes a novel probabilistic method to jointly predict the correctness of answers by considering both the correctness of individual answers as well as their correlations. As far as we know, this is the first probabilistic framework that proposes to model the correctness and correlation of answer candidates in multilingual question-answering and provides a novel approach to design a flexible and extensible system architecture for answer selection in multilingual QA. An extensive set of experiments was conducted to show the effectiveness of the proposed probabilistic methods in English-to-Chinese and English-to-Japanese cross-lingual QA, as well as English, Chinese, and Japanese monolingual QA using TREC and NTCIR questions.
Brants, T., and Franz, A. 2006. Web 1T 5-gram Version 1.
Dolan, B.; Quirk, C.; and Brockett, C. 2004. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th international conference on Computational Linguistics, COLING '04.
Ibrahim, A.; Katz, B.; and Lin, J. 2003. Extracting structural paraphrases from aligned monolingual corpora. In Proceedings of the second international workshop on Paraphrasing - Volume 16, PARAPHRASE '03, 57-64.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, 48-54.
Marton, Y.; Callison-Burch, C.; and Resnik, P. 2009. Improved statistical machine translation using monolingually-derived paraphrases. In Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, 381-390.
Ponte, J. M., and Croft, W. B. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '98, 275-281.
Xu, J., and Croft, W. B. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '96, 4-11.