ZERO-SHOT SLOT FILLING WITH DPR AND RAG
A PREPRINT
Michael Glass, Gaetano Rossiello, Alfio Gliozzo
IBM Research AI
mrglass@us.ibm.com, gaetano.rossiello@ibm.com, gliozzo@us.ibm.com
April 20, 2021
ABSTRACT
The ability to automatically extract Knowledge Graphs (KG) from a given collection of documents is a long-standing problem in Artificial Intelligence. One way to assess this capability is through the task of slot filling. Given an entity query in the form of [ENTITY, SLOT, ?], a system is asked to 'fill' the slot by generating or extracting the missing value from a relevant passage or passages. This capability is crucial for creating systems for automatic knowledge base population, which is in ever-increasing demand, especially in enterprise applications. Recently, there has been a promising direction in evaluating language models in the same way we would evaluate knowledge bases, and the task of slot filling is the most suitable for this purpose. Recent advances in the field try to solve this task in an end-to-end fashion using retrieval-based language models. Models like Retrieval Augmented Generation (RAG) show surprisingly good performance without involving complex information extraction pipelines. However, the results achieved by these models on the two slot filling tasks in the KILT benchmark are still not at the level required by real-world information extraction systems. In this paper, we describe several strategies we adopted to improve the retriever and the generator of RAG in order to make it a better slot filler. Our KGI0 system¹ reached the top-1 position on the KILT leaderboard on both the T-REx and zsRE datasets by a large margin.
1 Introduction
A main barrier to the adoption of KG technology in the enterprise is the effort required to define the schema and populate enterprise-specific relational data sources, such as KGs. In this work, we address this problem by exploring the use of zero-shot learning approaches for slot filling.
In the task of slot filling, the goal is to identify a pre-determined set of relations for a given entity and use them to populate infobox-like structures. This can be done by exploring the occurrences of the input entity in the corpus and gathering information about its slot fillers from the context in which it is located. Figure 1 illustrates the slot filling task. A slot filling system processes and indexes a corpus of documents; then, when prompted with an entity and a number of relations, it fills out an infobox and provides the evidence passages which explain the predictions.
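For concreteness, a minimal sketch of this query/output contract; the entity, slot, and values below are hypothetical.

```python
# Hypothetical slot-filling query: [ENTITY, SLOT, ?]
query = {"entity": "Andrea Pirlo", "slot": "member of sports team", "value": None}

# A slot filler returns the missing value together with the evidence
# passage(s) that justify the prediction.
prediction = {
    "value": "Juventus F.C.",
    "provenance": ["Andrea Pirlo joined Juventus on a free transfer in 2011 ..."],
}
```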
Over the past years, proposed slot filling systems have commonly involved complex pipelines for named entity recognition, entity co-reference resolution, and relation extraction [Ellis et al., 2015]. In particular, the task of extracting relations between entities from text has been shown to be the weakest component of the chain. The community has proposed various solutions to improve relation extraction performance, such as rule-based [Angeli et al., 2015], supervised [Zhang et al., 2017], or distantly supervised [Glass et al., 2018] approaches. However, all these approaches require considerable human effort in creating hand-crafted rules, annotating training data, or building well-curated datasets for bootstrapping relation classifiers.
The use of language models as sources of knowledge [Petroni et al., 2019, Roberts et al., 2020, Wang et al., 2020, Petroni et al., 2020a] has opened tasks such as zero-shot slot filling to pre-trained transformers. The introduction of retrieval augmented language models such as RAG [Lewis et al., 2020b] and REALM [Guu et al., 2020] also permits providing textual provenance for the generated slot fillers.
¹ Our source code is available at: https://github.com/IBM/retrieve-write-slot-filling
Figure 1: Knowledge Graph Induction from textual corpora
A recently introduced suite of benchmarks, KILT (Knowledge Intensive Language Tasks) [Petroni et al., 2020b],
standardizes two zero-shot slot filling tasks: zsRE [Levy et al., 2017] and T-REx [Elsahar et al., 2018]. These tasks
provide a competitive benchmark to drive advancements in slot filling.
One of the most interesting aspects of using pre-trained language models for zero-shot slot filling is the lower effort required for production deployment, which is a key feature for fast adaptation to new domains. However, the best performance achieved by current retrieval-based models on the two slot filling tasks in KILT is still not satisfactory. This is mainly due to poor retrieval performance, which in turn degrades the generation of the filler.
In this work, we propose a new slot-filling-specific training for both DPR and RAG. Furthermore, we observed that RAG's strategy of marginalizing over multiple sequence-to-sequence generations works better than the three-passage concatenation used in Multi-DPR's BART. We implemented these ideas in our KGI0 system, showing large gains on both the T-REx (+24% KILT-F1) and zsRE (+18% KILT-F1) datasets.
2 Related Work
KILT was introduced with a number of baseline approaches. The best performing of these is RAG [Lewis et al., 2020b].
The model incorporates Dense Passage Retrieval (DPR) [Karpukhin et al., 2020] to first gather evidence passages for
the query, then uses a model initialized from BART [Lewis et al., 2020a] to do sequence-to-sequence generation from
each evidence passage concatenated with the query to generate the answer. In the baseline RAG approach, only the query encoder and generation component are fine-tuned on the task. The passage encoder, trained on Natural Questions [Kwiatkowski et al., 2019], is held fixed. Interestingly, while it gives the best performance of the baselines tested on the
task of producing slot fillers, its performance on the retrieval metrics is worse than BM25. This suggests that fine-tuning
the entire retrieval component could be beneficial.
In an effort to improve the retrieval performance, Multi-task DPR [Maillard et al., 2021] used the multi-task training of
the KILT suite of benchmarks to train the DPR passage and query encoder. The top-3 passages returned by the resulting
passage index were then combined into a single sequence with the query and a BART model was used to produce the
answer. This resulted in large gains in retrieval performance.
DensePhrases [Lee et al., 2020] is a different approach to knowledge intensive tasks with a short answer. Rather than
index passages which are then consumed by a reader or generator component, DensePhrases indexes the phrases in the
corpus that can be potential answers to questions, or fillers for slots. Each phrase is represented by the pair of its start
and end token vectors from the final layer of a transformer initialized from SpanBERT-base-cased [Joshi et al., 2020].
Question vectors come from the [CLS] token of two other transformers: one to be matched with the slot filler’s start
vector and one for the end vector. The start and end token vectors are indexed separately for maximum inner product
search. Then at inference time the top-k start tokens are found for the question’s start vector and the top-k end tokens
are found for the question’s end vector. These results are merged to find the top scoring phrase, which is then predicted
as the slot filler.
GENRE [De Cao et al., 2020] addresses the retrieval task in KILT slot filling by using a sequence-to-sequence
transformer to generate the title of the Wikipedia page where the answer can be found. This method can produce
excellent scores for retrieval but does not address the problem of producing the slot filler. It is trained on BLINK [Wu
et al., 2020] and all KILT tasks jointly.
3 Knowledge Graph Induction
Figure 2 shows Knowledge Graph Induction (KGI), our approach to zero-shot slot filling, combining a DPR model and
RAG model, both trained for slot filling. Due to the close connection between slot filling and open factoid question
answering, we initialize our models from the Natural Questions [Kwiatkowski et al., 2019] trained models for DPR and RAG available from Hugging Face². We then use a two-phase training procedure: first we train the DPR model, i.e. both the query and context encoder, using the KILT provenance ground truth. Then we train the sequence-to-sequence generation and further train the query encoder using only the target tail entity as the objective.
Figure 2: KGI Architecture. A "head [SEP] relation" query is encoded by the DPR query encoder and matched against an ANN index of passages from the KILT knowledge source; the retrieved passages feed the RAG generator, which produces the tail.
Since the transformers for passage encoding and generation can accept a limited sequence length, we segment the
documents of the KILT knowledge source (2019/08/01 Wikipedia snapshot) into passages. The ground truth provenance
for the slot filling tasks is at the granularity of paragraphs, so we align our passage segmentation on paragraph boundaries
when possible. If two or more paragraphs are short enough to be combined, we combine them into a single passage, and if a single paragraph is too long, we truncate it.
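A minimal sketch of this segmentation policy, assuming whitespace tokenization and a hypothetical max_tokens budget (the paper does not state the exact length limit):

```python
def segment(paragraphs, max_tokens=128):
    """Greedily merge consecutive paragraphs into passages of at most
    max_tokens whitespace tokens; over-long paragraphs are truncated."""
    passages, current = [], []

    def flush():
        if current:
            passages.append(" ".join(current))
            current.clear()

    for para in paragraphs:
        tokens = para.split()
        if len(tokens) > max_tokens:
            flush()                                         # emit what we have,
            passages.append(" ".join(tokens[:max_tokens]))  # then truncate
        elif sum(len(p.split()) for p in current) + len(tokens) <= max_tokens:
            current.append(para)                            # short enough to combine
        else:
            flush()
            current.append(para)
    flush()
    return passages
```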
Our approach to DPR training for slot filling is a straightforward adaptation of the question answering training in the original DPR work [Karpukhin et al., 2020]. We first index the passages using a traditional keyword search engine, Anserini³. The head entity and the relation are used as a keyword query to find the top-k passages by BM25. Passages whose paragraphs overlap the ground truth are excluded, as are passages that contain a correct answer. The top-ranked remaining result is used as a hard negative for DPR training.
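A sketch of this hard-negative mining step, here using Pyserini (the Python interface to Anserini); the index path, the helper name, and checking provenance overlap by passage id are assumptions and simplifications:

```python
from pyserini.search.lucene import LuceneSearcher

# Assumes a Lucene index over our KILT passage segmentation has been built;
# the index path is hypothetical.
searcher = LuceneSearcher("indexes/kilt-passages")

def bm25_hard_negative(head, relation, positive_ids, answers, k=20):
    """Return the top BM25 hit that neither overlaps the ground-truth
    provenance nor contains a correct answer."""
    for hit in searcher.search(f"{head} {relation}", k=k):
        if hit.docid in positive_ids:
            continue  # overlaps the ground-truth provenance
        text = searcher.doc(hit.docid).raw()
        if any(ans in text for ans in answers):
            continue  # accidentally contains a correct answer
        return hit.docid
    return None
```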
After locating a hard negative for each query, the DPR training data is a set of triples: query, positive passage (given by
the KILT ground truth provenance) and our BM25 hard negative passage. Figure 3 shows the training process for DPR.
For each batch of training triples, we encode the queries and passages independently. The passage and query encoders
are BERT [Devlin et al., 2019] models. Then we find the inner product of all queries with all passages. After applying a
softmax to the score vector for each query, the loss is the negative log-likelihood for the positive passages.
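The loss described above can be written compactly; a minimal PyTorch sketch, not the authors' code: for query i the positive sits at column i, and every other passage in the batch acts as a negative.

```python
import torch
import torch.nn.functional as F

def dpr_loss(q_vecs, pos_vecs, neg_vecs):
    """In-batch negative loss for DPR training.
    q_vecs: [B, d] query vectors; pos_vecs, neg_vecs: [B, d] passage vectors.
    For query i the positive is column i; the other positives and all hard
    negatives in the batch serve as negatives."""
    passages = torch.cat([pos_vecs, neg_vecs], dim=0)  # [2B, d]
    scores = q_vecs @ passages.t()                     # [B, 2B] inner products
    targets = torch.arange(q_vecs.size(0), device=scores.device)
    # cross_entropy = row-wise softmax + negative log-likelihood of positives
    return F.cross_entropy(scores, targets)
```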
² https://github.com/huggingface/transformers
³ https://github.com/castorini/anserini
Figure 3: DPR Training. Queries ("head [SEP] relation") and passages are encoded independently; each query is scored against its positive passage, its BM25 hard negative, and the other queries' passages (batch negatives), with a softmax applied by row.
Using the trained DPR passage encoder we generate vectors for the approximately 32 million passages in our segmentation of the KILT knowledge source. Though this is a computationally expensive step, it is easily parallelized. The passage vectors are then indexed with an ANN (Approximate Nearest Neighbors) data structure, in this case HNSW (Hierarchical Navigable Small World) [Malkov and Yashunin, 2018], using the open source FAISS [Johnson et al., 2017] library⁴. We use scalar quantization down to 8 bits to reduce the memory footprint.
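A minimal FAISS sketch of such an index: HNSW over 8-bit scalar-quantized vectors, with inner-product scoring to match DPR. The HNSW connectivity (M=32) and the 768-dimensional vectors are assumptions; the paper does not report these parameters.

```python
import faiss
import numpy as np

d = 768  # DPR vectors come from BERT-base encoders

# HNSW graph over 8-bit scalar-quantized vectors, scored by inner product.
index = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_8bit, 32,
                          faiss.METRIC_INNER_PRODUCT)

vectors = np.random.rand(10000, d).astype("float32")  # stand-in passage vectors
index.train(vectors)  # fit the scalar quantizer's value ranges
index.add(vectors)

query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 20)  # top-20 passages, as in Section 3
```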
The query encoder is also trained for slot filling alongside the passage encoder. We inject the trained query encoder
into the RAG model for Natural Questions. Due to the loose coupling between the query encoder and the sequence-to-
sequence generation of RAG, we can update the pre-trained model’s query encoder without disrupting the quality of the
generation.
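With the Hugging Face implementations, this injection amounts to replacing one module; a sketch under the assumption that a slot-filling-trained DPR query encoder and a FAISS-indexed passage set have been saved locally (all local paths are hypothetical):

```python
from transformers import (DPRQuestionEncoder, RagRetriever,
                          RagTokenForGeneration, RagTokenizer)

# Load the Natural Questions RAG model with a custom (KILT) passage index.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="custom",
    passages_path="data/kilt_passages", index_path="data/kilt_hnsw.faiss")
model = RagTokenForGeneration.from_pretrained(
    "facebook/rag-token-nq", retriever=retriever)

# The loose coupling: replacing the query encoder leaves the generator intact.
model.rag.question_encoder = DPRQuestionEncoder.from_pretrained(
    "models/dpr-slot-filling-query-encoder")

inputs = tokenizer("Albert Einstein [SEP] educated at", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```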
Figure 4: RAG Architecture. The "head [SEP] relation" query is encoded and matched against the ANN index; retrieved passages are fed to the generator, whose per-passage predictions are marginalized to produce the tail.
Figure 4 illustrates the architecture of RAG. The RAG model is trained to predict the ground truth tail entity from the
head and relation query. First the query is encoded to a vector and relevant passages are retrieved from the ANN index.
The query is concatenated to each passage and the generator predicts a probability distribution over the possible next tokens for each sequence. These predictions are weighted according to the score between the query and the passage, i.e. the inner product of the query vector and passage vector. The weighted probability distributions are then combined to give a single probability distribution for the next token. Beam search is used to select the overall most likely tail entity.
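A minimal sketch of this per-token marginalization (the RAG-token formulation), assuming k retrieved passages and a vocabulary of size V:

```python
import torch

def marginal_next_token(logits, doc_scores):
    """Combine per-passage next-token predictions into a single distribution.
    logits: [k, V] next-token logits, one row per retrieved passage;
    doc_scores: [k] query-passage inner products."""
    p_doc = torch.softmax(doc_scores, dim=0)        # passage weights
    p_tok = torch.softmax(logits, dim=-1)           # per-passage distributions
    return (p_doc.unsqueeze(1) * p_tok).sum(dim=0)  # [V] marginal distribution
```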
Because the provenance used is at the level of passages but the evaluation is on page-level retrieval, we retrieve up to twenty passages so that we typically get at least five documents for the Recall@5 metric.
⁴ https://github.com/facebookresearch/faiss
Hyperparameter DPR RAG
learn rate 5e-5 3e-5
batch size 128 128
epochs 2 1*
warmup instances 0 10000
learning schedule linear triangular
max grad norm 1 1
weight decay 0 0
Adam epsilon 1e-8 1e-8
* Since the training set for T-REx is so large, we take only 500k instances.
Table 1: KGI0 hyperparameters
                 Instances                Relations
Dataset    Train      Dev    Test    Train  Dev  Test
zsRE        147909    3724   4966       84   12    24
T-REx      2284168    5000   5000      106  104   104
Table 2: Zero-shot Slot Filling Dataset Sizes
We have not done hyperparameter tuning, instead using hyperparameters similar to the original works on training DPR
and RAG. Table 1 shows the hyperparameters used in our experiments.
4 Experiments
Table 2 gives statistics on the two zero-shot slot filling datasets. While the T-REx dataset is larger by far in the number
of instances, the training sets have a similar number of distinct relations. We use only 500 thousand training instances
of T-REx in our experiments to increase the speed of experimentation.
As an initial experiment we tried RAG with its default index of Wikipedia, distributed through Hugging Face. We refer to this as RAG-KKS, or RAG without the KILT Knowledge Source. Since the passages returned are not aligned to the KILT provenance ground truth, we do not report retrieval metrics for this experiment. Motivated by the low retrieval performance reported for the RAG baseline by Petroni et al. [2020b], we also experimented with replacing the DPR retrieval with simple BM25 (RAG+BM25). We provide the raw BM25 scores for the passages to the RAG model, to weight their impact in generation. This provides a significant boost in performance. Finally, we use the approach explained in Section 3 to train both the DPR and RAG models. We call this system KGI0, an initial knowledge graph induction system.
The metrics we report include accuracy and F1 on the slot filler, where F1 is based on the recall and precision of the tokens in the answer, allowing partial credit on slot fillers. Our systems (except for RAG-KKS) also provide provenance information for the top answer. R-Precision and Recall@5 measure the quality of this provenance against the KILT ground truth provenance. Finally, KILT-Accuracy and KILT-F1 are combined metrics that measure the accuracy and F1 of the slot filler only when the correct provenance is provided. A sketch of the slot-filler F1 follows; Table 3 then gives our development set results.
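For concreteness, a simplified sketch of the token-level F1; the official KILT scorer additionally normalizes case, punctuation, and articles, and takes the maximum over multiple gold answers.

```python
def token_f1(prediction, gold):
    """Token-overlap F1 between predicted and gold slot fillers,
    giving partial credit for partially matching answers."""
    pred_toks, gold_toks = prediction.split(), gold.split()
    common = sum(min(pred_toks.count(t), gold_toks.count(t))
                 for t in set(pred_toks))
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, token_f1("Juventus F.C.", "Juventus") gives 0.67, reflecting the partial credit for an answer with one extra token.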
Method      R-Prec   Recall@5  Accuracy  F1      KILT-AC  KILT-F1
zsRE
RAG-KKS        -        -      38.72%    46.94%     -        -
RAG+BM25    58.86%   80.24%    45.73%    55.18%  36.14%   41.85%
KGI0        77.27%   96.37%    69.55%    97.66%  69.31%   76.83%
T-REx
RAG-KKS        -        -      63.28%    67.67%     -        -
RAG+BM25    46.40%   67.31%    69.10%    73.11%  39.98%   41.21%
KGI0        61.30%   71.18%    76.58%    80.27%  56.40%   57.70%
Table 3: Dev. Set Performance for Various Retrieval Methods
Table 4 gives the test set performance of the top systems on the KILT leaderboard. KGI0 is our system, while DensePhrases, GENRE, Multi-DPR and RAG for KILT are explained briefly in Section 2. KGI0 gains dramatically in slot filling accuracy over the previous best systems, with gains of over 10 percentage points in zsRE and even more in T-REx. The combined metrics of KILT-AC and KILT-F1 show even larger gains, suggesting that the KGI0 approach is effective at providing justifying evidence when generating the correct answer. We achieve gains of 17 to 27 percentage points in KILT-AC.
Method                  R-Prec   Recall@5  Accuracy  F1      KILT-AC  KILT-F1
zsRE
KGI0                    94.18%   95.19%    68.97%    74.47%  68.32%   73.45%
DensePhrases            57.43%   60.47%    47.42%    54.75%  41.34%   46.79%
GENRE                   95.81%   97.83%     0.02%     2.10%   0.00%    1.85%
Multi-DPR               80.91%   93.05%    57.95%    63.75%  50.64%   55.44%
RAG (KILT organizers)   53.73%   59.52%    44.74%    49.95%  36.83%   39.91%
T-REx
KGI0                    59.70%   70.38%    77.90%    81.31%  55.54%   56.79%
DensePhrases            37.62%   40.07%    53.90%    61.74%  27.84%   32.34%
GENRE                   79.42%   85.33%     0.10%     7.67%   0.04%    6.66%
Multi-DPR               69.46%   83.88%     0.00%     0.00%   0.00%    0.00%
RAG (KILT organizers)   28.68%   33.04%    59.20%    62.96%  23.12%   23.94%
Table 4: KILT Leaderboard Top Systems
5 Conclusion
The KGI0 system, combining slot filling specific training for both its DPR and RAG components, produces large gains in zero-shot slot filling. Our early experiments suggested the effectiveness of fine-tuning the retrieval component for the task, and highlighted the loose coupling of RAG's retrieval with its generation. We find that DPR can be customized to the slot filling task and inserted into a pre-trained QA model for generation, to then be fine-tuned on the task. Relative to Multi-DPR, we see the benefit of weighting passage importance by retrieval score and marginalizing over multiple generations, compared to the strategy of concatenating the top three passages and running a single sequence-to-sequence generation. GENRE is still best in retrieval, suggesting that, at least for a corpus such as Wikipedia, generating the title of the page can be very effective.
References
Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. Leveraging linguistic structure for open
domain information extraction. In ACL (1), pages 344–354. The Association for Computer Linguistics, 2015.
Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. arXiv preprint
arXiv:2010.00904, 2020.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, 2019.
Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie M. Strassel. Overview of
linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC. NIST, 2015.
Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena
Simperl. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the
Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
Michael R. Glass, Alfio Gliozzo, Oktie Hassanzadeh, Nandana Mihindukulasooriya, and Gaetano Rossiello. Inducing
implicit relations from text using distantly supervised deep nets. In International Semantic Web Conference (1),
volume 11136 of Lecture Notes in Computer Science, pages 38–55. Springer, 2018.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language
model pre-training. arXiv preprint arXiv:2002.08909, 2020.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint
arXiv:1702.08734, 2017.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving
pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:
64–77, 2020.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau
Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle
Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei
Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question
answering research. Transactions of the Association of Computational Linguistics, 2019.
Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Chen. Learning dense representations of phrases at scale. arXiv
preprint arXiv:2012.12624, 2020.
Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension.
In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342,
2017.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov,
and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation,
and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
pages 7871–7880, 2020a.
Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler,
Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.
arXiv preprint arXiv:2005.11401, 2020b.
Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, and Gargi Ghosh. Multi-task retrieval for knowledge-intensive tasks. arXiv preprint arXiv:2101.00117, 2021.
Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical
navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4):824–836, 2018.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H.
Miller. Language models as knowledge bases? In EMNLP/IJCNLP (1), pages 2463–2473. Association for
Computational Linguistics, 2019.
Fabio Petroni, Patrick S. H. Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and
Sebastian Riedel. How context affects language models’ factual predictions. In AKBC, 2020a.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine
Jernite, Vassilis Plachouras, Tim Rocktäschel, et al. Kilt: a benchmark for knowledge intensive language tasks. arXiv
preprint arXiv:2009.02252, 2020b.
Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language
model? In EMNLP (1), pages 5418–5426. Association for Computational Linguistics, 2020.
Chenguang Wang, Xiao Liu, and Dawn Song. Language models are open knowledge graphs. CoRR, abs/2010.11967,
2020.
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. Scalable zero-shot entity linking
with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 6397–6407, 2020.
Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. Position-aware attention and
supervised data improve slot filling. In EMNLP, pages 35–45. Association for Computational Linguistics, 2017.