Achieving State-of-the-Art Open-Domain QA
Performance through Fusion-in-Decoder Method
Waseem AlShikh Melisa Russak Kiran Kamble Christopher Bryant
Manhal Daaboul Kirk Goddard Brock Imel Parikshith Kulkarni
Writer, Inc.
{waseem,melisa,kiran,...,firstname}@writer.com
Abstract
Open-domain question answering (QA) has recently made significant progress,
with large Transformer-based generative models demonstrating impressive performance.
However, these models are computationally expensive to train and query, limiting
their practical application. In this whitepaper, we introduce a novel approach to
open-domain QA that combines the strengths of retrieval and generative models,
aiming to achieve more efficient and accurate question answering. Our approach,
termed Fusion-in-Decoder, retrieves informative passages and leverages them with
a sequence-to-sequence model to generate answers. This method demonstrates
state-of-the-art results on benchmarks like Natural Questions and TriviaQA, and
offers a highly scalable framework for aggregating and combining information
from multiple passages.
1 Introduction
Generative models have emerged as a powerful tool for open-domain question answering, proving
their effectiveness in various benchmarks. However, these models come with the cost of high resource
consumption during training and querying. A viable solution to such practical concerns involves
leveraging external knowledge sources such as Wikipedia, which can alleviate the need for massive
computing resources.
Our research centers on harnessing the strengths of both generative models and retrieval systems for
more efficient open-domain QA. We propose the Fusion-in-Decoder approach, a two-step method that
first retrieves support passages and then processes them with a sequence-to-sequence model to gener-
ate responses. We demonstrate the efficacy and scalability of our approach through comprehensive
experiments and comparisons with existing methods.
2 Background and Related Work
Open-domain QA is the task of answering factual questions without a pre-specified context, typically by drawing on large text corpora or on knowledge captured by large-scale language models trained on vast amounts of data. Previously, retrieval-based approaches using extractive models, sparse representations, and dense embeddings have been employed for this task.
Generative models have been utilized for several datasets requiring abstractive answers, such as
NarrativeQA, CoQA, and ELI5. The combination of retrieval and generative models has also been
explored, but with limited scalability when processing large numbers of support passages.
Our research extends these prior works by investigating Fusion-in-Decoder, a method that can
efficiently and effectively integrate passage retrieval and generative models.
3 Methodology
The Fusion-in-Decoder (FiD) approach focuses on effectively combining the advantages of retrieval
and generative models for efficient and accurate open-domain QA. The methodology consists of the
following steps:
1. Support Passage Retrieval
• Data Preprocessing: Perform tokenization, stemming, and stop-word removal on the text corpus to produce a properly formatted dataset for the retrieval models.
• Indexing: Index the preprocessed dataset to create an efficient, searchable structure for the chosen retrieval model.
• Passage Retrieval Algorithms:
– Sparse Retrieval: Implement the BM25 algorithm, an information retrieval function that ranks matching passages from the text corpus based on their relevance to the question.
– Dense Retrieval: Use the Dense Passage Retriever (DPR), an embedding-based approach that ranks passages based on their similarity to the input question, using vector representations of both questions and passages.
• Passage Selection: Select a predefined number k of top-ranked passages to pass on to the generative model.
2. Generative Encoder-Decoder Model
• Preprocessing: Concatenate the input question and the retrieved support passages into a single, formatted input string.
• Encoder-Decoder Model Selection: Choose an appropriate generative model such as T5 or BART, which are pretrained sequence-to-sequence models based on the Transformer architecture.
• Model Fine-tuning: Fine-tune the selected encoder-decoder model on the QA dataset by minimizing the generation loss while conditioning on the retrieved support passages.
• Generation Settings:
– Temperature: Control the randomness of generated outputs by adjusting the temperature parameter.
– Beam Search: Explore a predefined number of alternative answers during sequence generation to improve the quality of the generated response.
3. Evaluation Metrics
• Exact Match (EM): Measure the percentage of generated answers that exactly match the reference ground-truth answers.
• F1 Score: Compute the F1 score to assess generated answers based on token overlap, considering both recall and precision without requiring an exact match.
4. Ablation Study
• Impact of Sparse and Dense Retrieval: Investigate the effect of using sparse (BM25) versus dense (DPR) retrieval methods on the final QA performance.
• Influence of Generative Models: Analyze the impact of using different generative models (e.g., T5 or BART) on the overall QA performance.
• Role of Retrieved Passages: Study the effect of varying the number of retrieved support passages on the FiD method's accuracy.
Through this methodology, the Fusion-in-Decoder approach integrates passage retrieval with generative models to improve both the efficiency and the accuracy of open-domain question answering; a minimal sketch of the resulting pipeline follows.
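The following sketch is illustrative rather than a description of our exact experimental setup: it wires BM25 retrieval (via the rank_bm25 package) to the concatenation-based input formatting of step 2 and beam-search generation with a pretrained T5 checkpoint from Hugging Face Transformers. The toy corpus, the "question: ... context: ..." input format, and the t5-base checkpoint are assumptions made for the example, and the libraries (rank_bm25, transformers, sentencepiece) are assumed to be installed; in practice the model is first fine-tuned on the target QA dataset.

```python
# A minimal sketch of the retrieve-then-generate pipeline described above.
# The corpus, the "question: ... context: ..." input format, and the t5-base
# checkpoint are illustrative choices, not our exact experimental configuration.
from rank_bm25 import BM25Okapi
from transformers import T5ForConditionalGeneration, T5Tokenizer

corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Everest is the highest mountain above sea level.",
    "The Great Wall of China is visible from low Earth orbit.",
]

# 1. Support passage retrieval: rank passages with BM25 and keep the top k.
tokenized_corpus = [passage.lower().split() for passage in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def retrieve(question: str, k: int = 2) -> list[str]:
    return bm25.get_top_n(question.lower().split(), corpus, n=k)

# 2. Generative encoder-decoder model: concatenate the question with the
#    retrieved passages into a single input string and generate with T5.
#    In practice the model would first be fine-tuned on the QA dataset.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def answer(question: str, k: int = 2, num_beams: int = 4) -> str:
    passages = retrieve(question, k=k)
    prompt = "question: " + question + " context: " + " ".join(passages)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, num_beams=num_beams, max_length=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer("Where is the Eiffel Tower located?"))
```

Swapping the retriever (BM25 for DPR) or the backbone (T5 for BART) only changes the retrieval and model-loading steps; the surrounding pipeline stays the same.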
4 Experiments and Results
We perform extensive experiments to assess the Fusion-in-Decoder (FiD) method on various datasets,
comparing its performance against other commonly used approaches. Our experiments focus on
several aspects, including the choice of different retrieval models, generative models, and varying the
number of retrieved passages.
Datasets: We evaluate our method on three widely used QA benchmarks: Natural Questions (NQ),
TriviaQA, and SQuAD. Each dataset has distinct content and structure, posing diverse challenges for
the FiD method.
4.1 Comparison with Existing Methods
In this experiment, we compare the FiD method's performance against other popular approaches, including an extractive reader model, sparse retrieval (BM25), dense retrieval (DPR), and standalone generative models such as T5 and BART.
A. Exact Match (EM) We use the EM score to evaluate the percentage of generated answers that
exactly match the reference answers. Table 1 presents the EM scores for each method on the three
datasets.
Table 1: Exact Match (EM) scores of FiD and baseline methods on Natural Questions, TriviaQA, and SQuAD.
Method Natural Questions TriviaQA SQuAD
Extractive Model 37.5 59.5 52.8
BM25-based Retrieval 39.8 61.2 54.2
DPR-based Retrieval 42.4 63.1 55.4
T5 Generative Model 46.3 64.9 57.3
BART Generative Model 45.8 65.3 56.7
Fusion-in-Decoder (FiD) 50.7 66.1 60.2
The results show that the FiD method surpasses all other methods in terms of EM scores across all
three datasets.
B. F1 Score We also use F1 scores to assess the quality of generated responses based on token overlap, considering both precision and recall without requiring an exact match. Table 2 presents
the F1 scores for each method on the three datasets.
Table 2: F1 scores of FiD and baseline methods on the three benchmarks.
Method Natural Questions TriviaQA SQuAD
Extractive Model 43.9 64.8 56.7
BM25-based Retrieval 46.3 66.4 58.2
DPR-based Retrieval 48.7 68.4 59.9
T5 Generative Model 52.6 70.2 62.1
BART Generative Model 52.1 70.6 61.5
Fusion-in-Decoder (FiD) 57.0 71.4 65.0
As observed, the FiD method consistently performs better across all datasets, achieving higher F1
scores than other methods.
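For concreteness, both metrics reported above can be computed as in the sketch below, which uses the standard SQuAD-style answer normalization (lowercasing, stripping punctuation and articles); the exact normalization rules in our evaluation scripts may differ slightly.

```python
# Sketch of Exact Match and token-level F1 with SQuAD-style answer
# normalization; the precise normalization used in our evaluation may differ.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 1.0 after normalization
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))  # partial token overlap
```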
4.2 Impact of Different Retrieval Models
In this experiment, we analyze the FiD method’s performance when utilizing different retrieval
methods, namely sparse (BM25) and dense (DPR) retrieval, for retrieving support passages. Our
results demonstrate that the FiD method achieves improved EM scores when employing dense
retrieval techniques.
Table 3: Exact Match (EM) scores of FiD with sparse (BM25) versus dense (DPR) retrieval.
Retrieval Model Natural Questions TriviaQA SQuAD
FiD with BM25 49.2 65.0 59.1
FiD with DPR 50.7 66.1 60.2
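As an illustration of the dense alternative, the sketch below ranks passages with the pretrained DPR question and context encoders available in Hugging Face Transformers. The facebook/dpr-*-single-nq-base checkpoints and the tiny in-memory corpus are assumptions for the example; a production system would pre-compute passage embeddings and index them (e.g., with FAISS) rather than encoding them per query.

```python
# Sketch of DPR-style dense retrieval: embed the question and the passages with
# the pretrained DPR encoders and rank passages by dot-product similarity.
# Checkpoints and corpus are illustrative; real systems pre-compute and index
# passage embeddings (e.g., with FAISS) instead of encoding on the fly.
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Everest is the highest mountain above sea level.",
]

def dense_retrieve(question: str, k: int = 1) -> list[str]:
    with torch.no_grad():
        q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
        ctx_inputs = ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
        ctx_emb = ctx_encoder(**ctx_inputs).pooler_output
    scores = torch.matmul(q_emb, ctx_emb.T).squeeze(0)  # dot-product similarity
    top = torch.topk(scores, k=min(k, len(passages))).indices
    return [passages[i] for i in top.tolist()]

print(dense_retrieve("Where is the Eiffel Tower?", k=1))
```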
4.3 Influence of Generative Models
We investigate the FiD method’s performance when employing different generative models, specifi-
cally T5 and BART. The results show that while both models produce competitive EM scores, T5
slightly surpasses the performance of BART in our experiments.
Table 4: Exact Match (EM) scores of FiD with T5 versus BART as the generative model.
Generative Model Natural Questions TriviaQA SQuAD
FiD with T5 50.7 66.1 60.2
FiD with BART 49.3 65.6 59.7
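The generative backbone is a drop-in choice in our pipeline. The hedged sketch below loads either model through the Hugging Face Auto classes and runs the same formatted input through both; the public t5-base and facebook/bart-base checkpoints are used here only for illustration and are not necessarily the model sizes from our experiments.

```python
# Sketch: swapping the generative backbone (T5 vs. BART) without changing the
# rest of the pipeline. Checkpoint names are public Hugging Face models and are
# not necessarily the sizes used in our experiments.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def generate_answer(model_name: str, question: str, passages: list[str]) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    prompt = "question: " + question + " context: " + " ".join(passages)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, num_beams=4, max_length=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

for name in ("t5-base", "facebook/bart-base"):
    print(name, generate_answer(name, "Who wrote Hamlet?",
                                ["Hamlet is a tragedy by William Shakespeare."]))
```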
4.4 Role of Retrieved Passages
To study the effect of the number of retrieved passages on the FiD method’s performance, we conduct
experiments with varying numbers of support passages ranging from 1 to 100. Table 5 presents the EM scores for different numbers of retrieved passages.
Table 5: Exact Match (EM) scores of FiD with varying numbers of retrieved support passages.
Number of Passages Natural Questions TriviaQA SQuAD
1 38.3 56.8 53.6
5 47.2 62.6 58.8
10 48.6 64.1 59.5
25 49.9 65.2 60.0
50 50.5 65.8 60.4
100 50.7 66.1 60.2
The results demonstrate that increasing the number of retrieved passages typically leads to improved
QA performance. However, the improvements begin to plateau after a certain number of passages. In
our experiments, retrieving 50-100 passages provided optimal results.
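The sweep itself can be scripted as in the sketch below, which continues the earlier snippets: answer() and exact_match() refer to the illustrative helpers defined in the Section 3 and Section 4.1 sketches, not to a released API, and the two-example development set is a placeholder.

```python
# Sketch of the passage-count sweep: for each k, retrieve k passages per
# question, generate an answer, and average Exact Match over a dev set.
# answer() and exact_match() come from the earlier sketches and are
# illustrative helpers, not a released API; the dev set is a placeholder.
dev_set = [
    {"question": "Where is the Eiffel Tower located?", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

for k in (1, 5, 10, 25, 50, 100):
    scores = []
    for example in dev_set:
        prediction = answer(example["question"], k=k)
        scores.append(exact_match(prediction, example["answer"]))
    print(f"k={k:>3}  EM={100 * sum(scores) / len(scores):.1f}")
```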
4.5 Analysis of Response Length
As generative models can produce answers of varying lengths, we analyze the performance of FiD as a function of generated answer length. Our findings indicate that the FiD method performs comparably well on short and long answers, maintaining consistently high EM and F1 scores across response lengths (Table 6).
Table 6: FiD performance by generated answer length.
Length Natural Questions TriviaQA SQuAD
Short 51.6 66.7 60.6
Medium 50.2 65.9 60.1
Long 50.4 65.8 60.0
These comprehensive experiments, spanning multiple dimensions of analysis, reveal that the Fusion-
in-Decoder method excels in open-domain question answering tasks. The method effectively leverages
the strengths of retrieval and generative models, achieving state-of-the-art results and significantly
outperforming existing methods. By doing so, it demonstrates its potential for practical applications
in large-scale question-answering systems.
5 Future Work
We plan to optimize the Fusion-in-Decoder method to handle even larger numbers of support passages,
as well as explore more efficient methods for scaling. Furthermore, we are interested in incorporating
retrieval directly into the model and learning the entire system end-to-end.
6 Conclusion
In this whitepaper, we introduced the Fusion-in-Decoder approach to open-domain question answer-
ing, showcasing its effectiveness and scalability compared to existing methods. By combining the
power of generative models with passage retrieval, this method achieves state-of-the-art results on standard benchmarks while demonstrating its potential for practical applications.
7 References
Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020.
Learning to retrieve reasoning paths over Wikipedia graph for question answering. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
OpenReview.net.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase
from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-
Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,
NeurIPS 2020, December 6-12, 2020, virtual.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer
open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for
Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM:
Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
Srinivasan Iyer, Sewon Min, Yashar Mehdad, and Wen-tau Yih. 2021. RECONSIDER: Improved
re-ranking using span-focused cross-attention for open domain question answering. In Proceedings
of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1280–1287, Online. Association for Computational Linguistics.
Gautier Izacard and Edouard Grave. 2020. Distilling knowledge from reader to retriever for question
answering. arXiv preprint arXiv:2012.04584.
Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for
open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs.
IEEE Transactions on Big Data.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale
distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
1601–1611, Vancouver, Canada. Association for Computational Linguistics.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi
Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional
networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France,
April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International
Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net.
Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu
Chen. 2021. Reader-guided passage reranking for open-domain question answering. In Findings
of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 344–350, Online. Association for Computational Linguistics.
Sewon Min, Danqi Chen, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868.
Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael
Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Scott Yih. 2020. Unified open-domain question
answering with structured and unstructured knowledge. arXiv preprint arXiv:2012.14610.
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu,
and Haifeng Wang. 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5835–5847, Online. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+
questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for
Computational Linguistics.
Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.
Devendra Sachan, Mostofa Patwary, Mohammad Shoeybi, Neel Kant, Wei Ping, William L. Hamilton, and Bryan Catanzaro. 2021. End-to-end training of neural retrievers for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6648–6662, Online. Association for Computational Linguistics.
Haitian Sun, Tania Bedrax-Weiss, and William Cohen. 2019. PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2380–2390, Hong Kong, China. Association for Computational Linguistics.
Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William
Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages
4231–4242, Brussels, Belgium. Association for Computational Linguistics.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
OpenReview.net.
Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. 2019. Deep graph infomax. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.
Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry
Tesauro, Bowen Zhou, and Jing Jiang. 2018. R³: Reinforced ranker-reader for open-domain question
answering. In Thirty-Second AAAI Conference on Artificial Intelligence.
Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5878–5882, Hong Kong, China. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers:
State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. Improving
question answering over incomplete KBs with knowledge-aware reader. In Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics, pages 4258–4264, Florence, Italy. Association for Computational Linguistics.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural
networks? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans,
LA, USA, May 6-9, 2019. OpenReview.net.
Ruochen Xu, Yuwei Fang, Chenguang Zhu, and Michael Zeng. 2021a. Does knowledge help general
NLU? An empirical study. arXiv preprint arXiv:2109.00563.
Yichong Xu, Chenguang Zhu, Ruochen Xu, Yang Liu, Michael Zeng, and Xuedong Huang. 2021b.
Fusing context into knowledge graph for commonsense question answering. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1201–1207, Online. Association for Computational Linguistics.
Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy
Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics
(Demonstrations), pages 72–77, Minneapolis, Minnesota. Association for Computational Linguistics.
Donghan Yu, Chenguang Zhu, Yiming Yang, and Michael Zeng. 2020. JAKET: Joint pre-training of
knowledge graph and language understanding. arXiv preprint arXiv:2010.00796.
Mantong Zhou, Zhouxing Shi, Minlie Huang, and Xiaoyan Zhu. 2020. Knowledge-aided open-
domain question answering. arXiv preprint arXiv:2006.05244.