Conference Paper

Hierarchical Catalogue Generation for Literature Review: A Benchmark

... This has prompted development of tools for efficient literature review (Altmami and Menai, 2022). Most tools have focused on automating review generation, treating it as a multi-document summarization task (Mohammad et al., 2009; Jha et al., 2015; Wallace et al., 2020; DeYoung et al., 2021; Liu et al., 2022), sometimes using intermediate structures such as hierarchies/outlines to better scaffold generation (Zhu et al., 2023), with limited success. However, recent work on assessing the utility of NLP tools like LLMs for systematic review reveals that domain experts prefer literature review tools to be assistive instead of automatic (Yun et al., 2023). ...
... Other work has focused on the task of generating related work sections for a scientific paper (Hoang and Kan, 2010; Hu and Wan, 2014; Li et al., 2022; Wang et al., 2022), which, while similar in nature to literature review, has a narrower scope and expects more concise generation outputs. Finally, motivated by the ever-improving capabilities of generative models, some prior work has attempted to automate end-to-end review generation, treating it as multi-document summarization, with limited success (Mohammad et al., 2009; Jha et al., 2015; Wallace et al., 2020; DeYoung et al., 2021; Liu et al., 2022; Zhu et al., 2023). Of these, Zhu et al. (2023) generate intermediate hierarchical outlines to scaffold literature review generation, but unlike our work, they do not produce multiple organizations for the same set of related studies. Additionally, we focus solely on the problem of organizing related studies for literature review, leaving review generation and writing assistance to future work. ...
Preprint
Full-text available
Literature review requires researchers to synthesize a large amount of information and is increasingly challenging as the scientific literature expands. In this work, we investigate the potential of LLMs for producing hierarchical organizations of scientific studies to assist researchers with literature review. We define hierarchical organizations as tree structures where nodes refer to topical categories and every node is linked to the studies assigned to that category. Our naive LLM-based pipeline for hierarchy generation from a set of studies produces promising yet imperfect hierarchies, motivating us to collect CHIME, an expert-curated dataset for this task focused on biomedicine. Given the challenging and time-consuming nature of building hierarchies from scratch, we use a human-in-the-loop process in which experts correct errors (both links between categories and study assignment) in LLM-generated hierarchies. CHIME contains 2,174 LLM-generated hierarchies covering 472 topics, and expert-corrected hierarchies for a subset of 100 topics. Expert corrections allow us to quantify LLM performance, and we find that while LLMs are quite good at generating and organizing categories, their assignment of studies to categories could be improved. We also train a corrector model with human feedback, which improves study assignment by 12.6 F1 points. We release our dataset and models to encourage research on developing better assistive tools for literature review.
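The abstract defines hierarchical organizations as trees whose nodes are topical categories, each linked to the studies assigned to that category. A minimal sketch of such a structure in Python follows; the class and field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class CategoryNode:
    """A topical category in a literature-review hierarchy (illustrative)."""
    name: str
    studies: list[str] = field(default_factory=list)    # IDs of studies assigned here
    children: list["CategoryNode"] = field(default_factory=list)

    def all_studies(self) -> set[str]:
        """Collect study IDs assigned to this category or any subcategory."""
        found = set(self.studies)
        for child in self.children:
            found |= child.all_studies()
        return found

# Toy hierarchy for a hypothetical biomedical topic.
root = CategoryNode("Treatments for condition X", children=[
    CategoryNode("Pharmacological", studies=["s1", "s2"]),
    CategoryNode("Behavioral", studies=["s3"]),
])
print(root.all_studies())  # {'s1', 's2', 's3'}
```

Evaluating study assignment against an expert-corrected tree, as the paper does with F1, then reduces to comparing the study sets attached to matched categories.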
... Given a set of reference article abstracts relevant to a proposal, Zhu et al. (2023) and Martin-Boyle et al. (2024) auto-generate thematic categories in a hierarchical form (termed a catalogue) and organize the references under them. However, the results demonstrate that the auto-generated catalogue does not match the original author-defined catalogue, leading to discrepancies in downstream literature review generation. ...
Preprint
Full-text available
Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on that uses the OpenAI gpt-4o model. ChatGPT was used to clean the extracted data and generate code for figures in this manuscript; ChatGPT and Scite.ai were used in drafting all components of the manuscript except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the most dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used an LLM during their creation. Most citations focused on automation of a particular stage of the review, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former were better at data extraction, with mean precision 83.0% (SD=10.4) and recall 86.0% (SD=9.8), while being slightly less accurate in the title and abstract screening stage (mean accuracy 77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results look promising, and we anticipate that LLMs will change the way scientific reviews are conducted in the near future.
Article
Full-text available
Background: The academic publishing world is changing significantly, with ever-growing numbers of publications each year and shifting publishing patterns. However, the metrics used to measure academic success, such as the number of publications, citation number, and impact factor, have not changed for decades. Moreover, recent studies indicate that these metrics have become targets and follow Goodhart's Law, according to which, "when a measure becomes a target, it ceases to be a good measure." Results: In this study, we analyzed >120 million papers to examine how the academic publishing world has evolved over the last century, with a deeper look into the specific field of biology. Our study shows that the validity of citation-based measures is being compromised and their usefulness is lessening. In particular, the number of publications has ceased to be a good metric as a result of longer author lists, shorter papers, and surging publication numbers. Citation-based metrics, such as citation number and h-index, are likewise affected by the flood of papers, self-citations, and lengthy reference lists. Measures such as a journal's impact factor have also ceased to be good metrics due to the soaring numbers of papers that are published in top journals, particularly from the same pool of authors. Moreover, by analyzing the properties of >2,600 research fields, we observed that citation-based metrics are not beneficial for comparing researchers in different fields, or even in the same department. Conclusions: Academic publishing has changed considerably; now we need to reconsider how we measure success.
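Among the citation-based metrics the study examines, the h-index has a simple standard definition: the largest h such that the author has h papers with at least h citations each. A minimal sketch of the computation:

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    cited = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(cited, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers each have >= rank citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with at least 4 citations
print(h_index([25, 8, 5, 3, 3]))  # 3: one highly cited paper cannot raise h alone
```

The second example illustrates the paper's point that such metrics compress a publication record into a single number that behaves unintuitively across fields with different citation volumes.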
Conference Paper
Full-text available
We introduce the novel problem of automatic related work summarization. Given multiple articles (e.g., conference/journal papers) as input, a related work summarization system creates a topic-biased summary of related work specific to the target paper. Our prototype Related Work Summarization system, ReWoS, takes in a set of keywords arranged in a hierarchical fashion that describes a target paper's topics, to drive the creation of an extractive summary using two different strategies for locating appropriate sentences for general topics as well as detailed ones. Our initial results show an improvement over generic multi-document summarization baselines in a human evaluation.
Conference Paper
Full-text available
The number of research publications in various disciplines is growing exponentially. Researchers and scientists are increasingly finding themselves in the position of having to quickly understand large amounts of technical material. In this paper we present the first steps in producing an automatically generated, readily consumable technical survey. Specifically, we explore the combination of citation information and summarization techniques. Even though prior work (Teufel et al., 2006) argues that citation text is unsuitable for summarization, we show that in the framework of multi-document survey creation, citation texts can play a crucial role.
Article
The rapid explosion of scientific publications has made related work writing increasingly laborious. In this paper, we propose a fully automated approach to generate related work sections by leveraging a seq2seq neural network. In particular, the main goal of our work is to improve the abstractive generation of related work by introducing problem and method information, which serves as a pivot to connect the previous works in the related work section and has been ignored by existing studies. More specifically, we employ a title-generation strategy to automatically obtain problem and method information from given references and add this information as an additional feature to enhance the generation of related work. To verify the effectiveness and feasibility of our approach, we conduct a comparative experiment on publicly available datasets using several common neural summarizers. The experimental results indicate that the introduction of problem and method information contributes to better generation of related work, and our approach substantially outperforms the informed baseline on ROUGE-1 and ROUGE-L. The case study shows that the problem and method information enables considerable topic coherence between the generated related work section and the original paper.
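The core mechanism described, prepending extracted problem and method information to the input of a neural summarizer, can be sketched as below. This is a loose illustration, not the authors' setup: the model choice (BART via the Hugging Face pipeline) and the input formatting are assumptions, and the problem/method fields are taken as already extracted by the title-generation step.

```python
from transformers import pipeline

# Any seq2seq summarizer works for the sketch; BART is an illustrative choice.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def related_work_input(references: list[dict]) -> str:
    """Prefix each reference abstract with its (assumed pre-extracted)
    problem and method information as an additional textual feature."""
    parts = []
    for ref in references:
        parts.append(
            f"problem: {ref['problem']} method: {ref['method']} "
            f"abstract: {ref['abstract']}"
        )
    return " ".join(parts)

refs = [{"problem": "long-document encoding",
         "method": "sparse attention",
         "abstract": "We present a transformer for long documents..."}]
print(summarizer(related_work_input(refs), max_length=60)[0]["summary_text"])
```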
Conference Paper
We present the TCS Alignment Toolbox, which offers a flexible framework to calculate and visualize (dis)similarities between sequences in the context of educational data mining and intelligent tutoring systems. The toolbox offers a variety of alignment algorithms, allows for complex input sequences comprised of multi-dimensional elements, and is adjustable via rich parameterization options, including mechanisms for an automatic adaptation based on given data. Our demo shows an example in which the alignment measure is adapted to distinguish students' Java programs w.r.t. different solution strategies, via a machine learning technique.
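As background for the alignment measures the toolbox builds on, the basic global alignment distance between two sequences is a dynamic program; a minimal Levenshtein-style sketch is below. The toolbox itself offers far richer, parameterized and data-adapted variants, so this is only the underlying idea:

```python
def alignment_distance(a, b, mismatch=1.0, gap=1.0):
    """Global sequence alignment distance via dynamic programming."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap                      # align prefix of a to empty b
    for j in range(1, m + 1):
        d[0][j] = j * gap                      # align empty a to prefix of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / mismatch
                          d[i - 1][j] + gap,       # gap in b
                          d[i][j - 1] + gap)       # gap in a
    return d[n][m]

print(alignment_distance("kitten", "sitting"))  # 3.0
```

Replacing the fixed mismatch cost with a learned, feature-dependent cost is what allows such a measure to be adapted to distinguish, e.g., students' solution strategies.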
Conference Paper
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures included in the ROUGE summarization evaluation package: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
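The n-gram overlap idea behind ROUGE-N can be sketched in a few lines. This is a simplified single-reference recall with clipped counts, not the official ROUGE package:

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 2) -> float:
    """Simplified ROUGE-N recall: clipped n-gram overlap / reference n-grams."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # Clip each overlap count by its frequency in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))  # 0.6
```

In the example, 3 of the reference's 5 bigrams also occur in the candidate, giving a ROUGE-2 recall of 0.6.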
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge, and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages, and we show how to apply PageRank to search and to user navigation.
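The random-surfer model the abstract refers to reduces to a power iteration over the link graph. A compact sketch with the conventional damping factor of 0.85 (the jump probability of the idealized surfer):

```python
def pagerank(links: dict[str, list[str]], damping=0.85, iters=50):
    """Power-iteration PageRank; `links` maps a page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Base probability of a random jump to any page.
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new[target] += damping * rank[page] / len(outlinks)
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(web))  # 'c' accumulates the most rank: it is linked from both a and b
```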
Article
We introduce a stochastic graph-based method for computing the relative importance of textual units for Natural Language Processing. We test the technique on the problem of Text Summarization (TS). Extractive TS relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Our system, based on LexRank, ranked in first place in more than one task in the recent DUC 2004 evaluation. In this paper we present a detailed analysis of our approach and apply it to a larger data set including data from earlier DUC evaluations. We discuss several methods to compute centrality using the similarity graph. The results show that degree-based methods (including LexRank) outperform both centroid-based methods and other systems participating in DUC in most of the cases. Furthermore, the LexRank-with-threshold method outperforms the other degree-based techniques, including continuous LexRank. We also show that our approach is quite insensitive to the noise in the data that may result from an imperfect topical clustering of documents.
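LexRank, as described, runs eigenvector centrality (effectively PageRank) over a thresholded cosine-similarity graph of sentences. A minimal sketch follows; for brevity it uses raw term-frequency cosine similarity, whereas the paper uses TF-IDF weighting:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def lexrank(sentences: list[str], threshold=0.1, damping=0.85, iters=50):
    """Score sentences by centrality in a thresholded cosine-similarity graph."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Binary adjacency: connect sentence pairs whose similarity exceeds the threshold.
    adj = [[1.0 if i != j and cosine(vecs[i], vecs[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            incoming = sum(scores[j] * adj[j][i] / max(sum(adj[j]), 1.0)
                           for j in range(n))
            new.append((1.0 - damping) / n + damping * incoming)
        scores = new
    return scores

sents = ["the cat sat on the mat", "a cat was on the mat", "stocks fell today"]
print(lexrank(sents))  # the two topically connected sentences score higher
```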
Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale Song. 2022. Doc2PPT: Automatic presentation slides generation from scientific documents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 634-642.

Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: Reinforce schema representation with context. arXiv preprint arXiv:1908.08113.

Shuaiqi Liu, Jiannong Cao, Ruosong Yang, and Zhiyuan Wen. 2022. Generating a structured summary of numerous academic papers: Dataset and method. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 4259-4265. International Joint Conferences on Artificial Intelligence Organization. Main Track.

Kelvin Luu, Xinyi Wu, Rik Koncel-Kedziorski, Kyle Lo, Isabel Cachola, and Noah A. Smith. 2020. Explaining relationships between scientific documents. arXiv preprint arXiv:2002.00317.

Benjamin Paaßen. 2018. Revisiting the tree edit distance and its backtracing: A tutorial. arXiv preprint arXiv:1805.06869.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.

Prawat Trairatvorakul, Alexander Fabbri, and Dragomir R. Radev. SurveyTree: Automatic generation of survey structures for NLP and AI topics.