Chapter

Attend to Your Review: A Deep Neural Network to Extract Aspects from Peer Reviews


Abstract

The peer-review process is fraught with issues such as bias, inconsistency, arbitrariness, and non-committal weak rejects. Nevertheless, peer reviews are expected to provide constructive feedback to the authors on aspects of the paper such as Motivation/Impact, Soundness/Correctness, Novelty, and Substance. A good review evaluates a paper through the lens of these aspects. An automated system to extract these implicit aspects from reviews would help determine the quality of a peer review. In this work, we propose a deep neural architecture to extract the aspects of the paper on which the reviewer commented in their review. Our automatic aspect-extraction model, based on BERT and a neural attention mechanism, achieves superior performance over the standard baselines. We make our code, analyses and other materials available at https://github.com/cruxieu17/aspect-extraction-peer-reviews.
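As a rough illustration of the kind of architecture the abstract describes (a BERT encoder with attention pooling feeding a multi-label aspect classifier), here is a minimal sketch in PyTorch. The label set, model name, and pooling choice are assumptions for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
# Illustrative sketch only: a BERT encoder with additive attention pooling
# and a multi-label head over review aspects. Names and dimensions are
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

ASPECTS = ["motivation", "soundness", "novelty", "substance"]  # illustrative label set

class AspectExtractor(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_aspects=len(ASPECTS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.attn = nn.Linear(hidden, 1)           # token-level attention scores
        self.classifier = nn.Linear(hidden, num_aspects)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        scores = self.attn(states).squeeze(-1)                 # (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, -1e9) # ignore padding
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # attention weights
        pooled = (weights * states).sum(dim=1)                 # attention-pooled review
        return self.classifier(pooled)                         # one logit per aspect

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["The novelty is limited, but the results look sound."],
                  return_tensors="pt", truncation=True, padding=True)
logits = AspectExtractor()(batch["input_ids"], batch["attention_mask"])
probs = torch.sigmoid(logits)  # multi-label probabilities, one per aspect
```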


... The authors of DeepSentiPeer [13] used three channels of information: the paper, the corresponding review, and the review polarity to predict the overall recommendation as well as the final decision. There are a few other works on NLP/ML for peer-review problems [14,15], such as aspect extraction [16] and sentiment analysis, which are worth exploring to understand the related NLP/ML investigations in this domain. ...
Article
Full-text available
With the ever-increasing number of submissions in top-tier conferences and journals, finding good reviewers and meta-reviewers is becoming increasingly difficult. Writing a meta-review is not straightforward as it involves a series of sub-tasks, including making a decision on the paper based on the reviewer’s recommendation and their confidence in the recommendation, mitigating disagreements among the reviewers, and other similar tasks. In this work, we develop a novel approach to automatically generate meta-reviews that are decision-aware and which also take into account a set of relevant sub-tasks in the peer-review process. More specifically, we first predict the recommendation scores and confidence scores for the reviews, using which we then predict the decision on a particular manuscript. Finally, we utilize the decision signals for generating the meta-reviews using a transformer-based seq2seq architecture. Our proposed pipelined approach for automatic decision-aware meta-review generation achieves significant performance improvement over the standard summarization baselines as well as relevant prior works on this problem. We make our codes available at https://github.com/saprativa/seq-to-seq-decision-aware-mrg.
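The pipelined idea (predict recommendation and confidence scores, derive a decision, then condition a seq2seq generator on that decision) could be sketched roughly as follows. The decision rule, prompt format, and model choice are illustrative assumptions, not the repository's implementation.

```python
# Schematic of a decision-aware, pipelined meta-review generator.
# The decision rule and prompt format below are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def predict_decision(recommendations, confidences):
    # Placeholder decision rule: confidence-weighted mean recommendation.
    score = sum(r * c for r, c in zip(recommendations, confidences)) / sum(confidences)
    return "accept" if score >= 3.5 else "reject"

def generate_meta_review(reviews, decision, model_name="facebook/bart-large-cnn"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Prepend the predicted decision so the generator is decision-aware.
    source = f"decision: {decision} " + " ".join(reviews)
    inputs = tok(source, return_tensors="pt", truncation=True)
    ids = model.generate(**inputs, max_length=128, num_beams=4)
    return tok.decode(ids[0], skip_special_tokens=True)

reviews = ["Strong empirical results but limited novelty.",
           "Well written; the ablation study could be deeper."]
decision = predict_decision(recommendations=[4, 3], confidences=[4, 3])
print(generate_meta_review(reviews, decision))
```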
... We envisage that our proposed dataset BetterPR will propel research to answer this question. Some relevant works on NLP/ML for peer reviews are worth exploring in this regard [6][7][8][11][12][13][14][15][19][20][21][22][27]. ...
Chapter
Full-text available
Review comments play an important role in the improvement of scientific articles. There are typically many rounds of review and revision before the different reviewers, with varying backgrounds, arrive at a consensus on a submission. However, the reviews are not always helpful. Sometimes the reviewers are unnecessarily critical of the work without justifying their comments. Peer reviews are always meant to be critical yet constructive feedback on the scientific merit of a submitted article. However, with the rising number of paper submissions leading to the involvement of novice or less experienced reviewers in the reviewing process, the reviewers tend to spend less expert time on their voluntary reviewing job. This results in lackluster reviews from which the authors do not have many takeaways. The entire scientific enterprise is heavily dependent on this very human peer-review process. In this paper, we make an attempt to automatically distinguish between constructive and non-constructive peer reviews. We deem a constructive comment to be one that, despite being critical, is polite and provides feedback to the authors to improve their submissions. To this end, we present BetterPR, a manually annotated dataset to estimate the constructiveness of peer-review comments. Further, we benchmark BetterPR with standard baselines and analyze their performance. We collect the peer reviews from open-access forums and design an annotation scheme to label whether a review comment is constructive or non-constructive. We provide our dataset and codes (https://github.com/PrabhatkrBharti/BetterPR.git) for further exploration by the community. Keywords: Peer reviews, Review constructiveness, Peer review quality
... The first is to select the embedding of the first sentence or chunk, and the second is to take the mean over all embeddings. Prior work has found averaging sentence embeddings to work well for document retrieval (Yang et al., 2019), aspect extraction (Verma et al., 2021), and fake news identification (Slovikovskaya & Attardi, 2020). The second approach is therefore taken for the sentence-based models. ...
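The second approach (averaging sentence embeddings into a single document vector) is straightforward to sketch; the encoder name below is an assumption for illustration, not the one used in the cited work.

```python
# Sketch of mean-pooling sentence embeddings into a document representation.
# The encoder choice is an assumption for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def document_embedding(sentences):
    vectors = encoder.encode(sentences)   # (num_sentences, dim)
    return np.mean(vectors, axis=0)       # single document vector

doc = ["The appellant was convicted under the Hunting Act.",
       "The court considered the welfare of the animals involved."]
vec = document_embedding(doc)
print(vec.shape)
```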
Article
Full-text available
Judgments concerning animals have arisen across a variety of established practice areas. There is, however, no publicly available repository of judgments concerning the emerging practice area of animal protection law. This has hindered the identification of individual animal protection law judgments and comprehension of the scale of animal protection law made by courts. Thus, we detail the creation of an initial animal protection law repository using natural language processing and machine learning techniques. This involved domain expert classification of 500 judgments according to whether or not they were concerned with animal protection law. 400 of these judgments were used to train various models, each of which was used to predict the classification of the remaining 100 judgments. The predictions of each model were superior to a baseline measure intended to mimic current searching practice, with the best performing model being a support vector machine (SVM) approach that classified judgments according to term frequency-inverse document frequency (TF-IDF) values. Investigation of this model consisted of considering its most influential features and conducting an error analysis of all incorrectly predicted judgments. This showed the features indicative of animal protection law judgments to include terms such as ‘welfare’, ‘hunt’ and ‘cull’, and that incorrectly predicted judgments were often deemed marginal decisions by the domain expert. The TF-IDF SVM was then used to classify non-labelled judgments, resulting in an initial animal protection law repository. Inspection of this repository suggested that there were 175 animal protection judgments between January 2000 and December 2020 from the Privy Council, House of Lords, Supreme Court and upper England and Wales courts.
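A minimal sketch of the described TF-IDF + SVM classifier, with toy judgment snippets standing in for the actual corpus; the vectorizer settings are assumptions, not the paper's configuration.

```python
# Toy sketch of a TF-IDF + linear SVM judgment classifier.
# Data and feature settings are illustrative assumptions.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

judgments = ["The cull of badgers was challenged on welfare grounds.",
             "The dispute concerned a commercial lease of office premises."]
labels = [1, 0]  # 1 = animal protection law, 0 = not

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1), LinearSVC())
clf.fit(judgments, labels)
print(clf.predict(["The hunt was found to have breached welfare regulations."]))
```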
Chapter
With the ever-increasing number of submissions in top-tier conferences and journals, finding good reviewers and meta-reviewers is becoming increasingly difficult. Writing a meta-review is not straightforward as it involves a series of sub-tasks, including making a decision on the paper based on the reviewer’s recommendation and their confidence in the recommendation, mitigating disagreements among the reviewers, etc. In this work, we develop a novel approach to automatically generate meta-reviews that are decision-aware and which also take into account a set of relevant sub-tasks in the peer-review process. Our initial pipelined approach for automatic decision-aware meta-review generation achieves significant performance improvement over the standard summarization baselines and relevant prior works on this problem. We make our codes available at https://github.com/saprativa/seq-to-seq-decision-aware-mrg. Keywords: Meta-review generation, Peer-review, Decision-aware
Article
Full-text available
The purpose of this paper is to analyze the causes and effects of arbitrariness in the peer review process. This paper focuses on two main reasons for the arbitrariness in peer review. The first is that referees are not homogeneous and display homophily in their taste and perception of innovative ideas. The second is that reviewers differ in the time they allocate for peer review. Our model replicates the NIPS experiment of 2014, showing that the ratings of peer review are not robust, and that altering reviewers leads to a dramatic impact on the ranking of the papers. This paper also shows that innovative works are not highly ranked in the existing peer review process, and as a consequence are often rejected.
Article
Full-text available
Significance Scientific peer review has been a cornerstone of the scientific method since the 1600s. Debate continues regarding the merits of single-blind review, in which anonymous reviewers know the authors of a paper and their affiliations, compared with double-blind review, in which this information is hidden. We present an experimental study of this question. In computer science, research often appears first or exclusively in peer-reviewed conferences rather than journals. Our study considers full-length submissions to the highly selective 2017 Web Search and Data Mining conference (15.6% acceptance rate). Each submission is simultaneously scored by two single-blind and two double-blind reviewers. Our analysis shows that single-blind reviewing confers a significant advantage to papers with famous authors and authors from high-prestige institutions.
Article
Full-text available
Neural Information Processing Systems (NIPS) is a top-tier annual conference in machine learning. The 2016 edition of the conference comprised more than 2,400 paper submissions, 3,000 reviewers, and 8,000 attendees, representing a growth of nearly 40% in terms of submissions, 96% in terms of reviewers, and over 100% in terms of attendees as compared to the previous year. In this report, we analyze several aspects of the data collected during the review process, including an experiment investigating the efficacy of collecting ordinal rankings from reviewers (vs. the usual scores, i.e., cardinal rankings). Our goal is to check the soundness of the review process we implemented and, in doing so, provide insights that may be useful in the design of the review process of subsequent conferences. We introduce a number of metrics that could be used for monitoring improvements when new ideas are introduced.
Conference Paper
Full-text available
Identifying peer-review helpfulness is an important task for improving the quality of feedback that students receive from their peers. As a first step towards enhancing existing peer-review systems with new functionality based on helpfulness detection, we examine whether standard product review analysis techniques also apply to our new context of peer reviews. In addition, we investigate the utility of incorporating additional specialized features tailored to peer review. Our preliminary results show that the structural features, review unigrams and meta-data combined are useful in modeling the helpfulness of both peer reviews and product reviews, while peer-review-specific auxiliary features can further improve helpfulness prediction.
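One way to combine review unigrams with structural and meta-data features, as the abstract describes, is a single scikit-learn pipeline; the specific features and toy data below are assumptions for illustration, not the paper's feature set.

```python
# Illustrative sketch: combine review unigrams with simple structural and
# meta-data features for helpfulness prediction. Features and data are toy
# assumptions, not the paper's exact setup.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

data = pd.DataFrame({
    "text": ["Add a baseline comparison in Table 2.", "Nice paper."],
    "num_sentences": [1, 1],        # structural feature
    "reviewer_rating": [4, 2],      # meta-data feature
    "helpful": [1, 0],
})

features = ColumnTransformer([
    ("unigrams", CountVectorizer(), "text"),
    ("extra", "passthrough", ["num_sentences", "reviewer_rating"]),
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(data[["text", "num_sentences", "reviewer_rating"]], data["helpful"])
```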
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
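A minimal PyTorch illustration of the LSTM interface over a long sequence; the sizes are arbitrary and only meant to show the shapes involved, not to reproduce the paper's experiments.

```python
# Minimal LSTM usage sketch: the gated cell state carries information
# across long time lags. Sizes are arbitrary and only illustrate shapes.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 1000, 8)        # batch of 2 sequences, 1000 time steps
output, (h_n, c_n) = lstm(x)       # c_n is the gated cell (memory) state
print(output.shape, h_n.shape, c_n.shape)  # (2, 1000, 16) (1, 2, 16) (1, 2, 16)
```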
Article
The rapid development of science and technology has been accompanied by an exponential growth in peer-reviewed scientific publications. At the same time, the review of each paper is a laborious process that must be carried out by subject matter experts. Thus, providing high-quality reviews of this growing number of papers is a significant challenge. In this work, we ask the question “can we automate scientific reviewing?”, discussing the possibility of using natural language processing (NLP) models to generate peer reviews for scientific papers. Because it is non-trivial to define what a “good” review is in the first place, we first discuss possible evaluation metrics that could be used to judge success in this task. We then focus on the machine learning domain and collect a dataset of papers in the domain, annotate them with different aspects of content covered in each review, and train targeted summarization models that take in papers as input and generate reviews as output. Comprehensive experimental results on the test set show that while system-generated reviews are comprehensive, touching upon more aspects of the paper than human-written reviews, the generated texts are less constructive and less factual than human-written reviews for all aspects except the explanation of the core ideas of the papers, which are largely factually correct. Given these results, we pose eight challenges in the pursuit of a good review generation system together with potential solutions, which, hopefully, will inspire more future research in this direction. We make relevant resources publicly available for use by future research: https://github.com/neulab/ReviewAdvisor. In addition, while our conclusion is that the technology is not yet ready for use in high-stakes review settings, we provide a system demo, ReviewAdvisor (http://review.nlpedia.ai/), showing the current capabilities and failings of state-of-the-art NLP models at this task (see demo screenshot in A.2). A review of this paper written by the system proposed in this paper can be found in A.1.
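One of the findings above is that system-generated reviews cover more aspects than human-written ones. The toy sketch below shows how aspect coverage might be measured, using keyword lists as a crude stand-in for the paper's aspect annotations; the aspect names and keywords are assumptions for illustration.

```python
# Toy aspect-coverage check: which review aspects does a given review mention?
# Keyword lists here are illustrative assumptions, not the paper's annotations.
ASPECT_KEYWORDS = {
    "novelty": ["novel", "original", "new"],
    "soundness": ["sound", "correct", "valid"],
    "clarity": ["clear", "well written", "readable"],
}

def aspect_coverage(review_text):
    text = review_text.lower()
    covered = {a for a, kws in ASPECT_KEYWORDS.items()
               if any(k in text for k in kws)}
    return len(covered) / len(ASPECT_KEYWORDS), covered

score, covered = aspect_coverage("The method is novel and the paper is well written.")
print(score, covered)   # 0.67, {'novelty', 'clarity'}
```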
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
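The core operation of the Transformer is scaled dot-product attention, which can be written in a few lines; the tensor shapes below are illustrative.

```python
# Scaled dot-product attention, the core operation of the Transformer.
# Shapes are illustrative; real models add multiple heads, masks, and projections.
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # query-key similarities
    weights = torch.softmax(scores, dim=-1)          # attention distribution
    return weights @ v                               # weighted sum of values

q = k = v = torch.randn(2, 5, 64)   # (batch, seq_len, d_k)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                    # torch.Size([2, 5, 64])
```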
Article
The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish selected posts or excerpts. John Langford examines the results of the NIPS experiment, while Mark Guzdial considers the role of class size in teaching computer science.
Transformers: state-of-the-art natural language processing
  • T Wolf
Avoiding a tragedy of the commons in the peer review process
  • D Sculley
  • J Snoek
  • A Wiltschko
An open review of OpenReview: a critical analysis of the machine learning conference review process
  • D Tran
BERT: pre-training of deep bidirectional transformers for language understanding
  • J Devlin
  • M W Chang
  • K Lee
  • K Toutanova