International Journal on Digital Libraries (2019) 20:335–350
https://doi.org/10.1007/s00799-018-0234-1
Comparing published scientific journal articles to their pre-print versions

Martin Klein¹ · Peter Broadwell² · Sharon E. Farb² · Todd Grappone²
Received: 18 May 2017 / Revised: 2 January 2018 / Accepted: 18 January 2018 / Published online: 5 February 2018
© 2018. This is a U.S. government work and its text is not subject to copyright protection in the United States; however, its text may be subject to foreign copyright protection.
Abstract
Academic publishers claim that they add value to scholarly communications by coordinating reviews and contributing and enhancing text during publication. These contributions come at a considerable cost: US academic libraries paid $1.7 billion for serial subscriptions in 2008 alone. Library budgets, in contrast, are flat and not able to keep pace with serial price inflation. We have investigated the publishers’ value proposition by conducting a comparative study of pre-print papers from two distinct science, technology, and medicine corpora and their final published counterparts. This comparison had two working assumptions: (1) If the publishers’ argument is valid, the text of a pre-print paper should vary measurably from its corresponding final published version, and (2) by applying standard similarity measures, we should be able to detect and quantify such differences. Our analysis revealed that the text contents of the scientific papers generally changed very little from their pre-print to final published versions. These findings contribute empirical indicators to discussions of the added value of commercial publishers and therefore should influence libraries’ economic decisions regarding access to scholarly publications.

Keywords: Open access · Pre-print · Scholarly publishing · Text similarity
Martin Klein (corresponding author): mklein@lanl.gov; martinklein0815@gmail.com
Peter Broadwell: broadwell@library.ucla.edu
Sharon E. Farb: farb@library.ucla.edu
Todd Grappone: grappone@library.ucla.edu
¹ Los Alamos National Laboratory, Los Alamos, NM, USA
² University of California, Los Angeles, Los Angeles, CA, USA

1 Introduction

Academic publishers of all types claim that they add value to scholarly communications by coordinating reviews and contributing and enhancing text during publication. These contributions come at a considerable cost: U.S. academic libraries paid $1.7 billion for serial subscriptions in 2008 alone, and this number continues to rise. Library budgets, in contrast, are flat and not able to keep pace with serial price inflation. Several institutions have therefore discontinued or significantly scaled back their subscription agreements with commercial publishers such as Elsevier and Wiley-Blackwell. We have investigated the publishers’ value proposition by conducting a comparative study of pre-print papers and their final published counterparts in the areas of science, technology, and medicine (STM). We have two working assumptions:

1. If the publishers’ argument is valid, the text of a pre-print paper should vary measurably from its corresponding final published version.
2. By applying standard similarity measures, we should be able to detect and quantify such differences.

In this paper, we present our preliminary results based on pre-print publications from arXiv.org and bioRxiv.org and their final published counterparts. After matching papers via their digital object identifier (DOI), we applied comparative analytics and evaluated the textual similarities of components of the papers such as the title, abstract, and body. Our analysis revealed that the text of the papers in our test data set changed very little from their pre-print to final published versions, although more copyediting changes were evident in the paper sets from bioRxiv.org than those from arXiv.org. In gen…
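To make the second working assumption concrete, the following Python sketch compares a pre-print and its published counterpart with two standard similarity measures, a character-level edit ratio and TF-IDF cosine similarity. The file paths and the choice of measures are illustrative assumptions, not the paper's actual pipeline.

# Sketch: compare a pre-print with its published version using two
# standard similarity measures. File paths below are hypothetical.
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def edit_ratio(a: str, b: str) -> float:
    # Character-level similarity in [0, 1]; 1.0 means identical text.
    return SequenceMatcher(None, a, b).ratio()

def tfidf_cosine(a: str, b: str) -> float:
    # Cosine similarity of TF-IDF vectors; insensitive to word order.
    vectors = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

preprint = open("arxiv_1234.5678v1.txt").read()    # hypothetical pre-print text
published = open("journal_version.txt").read()     # hypothetical published text

print(f"edit ratio:    {edit_ratio(preprint, published):.3f}")
print(f"tf-idf cosine: {tfidf_cosine(preprint, published):.3f}")

On nearly identical texts, both scores approach 1.0, which is the pattern the study reports for most pre-print/published pairs.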
... 23 Metascience studies suggest that the discrepancy between preprints and peer-reviewed articles is small and the quality of reporting is within a comparable range, supporting the validity of communicating research findings in preprints before review. 24,25 The results of our study and others [24][25][26][27][28][29][30] suggest that the reliability of data reported in preprints is generally high. Although there are measurable effects on research articles after peer review, such as the observed reduction in CIs, effect sizes are small. ...
Article
Full-text available
Scientists have expressed concern that the risk of flawed decision making is increased through the use of preprint data that might change after undergoing peer review. This Health Policy paper assesses how COVID-19 evidence presented in preprints changes after review. We quantified attrition dynamics of more than 1000 epidemiological estimates first reported in 100 preprints matched to their subsequent peer-reviewed journal publications. Point estimate values changed by an average of 6% during review; the correlation between estimate values before and after review was high (0.99) and there was no systematic trend. Expert peer-review scores of preprint quality were not related to eventual publication in a peer-reviewed journal. Uncertainty was reduced during peer review, with CIs narrowing by 7% on average. These results support the use of preprints, a component of the biomedical research literature, in decision making. These results can also help inform the use of preprints during the ongoing COVID-19 pandemic and future disease outbreaks.
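The three summary statistics in this abstract (average change of point estimates, before/after correlation, and CI narrowing) can be computed along the following lines; this is a toy Python sketch on invented numbers, not the study's data or code.

# Sketch of the concordance statistics reported above, on toy data.
import numpy as np

pre  = np.array([1.20, 0.85, 2.40, 0.50])   # estimates in the preprints (toy values)
post = np.array([1.15, 0.88, 2.40, 0.52])   # same estimates after peer review

pct_change = np.mean(np.abs(post - pre) / pre) * 100   # avg % change of point estimates
corr = np.corrcoef(pre, post)[0, 1]                    # before/after correlation

ci_width_pre  = np.array([0.60, 0.40, 1.10, 0.30])     # toy CI widths before review
ci_width_post = np.array([0.55, 0.38, 1.00, 0.29])     # toy CI widths after review
ci_shrink = np.mean(1 - ci_width_post / ci_width_pre) * 100  # avg CI reduction in %

print(f"point estimates changed {pct_change:.1f}% on average")
print(f"correlation before/after review: {corr:.2f}")
print(f"CIs narrowed by {ci_shrink:.1f}% on average")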
... The main arguments against preprints are that they take a long time to publish, allow for the possibility of being scooped, and aren't peer-reviewed [14,15,16,17,18,19]. This lack of peer review can lead to submissions containing inconsistent results or conclusions [8,9]. This trend was one of the driving factors for efforts to examine textual differences between preprints and their corresponding published versions [20,21]. Interestingly, these studies found that most differences between preprints and their corresponding published versions were small stylistic changes [20,21]. However, these studies only had limited data to analyze the differences. ...
Article
Scientific communication is essential for science as it enables the field to grow. This task is often accomplished through written forms such as preprints and published papers. We can obtain a high-level understanding of science and how scientific trends adapt over time by analyzing these resources. This thesis focuses on conducting multiple analyses using biomedical preprints and published papers. In Chapter 2, we explore the language contained within preprints and examine how this language changes due to the peer-review process. We find that token differences between published papers and preprints are stylistically based, suggesting that peer review results in modest textual changes. We also discovered that preprints are eventually published and adopted quickly within the life science community. Chapter 3 investigates how biomedical terms and tokens change their meaning and usage through time. We show that multiple machine learning models can correct for the latent variation contained within biomedical text. Also, we provide the scientific community with a listing of over 43,000 potential change points. Tokens with notable change points such as “sars” and “cas9” appear within our listing, providing some validation for our approach. In Chapter 4, we use the weak supervision paradigm to examine the possibility of speeding up the labeling function generation process for multiple biomedical relationship types. We found that the language used to describe a biomedical relationship is often distinct, leading to modest performance in terms of transferability. An exception to this trend is the Compound-binds-Gene and Gene-interacts-Gene relationship types.
... This obviously has consequences for "objective" science. For example, we know that agreement between peer-reviewers about manuscripts, and indeed research funding applications, is extremely weak, and that peer review often adds minimal quality enhancements to manuscripts (Carneiro et al., 2020; Klein et al., 2019). ...
... Others may intentionally use preprints to release replications and null results that are difficult to publish (27), or works in progress that may be less well written and inadequate at sharing data/code (17)(18)(19)(20). Many preprints actually report their results in a balanced way so as not to 'oversell' their findings (31), and there is growing evidence of high concordance between findings published in preprints and in peer-reviewed journals (25,(32)(33)(34)(35)(36)(37)(38)(39)). Nonetheless, a large portion of COVID-19 preprints show substantial changes in methods and results after peer review (nearly half of the preprints analysed by Oikonomidi (2020) and Nicolalde et al. (2020) (25,27)), suggesting flaws in the most essential elements of many COVID-19 preprints. ...
Article
Background: The quality of COVID-19 preprints should be considered with great care, as their contents can influence public policy. Efforts to improve preprint quality have mostly focused on introducing quick peer review, but surprisingly little has been done to calibrate the public’s evaluation of preprints and their contents. Purpose: The PRECHECK project aimed to generate a tool to teach and guide scientifically literate non-experts to critically evaluate preprints, on COVID-19 and beyond. Methods: To create a checklist, we applied a 4-step procedure consisting of an initial internal review, an external review by a pool of experts (methodologists, meta-researchers/experts on preprints, journal editors, and science journalists), a final internal review, and an implementation stage. For the external review step, experts rated the relevance of each element of the checklist on five-point Likert scales and provided written feedback. After each internal review round, we applied the checklist to a set of high-quality preprints from an online list of milestone research works on COVID-19 and to a set of low-quality preprints that were eventually retracted, to verify whether the checklist could discriminate between the two categories. Results: At the external review step, 26 of the 54 contacted experts responded. The final checklist contained 4 elements (Research question, Study type, Transparency and integrity, and Limitations), with ‘superficial’ and ‘deep’ levels of evaluation. When using both levels of evaluation, the checklist was effective at discriminating high- from low-quality preprints. Its usability was confirmed in workshops with our target audiences: Bachelor’s students in Psychology and Medicine, and science journalists. Conclusions: We created a simple, easy-to-use tool for helping scientifically literate non-experts navigate preprints with a critical mind. We believe that our checklist has great potential to help guide decisions about the quality of preprints on COVID-19 in our target audience and that this extends beyond COVID-19.
... Publishers sometimes promote, or claim to promote, posting manuscripts on specialized sites before peer review takes place. Growth in the number of manuscripts on preprint servers (Narock & Goldstein, 2019) is an order of magnitude greater than that of articles published in journals (Lin, 2018), even though the quality of the final publication is no higher than that of the preprint (Carneiro et al., 2019; Klein et al., 2019). This comparison can only be made for those preprints that later go through peer review and end up being published in the traditional way. ...
Article
Full-text available
Uruguay, like more than 190 other member countries, has signed the UNESCO Recommendation on Open Science, approved in November 2021. Open science is an ecosystem of interconnected processes built on several movements: open access, open data, open source, and reproducible open research, among others, whose goal is to make scientific research, data, and science communication accessible and inclusive at all levels of society. Implementing open science policies requires carefully balancing their costs and benefits. The experiences of some countries appear to be successful, although the feasibility of certain aspects raises doubts in the scientific community. Countries of the Global South have an opportunity to position themselves and benefit from this transition, but they must stay a step ahead and take part in its construction. This paper reviews the main concepts involved in implementing an open science system and offers some considerations about the system…
... In other words, we assume the same degree of revision between all pairs of preprints and publisher versions. Klein et al. [19] investigated the textual similarity of preprints and publisher versions using arXiv and bioRxiv, and reported no significant differences between them. On the other hand, Oikonomidi et al. [28] stated that the evidence components reported across preprints and publisher versions are not stable over time, focusing on COVID-19 research. ...
Conference Paper
Full-text available
Citing is an important aspect of scientific discourse and important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bias is produced. Existing works, however, have not analyzed preprints with respect to citation bias, although they play an increasingly important role in modern scholarly communication. In this paper, we investigate whether preprints are affected by citation bias with respect to author affiliation. We measure citation bias for bioRxiv preprints and their publisher versions at the institution level and country level, using the Lorenz curve and Gini coefficient. This allows us to mitigate the effects of confounding factors and see whether or not citation biases related to author affiliation have an increased effect on preprint citations. We observe consistently higher Gini coefficients for preprints than for publisher versions. Thus, we can confirm that citation bias exists and that it is more severe in the case of preprints. As preprints are on the rise, affiliation-based citation bias is thus an important topic not only for authors (e.g., when deciding what to cite), but also for people and institutions that use citations for scientific impact quantification (e.g., funding agencies deciding about funding based on citation counts).
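A minimal sketch of the inequality measure used in this work: the Gini coefficient of per-paper citation counts, where 0 indicates evenly spread citations and values near 1 indicate strong concentration. The citation counts below are invented, and the grouping by affiliation at the institution or country level is omitted.

# Sketch: Gini coefficient of citation counts, as used to quantify
# citation concentration. Input values are toy data.
import numpy as np

def gini(values):
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    ranks = np.arange(1, n + 1)
    # Standard closed form for the Gini coefficient of a sample.
    return (2 * np.sum(ranks * x)) / (n * np.sum(x)) - (n + 1) / n

preprint_citations  = [0, 0, 1, 2, 3, 40]   # highly concentrated (toy)
published_citations = [2, 3, 4, 5, 6, 10]   # more evenly spread (toy)

print(f"Gini, preprints:  {gini(preprint_citations):.2f}")   # closer to 1
print(f"Gini, published:  {gini(published_citations):.2f}")  # closer to 0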
Article
Full-text available
Importance Preprints have been widely adopted to enhance the timely dissemination of research across many scientific fields. Concerns remain that early, public access to preliminary medical research has the potential to propagate misleading or faulty research that has been conducted or interpreted in error. Objective To evaluate the concordance among study characteristics, results, and interpretations described in preprints of clinical studies posted to medRxiv that are subsequently published in peer-reviewed journals (preprint-journal article pairs). Design, Setting, and Participants This cross-sectional study assessed all preprints describing clinical studies that were initially posted to medRxiv in September 2020 and subsequently published in a peer-reviewed journal as of September 15, 2022. Main Outcomes and Measures For preprint-journal article pairs describing clinical trials, observational studies, and meta-analyses that measured health-related outcomes, the sample size, primary end points, corresponding results, and overarching conclusions were abstracted and compared. Sample size and results from primary end points were considered concordant if they had exact numerical equivalence. Results Among 1399 preprints first posted on medRxiv in September 2020, a total of 1077 (77.0%) had been published as of September 15, 2022, a median of 6 months (IQR, 3-8 months) after preprint posting. Of the 547 preprint-journal article pairs describing clinical trials, observational studies, or meta-analyses, 293 (53.6%) were related to COVID-19. Of the 535 pairs reporting sample sizes in both sources, 462 (86.4%) were concordant; 43 (58.9%) of the 73 pairs with discordant sample sizes had larger samples in the journal publication. There were 534 pairs (97.6%) with concordant and 13 pairs (2.4%) with discordant primary end points. Of the 535 pairs with numerical results for the primary end points, 434 (81.1%) had concordant primary end point results; 66 of the 101 discordant pairs (65.3%) had effect estimates that were in the same direction and were statistically consistent. Overall, 526 pairs (96.2%) had concordant study interpretations, including 82 of the 101 pairs (81.2%) with discordant primary end point results. Conclusions and Relevance Most clinical studies posted as preprints on medRxiv and subsequently published in peer-reviewed journals had concordant study characteristics, results, and final interpretations. With more than three-fourths of preprints published in journals within 24 months, these results may suggest that many preprints report findings that are consistent with the final peer-reviewed publications.
Article
Objective Availability of randomized controlled trial (RCT) protocols is essential for the interpretation of trial results and research transparency. Study Design and Setting In this study, we determined the availability of RCT protocols approved in Switzerland, Canada, Germany, and the UK in 2012. For these RCTs, we searched PubMed, Google Scholar, Scopus, and trial registries for publicly available protocols and corresponding full-text publications of results. We determined the proportion of RCTs with: (1) publicly available protocols, (2) publications citing the protocol, and (3) registries providing a link to the protocol. A multivariable logistic regression model explored factors associated with protocol availability. Results 326 RCTs were included, of which 118 (36.2%) made their protocol publicly available: 56 (47.6%, 56/118) as peer-reviewed publications and 48 (40.7%, 48/118) as supplementary material. 90.9% (100/110) of the protocols were cited in the main publication, and 55.9% (66/118) were linked in a clinical trial registry. Larger sample size (>500; OR 5.90, 95% CI 2.75-13.31) and investigator sponsorship (OR 1.99, 95% CI 1.11-3.59) were associated with increased protocol availability. Most protocols were made available shortly before the publication of the main results. Conclusion RCT protocols should be made available at an early stage of the trial.
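The odds ratios above come from a multivariable logistic regression. As a generic illustration of how such ORs and their CIs are derived (OR = exp of the model coefficient), here is a sketch on invented data; the column names are hypothetical stand-ins for the study's variables.

# Sketch: odds ratios from a multivariable logistic regression, as in
# the protocol-availability analysis above. All data are invented.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 326  # matches the number of included RCTs, purely for flavor
df = pd.DataFrame({
    "large_trial": rng.integers(0, 2, n),             # sample size > 500 (toy flag)
    "investigator_sponsored": rng.integers(0, 2, n),  # toy sponsorship flag
})
# Toy outcome loosely tied to the two predictors.
linpred = -0.8 + 0.6 * df["large_trial"] + 0.3 * df["investigator_sponsored"]
y = (rng.random(n) < 1 / (1 + np.exp(-linpred))).astype(float)

X = sm.add_constant(df)
fit = sm.Logit(y, X).fit(disp=0)
table = pd.concat([np.exp(fit.params).rename("OR"),
                   np.exp(fit.conf_int())], axis=1)  # OR = exp(coef), with 95% CI
print(table)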
Article
Full-text available
Google Scholar, a widely used academic search engine, plays a major role in finding free full-text versions of articles. But little is known about the sources of full-text files in Google Scholar. The aim of the study was to find out about the sources of full-text items and to look at subject differences in terms of number of versions, times cited, rate of open access availability, and sources of full-text files. Three queries were created for each of 277 minor subject categories of Scopus. The queries were searched in Google Scholar and the first ten hits for each query were analyzed. Citations and patents were excluded from the results and the time frame was limited to 2004–2014. Results showed that 61.1% of articles were accessible in full text in Google Scholar; 80.8% of full-text articles were publisher versions and 69.2% of full-text articles were PDFs. There was a significant difference between the mean citation counts of full-text and non-full-text items. The highest rate of full-text availability for articles belonged to the life sciences (66.9%). Publishers’ websites were the main source of bibliographic information for non-full-text articles. For full-text articles, educational (edu, ac.xx, etc.) and org domains were the top two sources of full-text files. ResearchGate was the top single website providing full-text files (10.5% of full-text articles).
Article
Full-text available
A "mega-journal" is a new type of scientific journal that publishes freely accessible articles, which have been peer reviewed for scientific trustworthiness, but leaves it to the readers to decide which articles are of interest and importance to them. In the wake of the phenomenal success of PLOS ONE, several other publishers have recently started mega-journals. This article presents the evolution of mega-journals since 2010 in terms of article publication rates. The fastest growth appears to have leveled off at around 35,000 annual articles for the 14 journals combined. Acceptance rates are in the range of 50-70%, and speed of publication is around 3-5 months. Common features of mega-journals are alternative impact metrics, easy reusability of figures and data, post-publication discussions, and portable reviews from other journals.
Article
Full-text available
Many studies in information science have looked at the growth of science. In this study, we re-examine the question of the growth of science. To do this we (i) use current data up to publication year 2012 and (ii) analyse it across all disciplines and also separately for the natural sciences and for the medical and health sciences. Furthermore, the data are analysed with an advanced statistical technique, segmented regression analysis, which can identify specific segments with similar growth rates in the history of science. The study is based on two different sets of bibliometric data: (1) the number of publications held as source items in the Web of Science (WoS, Thomson Reuters) per publication year and (2) the number of cited references in the publications of the source items per cited reference year. We have looked at the rate at which science has grown since the mid-1600s. In our analysis of cited references we identified three growth phases in the development of science, each of which led to growth rates tripling in comparison with the previous phase: from less than 1% up to the middle of the 18th century, to 2 to 3% up to the period between the two world wars, and 8 to 9% to 2012.
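Segmented regression on publication counts can be sketched as a breakpoint search over log-linear fits: because exponential growth is linear in log(count), fitting a line on each side of a candidate break year and keeping the split with the smallest residual error recovers distinct growth phases. The data below are synthetic and the single-breakpoint search is a simplification of the technique the authors apply.

# Sketch: one-breakpoint segmented regression on annual publication
# counts. Growth is fit as linear in log(count), i.e. exponential.
import numpy as np

years = np.arange(1900, 2013)
# Toy counts: ~2% annual growth, jumping to ~8% after 1945.
counts = np.where(years <= 1945,
                  100 * 1.02 ** (years - 1900),
                  100 * 1.02 ** 45 * 1.08 ** (years - 1945))
log_c = np.log(counts)

def fit_segment(x, y):
    # Ordinary least squares line; returns (sum of squared errors, slope).
    slope, intercept = np.polyfit(x, y, 1)
    return np.sum((y - (slope * x + intercept)) ** 2), slope

best = None
for brk in range(1910, 2003):            # candidate break years
    left, right = years <= brk, years > brk
    e1, s1 = fit_segment(years[left], log_c[left])
    e2, s2 = fit_segment(years[right], log_c[right])
    if best is None or e1 + e2 < best[0]:
        best = (e1 + e2, brk, s1, s2)

_, brk, s1, s2 = best
print(f"break at {brk}; growth {np.exp(s1)-1:.1%} before, {np.exp(s2)-1:.1%} after")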
Article
Full-text available
Common error in bibliographies: "Étude comparative de la distribution florale dans une portion des Alpes et des Jura".
Article
Contents Executive summary ● Scholarly communication ● The research cycle ● Types of scholarly communication ● Changes in scholarly communication system ● The journal ● What is a journal? ● The journals publishing cycle ● Sales channels and models ● Journal economics and market size ● Journal and articles numbers and trends ● Global trends in scientific output ● Authors and readers ● Publishers ● Peer review. ● Reading patterns ● Disciplinary differences ● Citations and the Impact Factor ● Costs of journal publishing ● Authors’ behaviour, perceptions and attitudes ● Publishing ethics ● Copyright and licensing ● Long term preservation ● TRANSFER code ● Researchers’ access to journals ● Open access ● Drivers of open access ● Open access business models ● Types of open access journal ● Delayed open access ● Open access via self-archiving ("Green" OA) ● Other open access variants ● SCOAP3 ● Open access to scholarly books ● Public access ● System-wide and economic perspectives ● Other developments in open access ● Transition and sustainability issues ● Effect of self-archiving on journals. ● Open access impacts on use ● New developments in scholarly communication ● “Science 2.0” or "Open Science" ● FORCE11 and “Science in Transition” ● Publishing platforms and APIs ● Social media ● Mobile access and apps ● Research data ● Semantic web and semantic enrichment ● New article formats and features. ● Text and data mining ● Reproducibility ● Big data & analytics ● Identity and disambiguation ● Research management and analytics ● FundRef ● Library publishing ● Open Annotation ● Learned societies ● Author services and tools ● Collaborative writing and sharing tools ● Open notebook science ● Conclusions ● Information sources ● Publisher organisations ● Global statistics and trends ● Open access ● Publishing industry research and analysis ● References 180pp
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.