Article

Abstract

Background: Access to sample-level metadata is important when selecting public metagenomic sequencing datasets for reuse in new biological analyses. The Standards, Precautions, and Advances in Ancient Metagenomics community (SPAAM, https://spaam-community.org) has previously published AncientMetagenomeDir, a collection of curated and standardised sample metadata tables for metagenomic and microbial genome datasets generated from ancient samples. However, while sample-level information is useful for identifying relevant samples for inclusion in new projects, Next Generation Sequencing (NGS) library construction and sequencing metadata are also essential for appropriately reprocessing ancient metagenomic data. Currently, recovering the information needed to download and prepare such data is difficult, because laboratory and bioinformatic metadata are heterogeneously recorded in prose-based publications.

Methods: Through a series of community-based hackathon events, AncientMetagenomeDir was updated to provide standardised library-level metadata for existing and new ancient metagenomic samples. In tandem, the companion tool 'AMDirT' was developed to facilitate rapid filtering and downloading of ancient metagenomic data, as well as to improve automated metadata curation and validation for AncientMetagenomeDir.

Results: AncientMetagenomeDir was extended to include standardised metadata for over 6,000 ancient metagenomic libraries. The companion tool 'AMDirT' provides both graphical and command-line interface access to this metadata for users from a wide range of computational backgrounds. We also report on metadata errors that appear to commonly occur during data upload, and provide suggestions on how the community can improve the quality of data sharing.

Conclusions: Together, standardised metadata reporting and supporting tooling will ease the incorporation and reuse of public ancient metagenomic datasets in future analyses.
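As an illustration of the kind of programmatic reuse this standardisation enables, the directory's sample tables can be filtered directly with general-purpose tools before handing accessions to AMDirT or a downloader. The sketch below uses Python and pandas; the raw-file path and the column names (sample_age, geo_loc_name, archive_accession) follow published AncientMetagenomeDir conventions but should be checked against the current release.

```python
import pandas as pd

# Raw TSV of the host-associated sample list from the AncientMetagenomeDir
# repository (path and branch are assumptions; check the current release at
# https://github.com/SPAAM-community/AncientMetagenomeDir).
URL = (
    "https://raw.githubusercontent.com/SPAAM-community/AncientMetagenomeDir/"
    "master/ancientmetagenome-hostassociated/samples/"
    "ancientmetagenome-hostassociated_samples.tsv"
)

samples = pd.read_csv(URL, sep="\t")

# Keep samples older than 2,000 years BP from a given country; column names
# follow the published AncientMetagenomeDir metadata conventions.
subset = samples[
    (samples["sample_age"] >= 2000) & (samples["geo_loc_name"] == "Denmark")
]

# The resulting accessions can then be passed to AMDirT or a downloader.
print(subset["archive_accession"].tolist())
```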

... I surveyed the genomic data deposited into public archives by 42 recent studies that sequenced ancient or historical DNA from multicellular organisms through shotgun sequencing or large-scale capture. Studies with primarily microbial or metagenomic targets were not included here, but while current practices and considerations might differ slightly for such studies [50][51][52], the fundamental archiving principles are the same whether the targeted material is genomic or metagenomic. I found widespread shortcomings in data archiving practices across the surveyed studies, and I present the results within the framework of six best-practice recommendations. ...
... Depending on DNA preservation properties and read lengths, it should not be ruled out that some datasets might contain endogenous molecules that genuinely are longer than twice the read length and thus fail merging. Some current use cases for archived data also require the paired-end structure of reads to be intact, e.g. in metagenomic de novo assembly 52. In a similar vein, while analysts might choose to trim off the last few bases of all reads to mitigate the effects of postmortem damage, this should not be applied to the version of the files that are archived, as it represents a loss of information about the underlying sequence, and future users should be able to make their own trimming decisions. ...
... libraries that have previously been shotgun sequenced, etc. Keeping track of how samples, libraries and sequencing runs relate to each other is important to enable correct reprocessing, and errors and ambiguity in experiment metadata are a common source of confusion for users of published ancient genomic data 52. A key feature of the metadata model used by the INSDC databases is that samples exist independently of studies and datasets. ...
Article
Full-text available
Ancient DNA is producing a rich record of past genetic diversity in humans and other species. However, unless the primary data is appropriately archived, its long-term value will not be fully realised. I surveyed publicly archived data from 42 recent ancient genomics studies. Half of the studies archived incomplete datasets, preventing accurate replication and representing a loss of data of potential future use. No studies met all criteria that could be considered best practice. Based on these results, I make six recommendations for data producers: (1) archive all sequencing reads, not just those that aligned to a reference genome, (2) archive read alignments too, but as secondary analysis files, (3) provide correct experiment metadata on samples, libraries and sequencing runs, (4) provide informative sample metadata, (5) archive data from low-coverage and negative experiments, and (6) document archiving choices in papers, and peer review these. Given the reliance on destructive sampling of finite material, ancient genomics studies have a particularly strong responsibility to ensure the longevity and reusability of generated data.
Preprint
Full-text available
Analysis of microbial data from archaeological samples is a rapidly growing field with great potential for understanding ancient environments, lifestyles and disease spread in the past. However, high error rates have been a long-standing challenge in ancient metagenomic analysis. This is further complicated by a limited choice of ancient-microbiome-specific computational frameworks that meet the growing computational demands of the field. Here, we propose aMeta, an accurate ancient metagenomic profiling workflow designed primarily to minimize the number of false discoveries and the computer memory requirements. Using simulated ancient metagenomic samples, we benchmark aMeta against a current state-of-the-art workflow and demonstrate its superior sensitivity and specificity in both microbial detection and authentication, as well as substantially lower usage of computer memory. aMeta is implemented as a Snakemake workflow to facilitate use and reproducibility.
Article
Full-text available
The analysis of shotgun metagenomic data provides valuable insights into microbial communities, while allowing resolution at the individual genome level. In the absence of complete reference genomes, this requires the reconstruction of metagenome-assembled genomes (MAGs) from sequencing reads. We present the nf-core/mag pipeline for metagenome assembly, binning and taxonomic classification. It can optionally combine short and long reads to increase assembly continuity and utilize sample-wise group information for co-assembly and genome binning. The pipeline is easy to install, as all dependencies are provided within containers, and it is portable and reproducible. It is written in Nextflow and developed as part of the nf-core initiative for best-practice pipeline development. All code is hosted on GitHub under the nf-core organization https://github.com/nf-core/mag and released under the MIT license.
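As a rough sketch of how such a pipeline is typically driven, the snippet below writes a minimal input samplesheet in Python. The column layout shown (sample, group, short_reads_1, short_reads_2, long_reads) reflects the nf-core/mag documentation at the time of writing, but should be verified against the docs for the pipeline version in use; the file paths are placeholders.

```python
import csv

# One record per sequencing library; empty long_reads means short-read only.
# Paths are placeholders for illustration.
runs = [
    {
        "sample": "sample1",
        "group": "0",
        "short_reads_1": "data/sample1_R1.fastq.gz",
        "short_reads_2": "data/sample1_R2.fastq.gz",
        "long_reads": "",
    },
]

with open("samplesheet.csv", "w", newline="") as handle:
    writer = csv.DictWriter(
        handle,
        fieldnames=["sample", "group", "short_reads_1", "short_reads_2", "long_reads"],
    )
    writer.writeheader()
    writer.writerows(runs)

# The sheet is then passed to the pipeline, e.g.:
#   nextflow run nf-core/mag -profile docker --input samplesheet.csv --outdir results
```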
Article
Full-text available
The broadening utilisation of ancient DNA to address archaeological, palaeontological, and biological questions is resulting in a rising diversity in the size of laboratories and the scale of analyses being performed. In the context of this heterogeneous landscape, we present an advanced, entirely redesigned and extended version of the EAGER pipeline for the analysis of ancient genomic data. This Nextflow pipeline aims to address three main themes: accessibility and adaptability to different computing configurations, reproducibility to ensure robust analytical standards, and updating the pipeline to the latest routine ancient genomic practices. The new version of EAGER has been developed within the nf-core initiative to ensure high-quality software development and maintenance support, contributing to a long-term life-cycle for the pipeline. nf-core/eager will assist in ensuring that a wider range of ancient DNA analyses can be applied by a diverse range of research groups and fields.
Article
Full-text available
Ancient DNA and RNA are valuable data sources for a wide range of disciplines. Within the field of ancient metagenomics, the number of published genetic datasets has risen dramatically in recent years, and tracking this data for reuse is particularly important for large-scale ecological and evolutionary studies of individual taxa and communities of both microbes and eukaryotes. AncientMetagenomeDir (archived at https://doi.org/10.5281/zenodo.3980833) is a collection of annotated metagenomic sample lists derived from published studies that provide basic, standardised metadata and accession numbers to allow rapid data retrieval from online repositories. These tables are community-curated and span multiple sub-disciplines to ensure adequate breadth and consensus in metadata definitions, as well as longevity of the database. Internal guidelines and automated checks facilitate compatibility with established sequence-read archives and term-ontologies, and ensure consistency and interoperability for future meta-analyses. This collection will also assist in standardising metadata reporting for future ancient metagenomic studies.

Measurement(s): genome • metagenome • metadata • ancient DNA. Technology Type(s): digital curation. Factor Type(s): geographic location • sample age. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13241537
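A minimal sketch of what such automated checks can look like, assuming row-wise validation of a metadata TSV against a JSON Schema (the schema below is a toy illustration, not the project's actual schema):

```python
import pandas as pd
from jsonschema import Draft7Validator

# Toy schema: three mandatory string columns, with a loose pattern for
# INSDC sample accessions (ERS/SRS/DRS prefixes).
schema = {
    "type": "object",
    "required": ["project_name", "sample_name", "archive_accession"],
    "properties": {
        "project_name": {"type": "string"},
        "sample_name": {"type": "string"},
        "archive_accession": {"type": "string", "pattern": "^[EDS]RS"},
    },
}

validator = Draft7Validator(schema)
table = pd.read_csv("samples.tsv", sep="\t")

# Report every schema violation per row rather than stopping at the first.
for index, row in table.iterrows():
    for error in validator.iter_errors(row.dropna().to_dict()):
        print(f"row {index}: {error.message}")
```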
Article
Full-text available
The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena), provided by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), has for almost forty years continued in its mission to freely archive and present the world's public sequencing data for the benefit of the entire scientific community and for the acceleration of the global research effort. Here we highlight the major developments to ENA services and content in 2020, focussing in particular on the recently released updated ENA browser, modernisation of our release process and our data coordination collaborations with specific research communities.
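For programmatic access, ENA's Portal API allows run-level metadata to be retrieved with a simple HTTP query. A minimal sketch in Python; the study accession is a placeholder, and the field list should be adjusted to the metadata of interest:

```python
import requests

# Query the ENA Portal API for run-level metadata of one study accession.
params = {
    "result": "read_run",
    "query": 'study_accession="PRJEB30331"',  # placeholder accession
    "fields": "run_accession,sample_accession,library_strategy,fastq_ftp",
    "format": "tsv",
}
response = requests.get("https://www.ebi.ac.uk/ena/portal/api/search", params=params)
response.raise_for_status()

# Tab-separated table: one row per sequencing run, with FASTQ download URLs.
print(response.text)
```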
Article
Full-text available
The detailed know-how to implement research protocols frequently remains restricted to the research group that developed the method or technology. This knowledge often exists at a level that is too detailed for inclusion in the methods section of scientific articles. Consequently, methods are not easily reproduced, leading to a loss of time and effort by other researchers. The challenge is to develop a method-centered collaborative platform to connect with fellow researchers and discover state-of-the-art knowledge. Protocols.io is an open-access platform for detailing, sharing, and discussing molecular and computational protocols that can be useful before, during, and after publication of research results.
Article
Full-text available
There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.
Article
Full-text available
This study analyzes data sharing regarding mitochondrial, Y chromosomal and autosomal polymorphisms in a total of 162 papers on ancient human DNA published between 1988 and 2013. The estimated sharing rate was not far from totality (97.6% ± 2.1%) and substantially higher than observed in other fields of genetic research (evolutionary, medical and forensic genetics). Both a questionnaire-based survey and the examination of journals' editorial policies suggest that this high sharing rate cannot be simply explained by the need to comply with stakeholders' requests. Most data were made available through body text, but the use of primary databases increased in coincidence with the introduction of complete mitochondrial and next-generation sequencing methods. Our study highlights three important aspects. First, our results imply that researchers' awareness of the importance of openness and transparency for scientific progress may complement stakeholders' policies in achieving very high sharing rates. Second, widespread data sharing does not necessarily coincide with a prevalent use of practices which maximize data findability, accessibility, usability and preservation. A detailed look at the different ways in which data are released can be very useful to detect failures to adopt the best sharing modalities and understand how to correct them. Third and finally, the case of human paleogenetics tells us that a widespread awareness of the importance of Open Science may be important to build reliable scientific practices even in the presence of complex experimental challenges.
Article
Full-text available
Next-generation sequencing technologies have revolutionized the field of paleogenomics, allowing the reconstruction of complete ancient genomes and their comparison with modern references. However, this requires the processing of vast amounts of data and involves a large number of steps that use a variety of computational tools. Here we present PALEOMIX (http://geogenetics.ku.dk/publications/paleomix), a flexible and user-friendly pipeline applicable to both modern and ancient genomes, which largely automates the in silico analyses behind whole-genome resequencing. Starting with next-generation sequencing reads, PALEOMIX carries out adapter removal, mapping against reference genomes, PCR duplicate removal, characterization of and compensation for postmortem damage, SNP calling and maximum-likelihood phylogenomic inference, and it profiles the metagenomic contents of the samples. As such, PALEOMIX allows for a series of potential applications in paleogenomics, comparative genomics and metagenomics. Applying the PALEOMIX pipeline to the three ancient and seven modern Phytophthora infestans genomes as described here takes 5 d using a 16-core server.
Article
Full-text available
Under favorable conditions DNA can survive for thousands of years in the remains of dead organisms. The DNA extracted from such remains is invariably degraded to a small average size by processes that at least partly involve depurination. It also contains large amounts of deaminated cytosine residues that are accumulated toward the ends of the molecules, as well as several other lesions that are less well characterized.
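The positional accumulation of deaminated cytosines is exactly what damage-profiling tools quantify: C→T mismatch frequencies rise towards the 5' ends of reads. Below is a minimal sketch of that tally using pysam, assuming a coordinate-sorted, indexed BAM with MD tags present; dedicated tools such as mapDamage or DamageProfiler implement this properly.

```python
import pysam

WINDOW = 25  # positions from the 5' end to profile
ct_counts = [0] * WINDOW
c_totals = [0] * WINDOW

with pysam.AlignmentFile("ancient.bam", "rb") as bam:
    for read in bam.fetch():
        if read.is_unmapped or read.is_reverse:
            continue  # keep the sketch simple: forward-strand reads only
        seq = read.query_sequence
        # with_seq=True yields (query_pos, ref_pos, ref_base) triples; the
        # reference base is lowercase at mismatches when the MD tag exists.
        for qpos, rpos, ref_base in read.get_aligned_pairs(with_seq=True):
            if qpos is None or rpos is None or qpos >= WINDOW:
                continue
            if ref_base.upper() == "C":
                c_totals[qpos] += 1
                if seq[qpos] == "T":
                    ct_counts[qpos] += 1

# In damaged ancient libraries this frequency is elevated at position 1
# and decays inwards ("smiley plot" pattern).
for pos, (ct, total) in enumerate(zip(ct_counts, c_totals)):
    freq = ct / total if total else 0.0
    print(f"5' position {pos + 1}: C->T frequency {freq:.3f}")
```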
Article
Full-text available
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences: the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The 'environmental packages' apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
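In practice, checklist compliance can be screened with very little code. Below is a minimal completeness check against a MIxS-style list of mandatory terms; the field names are illustrative stand-ins drawn from MIxS core terms, not the full official checklist, and the record values are invented.

```python
# Illustrative subset of MIxS-style mandatory terms.
MIXS_REQUIRED = [
    "investigation_type", "project_name", "lat_lon", "geo_loc_name",
    "collection_date", "env_broad_scale", "env_local_scale", "env_medium",
]

# An example metadata record (placeholder values) missing some terms.
record = {
    "investigation_type": "mimarks-survey",
    "project_name": "ancient dental calculus survey",
    "lat_lon": "55.6761 N 12.5683 E",
    "geo_loc_name": "Denmark: Copenhagen",
    "collection_date": "2019-06-11",
}

# Flag required terms that are absent or empty.
missing = [field for field in MIXS_REQUIRED if not record.get(field)]
print("missing MIxS fields:", missing or "none")
```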
Article
Although the first ancient DNA molecules were extracted more than three decades ago, the first ancient nuclear genomes could only be characterized after high-throughput sequencing was invented. Genome-scale data have now been gathered from thousands of ancient archaeological specimens, and the number of ancient biological tissues amenable to genome sequencing is growing steadily. Ancient DNA fragments are typically ultrashort molecules and carry extensive amounts of chemical damage accumulated after death. Their extraction, manipulation and authentication require specific experimental wet-laboratory and dry-laboratory procedures before patterns of genetic variation from past individuals, populations and species can be interpreted. Ancient DNA data help to address an entire array of questions in anthropology, evolutionary biology and the environmental and archaeological sciences. The data have revealed a considerably more dynamic past than previously appreciated and have revolutionized our understanding of many major prehistoric and historic events. This Primer provides an overview of concepts and state-of-the-art methods underlying ancient DNA analysis and illustrates the diversity of resulting applications. The article also addresses some of the ethical challenges associated with the destructive analysis of irreplaceable material, emphasizes the need to fully involve archaeologists and stakeholders as part of the research design and analytical process, and discusses future perspectives.
Article
Current conventions for reporting radiocarbon determinations do not cover the reporting of calibrated dates. This article proposes revised conventions that have been endorsed by many ¹⁴C scientists. For every determination included in a scientific paper, the following should apply: (1) the laboratory measurement should be reported as a conventional radiocarbon age in ¹⁴C yr BP or as a fractionation-corrected fraction modern (F¹⁴C) value; (2) the laboratory code for the determination should be included; and (3) the sample material dated, the pretreatment method applied, and quality control measurements should be reported. In addition, for every calibrated determination or modeled date, the following should be reported: (4) the calibration curve and any reservoir offset used; (5) the software used for calibration, including version number, the options and/or models used, and wherever possible a citation of a published description of the software; and (6) the calibrated date given as a range (or ranges) with an associated probability on a clearly identifiable calendar timescale.
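One way to make these six reporting elements hard to omit is to collect them in a single structured record before writing up a date. A minimal sketch in Python; all values shown are invented placeholders.

```python
from dataclasses import dataclass

@dataclass
class RadiocarbonDate:
    lab_code: str            # (2) laboratory code
    c14_age_bp: int          # (1) conventional radiocarbon age, 14C yr BP
    c14_error: int
    material: str            # (3) sample material dated
    pretreatment: str        # (3) pretreatment method applied
    calibration_curve: str   # (4) curve and any reservoir offset
    software: str            # (5) calibration software and version
    calibrated_range: str    # (6) range with associated probability

    def report(self) -> str:
        # Assemble one reporting string covering all six elements.
        return (
            f"{self.lab_code}: {self.c14_age_bp} ± {self.c14_error} 14C yr BP "
            f"({self.material}, {self.pretreatment}); calibrated with "
            f"{self.software} using {self.calibration_curve}: "
            f"{self.calibrated_range}"
        )

date = RadiocarbonDate(
    lab_code="Lab-0000", c14_age_bp=4520, c14_error=30,
    material="charcoal", pretreatment="ABA",
    calibration_curve="IntCal20", software="OxCal v4.4",
    calibrated_range="3360-3100 cal BC (95.4%)",
)
print(date.report())
```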
Ewels P, Duncan A, Fellows Yates JA: ewels/sra-explorer: Version 1.0. March 2023. Reference Source

Gálvez-Merchán Á, Min KHJ, Pachter L, et al.: ffq: A tool to find sequencing data and metadata from public databases. 2022. Reference Source

Fellows Yates JA, Andrades Valtueña A, Vågene ÅJ, et al.: SPAAM-community/AncientMetagenomeDir: v22.09.2. August 2022. Reference Source

Fellows Yates JA, et al.: SPAAM-community/AncientMetagenomeDir: v23.03.0: Rocky necropolis of Pantalica.

Fonseca P: streamlit-aggrid: Implementation of Ag-Grid component for Streamlit.

Grüning B, Dale R, Sjödin A, et al.: Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods. July 2018; 15(7): 475-476.

pandas-dev/pandas: Pandas.