Scheme of MMD double sampling and the multinomial bottleneck. From top to bottom: p contains the true relative frequencies in the sample and is the estimation target. Features are sampled with a multinomial distribution P[m|p, M]. The counts in m are not observed. A replication process, e.g. PCR, is then applied, producing exponential growth of the features with m_i ≠ 0. A new multinomial sampling is carried out with probabilities q = m/M, which holds only if equal rates of replication are assumed. Features that were not previously sampled, m_i = 0, cannot be resampled in this second multinomial sampling. The counts obtained in this second sampling are observed and constitute the data. The procedure ends by estimating p with some estimator p̂.
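The two-stage scheme in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulation from the paper's Appendix D: the function name `simulate_mmd`, the second-stage sample size `N`, the example values of `p`, `M` and `N`, and the plug-in estimator n/N are all hypothetical choices made here.

```python
import numpy as np

def simulate_mmd(p, M, N, seed=None):
    """One draw from the double multinomial scheme (illustrative sketch).

    p : true relative frequencies (estimation target)
    M : size of the first, unobserved multinomial sample
    N : size of the second, observed multinomial sample (assumed name)
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    m = rng.multinomial(M, p)   # first sampling; counts m are not observed
    q = m / M                   # equal replication rates assumed, so q = m/M
    n = rng.multinomial(N, q)   # second sampling; these counts are the data
    return m, n

# Hypothetical example: a rare feature and a small bottleneck M
p = np.array([0.6, 0.3, 0.09, 0.01])
m, n = simulate_mmd(p, M=20, N=1000, seed=0)

# Naive plug-in estimator of p from the observed counts (an assumption here)
p_hat = n / n.sum()

# Features lost at the bottleneck (m_i = 0) can never reappear in the data
lost = m == 0
print("unobserved m:", m)
print("observed n:  ", n)
print("features lost at the bottleneck:", np.flatnonzero(lost))
```

With a small M, rare features are often dropped in the first sampling, and no second-stage depth N can recover them; this is the bottleneck the caption refers to.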

Source publication
Article
download: https://academic.oup.com/nargab/article/2/4/lqaa094/5996081?login=true Measurements in sequencing studies are mostly based on counts. There is a lack of theoretical developments for the analysis and modelling of this type of data. Some thoughts in this direction are presented, which might serve as a seed. The main issues addressed are th...

Contexts in source publication

Context 1
... μ (n) (p) are the ordinary moments of a multinomial with probabilities p. The orders of the moments are in n. Figure 1 is a scheme of the MMD generation. The simulation of the MMD in (5) is easily carried out (see Appendix D). ...
Context 2
... characteristics are reproduced in an MMD simulation, thus suggesting that something similar occurs in practice. Figure B1 in Appendix B shows some of these characteristics. ...
Context 3
... probability function is easily computable, as the two factors can be computed from standard numerical functions. Figure A1 shows the effect of considering the number of trials in a binomial (multinomial of two categories) as random and Poisson distributed (green crosses) with mean equal to 50. The blue (red) curves correspond to a probability of the binomial equal to 0.1 (0.4). ...
Context 4
... overdispersion observed in sequencing data cannot be explained with the variability of the number of trials in the sample. Note that, in order to obtain the pd in Figure A1, the pd in Equation (A3) needs to be marginalized again to obtain the distribution of a single category. This consists of an integration over a discrete simplex. ...
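The setting of Contexts 3 and 4 (a binomial whose number of trials is Poisson distributed with mean 50, with success probability 0.1 or 0.4) can be checked numerically. The sketch below is not the paper's Equation (A3); it simply marginalizes over the number of trials by direct summation and compares the variance with the fixed-trials binomial. All function names and the truncation point `n_max` are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    # P[X = k] for X ~ Binomial(n, p); zero when k > n
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(n, lam):
    # Evaluated on the log scale to avoid overflow for large n
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def mixed_pmf(k, lam, p, n_max=300):
    # P[X = k] when N ~ Poisson(lam) and X | N ~ Binomial(N, p);
    # the sum over N is truncated at n_max (assumed large enough)
    return sum(poisson_pmf(n, lam) * binom_pmf(k, n, p)
               for n in range(k, n_max + 1))

def mean_var(probs):
    # Mean and variance of a distribution on 0, 1, 2, ...
    mean = sum(k * pk for k, pk in enumerate(probs))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
    return mean, var

lam = 50  # mean number of trials, as in the Figure A1 description
results = {}
for p in (0.1, 0.4):
    fixed = [binom_pmf(k, lam, p) for k in range(lam + 1)]
    mixed = [mixed_pmf(k, lam, p) for k in range(151)]
    results[p] = (mean_var(fixed), mean_var(mixed))
    print(p, "fixed trials:", results[p][0], "Poisson trials:", results[p][1])
```

By the standard Poisson thinning result, the mixed distribution is Poisson with mean λp, so its variance is λp instead of λp(1 − p). The inflation factor 1/(1 − p) is modest for small p, which is in line with the statement in Context 4 that variability in the number of trials alone does not account for the overdispersion seen in sequencing data.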

Citations

... [15] showed a similar phenomenon in logarithmic transformations in single-cell sequencing; varying read depths among cells led to systematic errors and spurious differences in expression. Relatedly, [6] presented novel modelling techniques in light of these issues in count data. ...
Preprint
Motivation: Compositional data comprise vectors that describe the constituent parts of a whole. Data arising from various -omics platforms such as 16S and RNA-sequencing are compositional in nature. However, correlations between features on raw counts have no meaningful interpretation. Metrics of proportionality were formulated to address this problem. However, there is an inherent bias that arises when calculating these metrics empirically on count-based measures due to variability in read depths. Results: We quantify the bias introduced by empirically calculating proportionality-based association metrics in count data. Additionally, we propose a means of estimating these metrics within a logit-normal multinomial model in pursuit of more accurate estimates. The model-based estimates are shown to outperform empirical estimates in simulated data, and are additionally applied to a mouse embryonic stem-cell single-cell sequencing dataset as well as a pediatric-onset multiple sclerosis metagenomic dataset. Availability and Implementation: An R package is available at https://CRAN.R-project.org/package=countprop. Supplementary information: Supplementary data are available at Bioinformatics online.
... In particular, the (largely unrealized) power of metabarcoding applications lies in the ability to obtain reliable quantitative estimates of underlying communities [17][18][19]. In the case of metabarcoding and similar amplicon-based studies [20], it has become clear that 1) observations are non-linearly related to the underlying biology of interest [21,22], and 2) those observations are noisy, with many having relatively high variances as a function of expected values [19,[23][24][25]. To obtain reliable quantitative estimates for any set of observations, we must be able to distinguish random variation from real signal. ...
... In response, recent mechanistic frameworks have begun to address the discrepancies between observed metabarcoding sequence counts and true underlying biological patterns by modeling the compounding processes that occur between DNA extraction and sequence observation [32][33][34][35][36]. These processes include DNA extraction, PCR, and multiple subsampling steps prior to sequencing [24,25,35,37,38]. We model the collection process after Shelton et al. [37] (Fig 1). ...
... We specifically build on the metabarcoding framework proposed in Shelton et al. [35], in which species-specific amplification efficiencies strongly influence observed sequence proportions, and additionally explore the effects of subsampling rare molecules prior to PCR amplification on patterns of sequence counts and non-detections. Previous work has explored the effects of amplification efficiencies and subsampling on observed metabarcoding results individually [24,25,33,34], but no study to date has modeled both processes simultaneously to explore their interactive and relative effects on observed sequence data. ...
Article
Metabarcoding is a powerful molecular tool for simultaneously surveying hundreds to thousands of species from a single sample, underpinning microbiome and environmental DNA (eDNA) methods. Deriving quantitative estimates of underlying biological communities from metabarcoding is critical for enhancing the utility of such approaches for health and conservation. Recent work has demonstrated that correcting for amplification biases in genetic metabarcoding data can yield quantitative estimates of template DNA concentrations. However, a major source of uncertainty in metabarcoding data stems from non-detections across technical PCR replicates where one replicate fails to detect a species observed in other replicates. Such non-detections are a special case of variability among technical replicates in metabarcoding data. While many sampling and amplification processes underlie observed variation in metabarcoding data, understanding the causes of non-detections is an important step in distinguishing signal from noise in metabarcoding studies. Here, we use both simulated and empirical data to 1) suggest how non-detections may arise in metabarcoding data, 2) outline steps to recognize uninformative data in practice, and 3) identify the conditions under which amplicon sequence data can reliably detect underlying biological signals. We show with both simulations and empirical data that, for a given species, the rate of non-detections among technical replicates is a function of both the template DNA concentration and species-specific amplification efficiency. Consequently, we conclude metabarcoding datasets are strongly affected by (1) deterministic amplification biases during PCR and (2) stochastic sampling of amplicons during sequencing (both of which we can model), but also by (3) stochastic sampling of rare molecules prior to PCR, which remains a frontier for quantitative metabarcoding. Our results highlight the importance of estimating species-specific amplification efficiencies and critically evaluating patterns of non-detection in metabarcoding datasets to better distinguish environmental signal from the noise inherent in molecular detections of rare targets.
... To this end, recent work has begun to characterize the effects of additional mechanistic processes prior to amplification, particularly the effect of subsampling processes on observed read abundances and non-detections (Egozcue et al., 2020;Gold, Shelton, et al., 2022). ...
Article
Marine heatwaves can drive large-scale shifts in marine ecosystems, but studying their impacts on whole species assemblages is difficult. Analysis combining microscopic observations with environmental DNA (eDNA) metabarcoding of the ethanol preservative of an ichthyoplankton biorepository spanning a 23-year time series captures major and sometimes unexpected changes to fish assemblages in the California Current Large Marine Ecosystem during and after the 2014-2016 Pacific Marine Heatwave. Joint modeling efforts reveal patterns of tropicalization with increases in southern, mesopelagic species and associated declines in commercially important temperate fish species (e.g., North Pacific Hake [Merluccius productus] and Pacific Sardine [Sardinops sagax]). Data show shifts in fisheries assemblages (e.g., Northern Anchovy, Engraulis mordax) even after the return to average water temperatures, corroborating ecosystem impacts found through multiple traditional surveys of this study area. Our innovative approach of metabarcoding preservative eDNA coupled with quantitative modeling leverages the taxonomic breadth and resolution of DNA sequences combined with microscopy-derived ichthyoplankton identification to yield higher-resolution, species-specific quantitative abundance estimates. This work opens the door to economically reconstruct the historical dynamics of assemblages from modern and archived samples worldwide. Keywords: amplicon sequencing, CalCOFI, California Current Ecosystem, eDNA, ichthyoplankton, joint model, marine heatwave, quantitative metabarcoding
... The statistical analysis of infant 16S sequencing datasets is challenging due to a combination of data properties. These data are extremely sparse, they have a high dispersion and they are compositional (4–7). Sparsity and dispersion are the result of heterogeneity among experimental units (i.e. infants and technical variation). ...
Article
Differential abundance analysis of infant 16S microbial sequencing data is complicated by challenging data properties, including high sparsity, extreme dispersion and the relative nature of the information contained within the data. In this study, we propose a pairwise ratio analysis that uses the compositional data analysis principle of subcompositional coherence and merges it with a beta-binomial regression model. The resulting method provides a flexible and easily interpretable approach to infant 16S sequencing data differential abundance analysis that does not require zero imputation. We evaluate the proposed method using infant 16S data from clinical trials and demonstrate that the proposed method has the power to detect differences, and demonstrate how its results can be used to gain insights. We further evaluate the method using data-inspired simulations and compare its power against related methods. Our results indicate that power is high for pairwise differential abundance analysis of taxon pairs that have a large abundance. In contrast, results for sparse taxon pairs show a decrease in power and substantial variability in method performance. While our method shows promising performance on well-measured subcompositions, we advise strong filtering steps in order to avoid excessive numbers of underpowered comparisons in practical applications.
... Relative abundances are by nature the estimates of the multinomial probabilities for the OTU counts. These have compositional characteristics [19,21] and must be log-transformed to resolve the constant sum constraint [22]. Moreover, distributions of relative abundances are often highly skewed (over-dispersed) with a lot of zero values [23]. ...
Article
Feeding chicken with black soldier fly larvae (BSF) may influence their rates of growth via effects on the composition of their gut microbiota. To verify this hypothesis, we aim to evaluate a probabilistic structural equation model because it can unravel the complex web of relationships that exist between the bacteria involved in digestion and evaluate whether these influence bird growth. We followed 90 chickens fed diets supplemented with 0%, 5% or 10% BSF and measured the strength of the relationship between their weight and the relative abundance of bacteria (OTU) present in their cecum or cloaca at 16, 28, 39, 67 or 73 days of age, while adjusting for potential confounding effects of their age and sex. Results showed that OTUs (62 genera) could be combined into ten latent constructs with distinctive metabolic attributes. Links were discovered between these constructs that suggest nutritional relationships. Age directly influenced weights and microbiotal composition, and three constructs indirectly influenced weights via their dependencies on age. The proposed methodology was able to simplify dependencies among OTUs into knowledgeable constructs and to highlight links potentially important to understand the role of insect feed and of microbiota in chicken growth.
... We characterize the relative abundance of different Synechococcus oligotypes through time and analyse patterns with compositional data analysis techniques. It is increasingly recognized that high-throughput sequence data are compositional in nature (Gloor et al., 2016;Egozcue et al., 2020), and that analysis of this data type requires appropriate tools that take into consideration the distinct challenges of data belonging to a constrained subset of real space (Aitchison, 1986). Common methods of analysis for sequence data, if they do not account for the sample space, can lead to misleading interpretations and errors (Gloor et al., 2016;Chong and Spencer, 2018). ...
Article
Marine microbes often show a high degree of physiological or ecological diversity below the species level. This microdiversity raises questions about the processes that drive diversification and permit coexistence of diverse yet closely-related marine microbes, especially given the theoretical efficiency of competitive exclusion. Here, we provide insight with an 8-year time series of diversity within Synechococcus, a widespread and important marine picophytoplankter. The population of Synechococcus on the Northeast U.S. Shelf is comprised of six main types, each of which displays a distinct and consistent seasonal pattern. With compositional data analysis, we show that these patterns can be reproduced with a simple model that couples differential responses to temperature and light with the seasonal cycle of the physical environment. These observations support the hypothesis that temporal variability in environmental factors can maintain microdiversity in marine microbial populations. We also identify how seasonal diversity patterns directly determine overarching Synechococcus population abundance features.
... There are two articles that explore the implications of this. Egozcue et al. (23) revisit the distributional modeling of count compositions. While providing a short review of current approaches, they also make a proposal for a new class of distributions with interesting properties. ...
Article
Surf zones are highly dynamic marine ecosystems that are subject to increasing anthropogenic and climatic pressures, posing multiple challenges for biomonitoring. Traditional methods such as seines and hook and line surveys are often labor intensive, taxonomically biased, and can be physically hazardous. Emerging techniques, such as baited remote underwater video (BRUV) and environmental DNA (eDNA) are promising nondestructive tools for assessing marine biodiversity in surf zones of sandy beaches. Here we compare the relative performance of beach seines, BRUV, and eDNA in characterizing community composition of bony (teleost) and cartilaginous (elasmobranch) fishes of surf zones at 18 open coast sandy beaches in southern California. Seine and BRUV surveys captured overlapping, but distinct fish communities with 50% (18/36) of detected species shared. BRUV surveys more frequently detected larger species (e.g. sharks and rays) while seines more frequently detected one of the most abundant species, barred surfperch (Amphistichus argenteus). In contrast, eDNA metabarcoding captured 88.9% (32/36) of all fishes observed in seine and BRUV surveys plus 57 additional species, including 15 that frequent surf zone habitats. On average, eDNA detected over 5 times more species than BRUVs and 8 times more species than seine surveys at a given site. eDNA approaches also showed significantly higher sensitivity than seine and BRUV methods and more consistently detected 31 of the 32 (96.9%) jointly observed species across beaches. The four species detected by BRUV/seines, but not eDNA were only resolved at higher taxonomic ranks (e.g. Embiotocidae surfperches and Syngnathidae pipefishes). Infrequent co-detection of species between methods limited comparisons of richness and abundance estimates, highlighting the challenge of comparing biomonitoring approaches. Despite potential for improvement, results overall demonstrate that eDNA can provide a cost-effective tool for long-term surf zone monitoring that complements data from seine and BRUV surveys, allowing more comprehensive surveys of vertebrate diversity in surf zone habitats.
Article
Environmental DNA (eDNA) metabarcoding is a powerful tool that can enhance marine ecosystem/biodiversity monitoring programs. Here we outline five important steps managers and researchers should consider when developing an eDNA monitoring program: (1) select genes and primers to target taxa; (2) assemble or develop comprehensive barcode reference databases; (3) apply rigorous site occupancy based decontamination pipelines; (4) conduct pilot studies to define spatial and temporal variance of eDNA; and (5) archive samples, extracts, and raw sequence data. We demonstrate the importance of each of these considerations using a case study of eDNA metabarcoding in the Ports of Los Angeles and Long Beach. eDNA metabarcoding approaches detected 94.1% (16/17) of species observed in paired trawl surveys while identifying an additional 55 native fishes, providing more comprehensive biodiversity inventories. Rigorous benchmarking of eDNA metabarcoding results improved ecological interpretation and confidence in species detections while providing archived genetic resources for future analyses. Well designed and validated eDNA metabarcoding approaches are ideally suited for biomonitoring applications that rely on the detection of species, including mapping invasive species fronts and endangered species habitats as well as tracking range shifts in response to climate change. Incorporating these considerations will enhance the utility and efficacy of eDNA metabarcoding for routine biomonitoring applications.
Article
Amplicon-sequence data from environmental DNA (eDNA) and microbiome studies provide important information for ecology, conservation, management, and health. At present, amplicon-sequencing studies (also known as metabarcoding studies, in which the primary data consist of targeted, amplified fragments of DNA sequenced from many taxa in a mixture) struggle to link genetic observations to underlying biology in a quantitative way, but many applications require quantitative information about the taxa or systems under scrutiny. As metabarcoding studies proliferate in ecology, it becomes more important to develop ways to make them quantitative to ensure that their conclusions are adequately supported. Here we link previously disparate sets of techniques for making such data quantitative, showing that the underlying PCR mechanism explains observed patterns of amplicon data in a general way. By modeling the process through which amplicon-sequence data arise, rather than transforming the data post-hoc, we show how to estimate the starting DNA proportions from a mixture of many taxa. We illustrate how to calibrate the model using mock communities and apply the approach to simulated data and a series of empirical examples. Our approach opens the door to improve the use of metabarcoding data in a wide range of applications in ecology, public health, and related fields.