Monitoring the microbiome for food safety and quality using deep shotgun sequencing

Running Title: Monitoring the microbiome for food safety and quality
Kristen L. Beck1,8*+, Niina Haiminen2,8+, David Chambliss1,8, Stefan Edlund1,8, Mark Kunitomi1,8, B. Carol Huang3,8, Nguyet Kong3,8, Balasubramanian Ganesan4,5,8, Robert Baker4,8, Peter Markwell4,8, Ban Kawas1,8, Matthew Davis1,8, Robert J. Prill1,8, Harsha Krishnareddy1,8, Ed Seabolt1,8, Carl H. Marlowe6,8, Sophie Pierre7,8, André Quintanar7,8, Laxmi Parida2,8, Geraud Dubois1,8, James Kaufman1,8, and Bart C. Weimer3,8*

*Co-corresponding authors
+Contributed equally
Contact information: Kristen L. Beck, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA, klbeck@us.ibm.com, +1 408-927-1963
AUTHOR AFFILIATIONS:
1 IBM Almaden Research Center, San Jose, CA
2 IBM T.J. Watson Research Center, Yorktown Heights, NY
3 University of California Davis, School of Veterinary Medicine, 100K Pathogen Genome Project, Davis, CA 95616
4 Mars Global Food Safety Center, Beijing, China
5 Wisdom Health, A Division of Mars Petcare, Vancouver, WA
6 Bio-Rad Laboratories, Hercules, CA
7 Bio-Rad, Food Science Division, Marnes-la-Coquette, France
8 Consortium for Sequencing the Food Supply Chain, San Jose, CA

bioRxiv preprint doi: https://doi.org/10.1101/2020.05.18.102574; this version posted May 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

ABSTRACT:
In this work, we hypothesized that shifts in the food microbiome can be used as an indicator of unexpected contaminants or environmental changes. To test this hypothesis, we sequenced total RNA of 31 high protein powder (HPP) samples of poultry meal pet food ingredients. We developed a microbiome analysis pipeline employing a key eukaryotic matrix filtering step that improved microbe detection specificity to >99.96% during in silico validation. The pipeline identified 119 microbial genera per HPP sample on average, with 65 genera present in all samples. The most abundant of these were Bacteroides, Clostridium, Lactococcus, Aeromonas, and Citrobacter. We also observed shifts in the microbial community corresponding to ingredient composition differences. When comparing culture-based results for Salmonella with total RNA sequencing, we found that Salmonella growth did not correlate with the results of multiple sequence analyses. We conclude that microbiome sequencing is useful to characterize complex food microbial communities, while additional work is required for predicting specific species' viability from total RNA sequencing.

KEYWORDS:
microbiome, food safety, bioinformatics, shotgun sequencing, microbial ecology, pathogens

1. INTRODUCTION:
Sequencing the microbiome of food may reveal characteristics of the associated microbial content that culturing or targeted whole genome sequencing alone cannot. However, to meet the various needs of food safety and quality, next generation sequencing (NGS) and analysis techniques require additional development1 with specific consideration for accuracy, speed, and applicability across the supply chain.2 Microbial communities and their characteristics have been studied in relation to flavor and quality in fermented foods,3–5 agricultural processes in grape6 and apple fruit,7 and manufacturing processes and production batches in Cheddar cheese.8 However, the advantage of using the microbiome specifically for food safety and quality has yet to be demonstrated.
Currently, food safety regulatory agencies including the Food and Drug Administration (FDA), Centers for Disease Control and Prevention (CDC), United States Department of Agriculture (USDA), and European Food Safety Authority (EFSA) are converging on the use of whole genome sequencing (WGS) for pathogen detection and outbreak investigation. Large scale WGS of food-associated bacteria was first initiated via the 100K Pathogen Genome Project9 with the goal of expanding the diversity of bacterial reference genomes, a crucial need for foodborne illness outbreak investigation, traceability, and microbiome studies.10,11 However, since WGS relies on culturing a microbial isolate prior to sequencing, there are inherent biases and limitations in its ability to describe the microorganisms and their interactions in a food sample. Such information would be very valuable for food safety and quality applications.
High throughput sequencing of total DNA and total RNA are promising approaches to characterize microbial niches in their native state without introducing bias due to culturing.12–14 Additionally, total RNA sequencing has the potential to provide evidence of live and biologically active components of the sample.14,15 It also provides accurate microbial naming, relative microbial abundance, and better reproducibility than total DNA or amplicon sequencing.14 Total RNA sequencing minimizes PCR amplification bias that occurs in single gene amplicon sequencing and overcomes the decreased detection sensitivity from using DNA sequencing in metagenomics.14 Total RNA metatranscriptome sequencing, however, has yet to be examined in raw food ingredients as a method to provide robust characterization of the microbial communities and the interacting population dynamics.
From a single sequenced food microbiome, numerous dimensions of the sample can be characterized that may yield important indicators of safety and quality. Using total DNA or RNA, evidence for the eukaryotic food matrix can be examined. In Haiminen et al.,16 we quantitatively demonstrated the utility of metagenome sequencing to authenticate the composition of complex food matrices. In addition, from total DNA or RNA, one can observe signatures from commensal microbes, pathogenic microbes, and genetic information for functional potential (from DNA) or biologically active function (from RNA).14,15 Detecting active transcription from live microbes in food is very important to avoid spurious microbial observations that may instead be false positives due to quiescent DNA in the sample. Use of RNA in food analytics also offers the opportunity to examine expression of metabolic processes that are related to antibiotic resistance,17,18 virulence factors, or replication genes, among others. Additionally, it has the potential to define viable microbes that are capable of replication in the food and even microorganisms that stop replicating but continue to produce metabolic activity that changes food quality and safety.19–24
Microorganisms are sensitive to changes in temperature, salinity, pH, oxygen content, and many other physicochemical factors that alter their ability to grow, persist, and cause disease. They exist in dynamic communities that change in response to environmental perturbation, just as the gut microbiome shifts in response to diet.25–28 Shifts in microbiome composition or activity can be leveraged in the application of microbiome characterization to monitor the food supply chain. For example, Noyes et al. followed the microbiome of cattle from the feed lot to the food packaging, concluding that the microbial community and antibiotic resistance characteristics change based on the processing stage.17,18,29 We hypothesize that observable shifts in microbial communities of food can serve as an indicator of food quality and safety.
In this work, we examined 31 high protein powder samples (HPP; derived from poultry meal). HPP are commonly used raw materials in pet foods. They are subject to microbial growth prior to preparation and continued survival in powder form.30 We subjected the HPP samples to deep total RNA sequencing with ~300 million reads per sample. In order to process the 31 samples collected over ~1.5 years from two suppliers at a single location, we defined and calibrated the appropriate methods, from sample preparation to bioinformatic analysis, needed to taxonomically identify the community members present and to detect key features of microbial growth. First, we removed the HPP's food matrix RNA content as eukaryotic background with an important bioinformatic filtering step designed specifically for food analysis. The remaining sequences were used for relative quantification of microbiome members and for identifying shifts based on food matrix content, production source, and Salmonella culturability. This work demonstrates that total RNA sequencing is a robust approach for monitoring the food microbiome for use in food safety and quality applications, while additional work is required for predicting pathogen viability.

2. RESULTS:
2.1 Evaluation of microbial identification capability in total RNA and DNA sequencing
Microbial identification in microbiomes often leverages shotgun DNA sequencing; however, total RNA sequencing can provide additional information about viable bacterial activity in a community via transcriptional activity. Since using total RNA to study food microbiomes is novel, each step of the analysis workflow (Figure 1) was carefully designed and scrutinized for accuracy. For all analyses done in this study, we report relative abundance in reads per million (RPM) (Equation 1) as recommended by Gloor et al.31,32 and apply the conservative threshold of RPM > 0.1 to indicate presence as previously described by Langelier et al. and Ilott et al.33,34 Numerically, this threshold translates to ~30 reads per genus per sample considering a sequencing depth of ~300 million reads per sample (Methods Section 4.4). First, we examined the effectiveness of RNA for taxonomic identification and relative quantification of microbes in the presence of food matrix reads. We observed that RNA sequencing results correlated (R² = 0.93) with the genus relative quantification provided by DNA sequencing (Supplementary Figure S1). RNA sequencing also detected more genera than DNA sequencing, as demonstrated by a higher α-diversity (Supplementary Figure S2). Additionally, from the same starting material, total RNA sequencing resulted in 2.4-fold more reads classified to microbial genera compared to total DNA sequencing (after normalizing for sequencing depth). This increase is substantial, as microbial reads are such a small fraction of the total sequenced reads. Considering these results, we further examined the microbial content from total RNA extracted from 31 high protein powder (HPP) samples (Supplementary Table 1) that resulted in an average of ~300 million paired end 150 bp sequencing reads per sample in this study.
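The RPM threshold arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's code: Equation 1 is not reproduced in this section, so the sketch assumes the common definition of RPM (reads assigned to a genus divided by the total sequenced reads of the sample, times one million). The function names and the example counts are hypothetical.

```python
def reads_per_million(genus_reads, total_reads):
    """Relative abundance in reads per million (assumed form of Equation 1)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return genus_reads / total_reads * 1_000_000


def min_reads_for_presence(total_reads, rpm_threshold=0.1):
    """Read count at which a genus reaches the RPM presence threshold."""
    return rpm_threshold * total_reads / 1_000_000


def present_genera(genus_counts, total_reads, rpm_threshold=0.1):
    """Genera whose RPM is strictly above the threshold (RPM > 0.1 in the text)."""
    return {
        genus
        for genus, reads in genus_counts.items()
        if reads_per_million(reads, total_reads) > rpm_threshold
    }


# Hypothetical counts for one sample sequenced to ~300 million reads.
depth = 300_000_000
counts = {"Bacteroides": 44_000, "Yersinia": 31, "Listeria": 29}
print(min_reads_for_presence(depth))          # about 30 reads per genus
print(sorted(present_genera(counts, depth)))  # Listeria falls below the threshold
```

At a depth of ~300 million reads, the threshold corresponds to roughly 30 reads per genus, which matches the figure quoted in the text.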
2.2 Evaluation and application of in silico filtering of eukaryotic food matrix reads
Sequenced reads from the eukaryotic host or food matrix may lead to false positives for microbial identification in microbiome studies.35 This may occur partly due to reads originating from low complexity regions of eukaryotic genomes, e.g. telomeric and centromeric repeats, being misclassified as spurious microbial hits.36 In total DNA or RNA sequencing of clinical, animal, or even plant microbiomes, eukaryotic content may often comprise > 90% of the total sequencing reads. This presents an important bioinformatic challenge that we addressed by filtering matrix content against a custom-built reference database of 31 common food ingredient and contaminant genomes (Supplementary Table 2) with the k-mer classification tool Kraken.37 This step allows for rapidly classifying all sequenced reads (~300 million reads for each of 31 samples) as matrix or non-matrix. The matrix filtering process yielded an estimate of the total percent matrix content for a sample. See our work in Haiminen et al.38 on quantifying the eukaryotic food matrix components with further precision.
To validate the matrix filtering step, we constructed in silico mock food microbiomes with a high proportion of complex food matrix content and low microbial content (Supplementary Table 3). We then computed the true positive, false positive, and false negative rates of observed microbial genera and sequenced reads (Table 1). False positive viral, archaeal, and eukaryotic microbial genera (as well as bacteria) were observed without matrix filtering, although bacteria were the only microbes included in the simulated mixtures. Introducing a matrix filtering step to the pipeline improved read classification specificity to >99.96% (from 78–93% without filtering) in both simulated food mixtures, while maintaining zero false negatives. With this level of demonstrated accuracy, we used bioinformatic matrix filtering prior to further microbiome analysis.
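The validation metrics named above can be illustrated with a short Python sketch. It assumes the usual definition of specificity, TN / (TN + FP), with matrix-derived reads treated as the negative class; this section does not spell out the exact formula used for Table 1, so that choice is an assumption. The function names, genus sets, and read counts are hypothetical.

```python
def genus_level_errors(expected, observed):
    """Compare the genera spiked into a mock mixture with the genera reported."""
    expected, observed = set(expected), set(observed)
    return {
        "true_positives": expected & observed,
        "false_positives": observed - expected,
        "false_negatives": expected - observed,
    }


def specificity(true_negatives, false_positives):
    """Fraction of negative (here: matrix) reads not reported as microbial."""
    denominator = true_negatives + false_positives
    if denominator == 0:
        raise ValueError("no negative reads to evaluate")
    return true_negatives / denominator


# Hypothetical mock mixture: two bacterial genera spiked in, one spurious viral hit.
errors = genus_level_errors(
    expected={"Salmonella", "Listeria"},
    observed={"Salmonella", "Listeria", "Gyrovirus"},
)
print(errors["false_positives"])

# Hypothetical read-level counts: 1,000,000 matrix reads, 300 called microbial.
print(specificity(true_negatives=999_700, false_positives=300))
```

Reporting genus-level and read-level errors separately mirrors the distinction the text draws between observed microbial genera and sequenced reads.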
2.3 High protein powder microbiome ecology
After filtering eukaryotic matrix sequences, we applied the remaining steps in the bioinformatic workflow (Figure 1) to examine the shift in the high protein powder (HPP) microbiome membership and to quantify the relative abundance of microbes at the genus level. Genus is the first informative taxonomic rank for food pathogen identification that can be considered accurate given current incompleteness of reference databases11,39–42 and was therefore used in subsequent analyses. Overall, between 98 and 195 microbial genera (avg. 119) were identified (RPM > 0.1) per HPP sample (Supplementary Table 4). When analyzing α-diversity, i.e. the number of microbes detected per sample, inter-sample comparisons may become skewed unless a common number of reads is considered, since deeper sequenced samples may contain more observed genera merely due to a greater sampling depth.43,44 Thus, we used bioinformatic rarefaction, i.e. subsampling analysis, to show how microbial diversity was altered by sequencing depth. Examination of α-diversity across a range of in silico subsampled sequencing depths showed that the community diversity varied across samples (Figure 2A). One sample (MFMB-04) had 1.7 times more genera (195) than the average across other samples (avg. 116, range 98–143) and exhibited higher α-diversity than any other sample at each in silico sampled sequencing depth (Figure 2A). Rarefaction analysis further demonstrated that when considering fewer than ~67 million sequenced reads, the observable microbial population was not saturated (median elbow calculated as indicated in Satopää et al.45). This observation suggests that deeper sequencing or more selective sequencing of the HPP microbiomes will reveal more microbial diversity.
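The idea behind rarefaction can be sketched as follows. This is a toy Python illustration under stated assumptions, not the procedure used for Figure 2A: it subsamples already-classified reads without replacement and counts genera with at least `min_reads` reads, whereas the study subsampled sequencing depth and applied the RPM threshold. The function names and counts are hypothetical, and expanding one label per read is only practical for small examples.

```python
import random


def rarefy_genus_counts(genus_counts, depth, seed=0):
    """Draw `depth` reads without replacement and recount reads per genus."""
    labels = [genus for genus, n in genus_counts.items() for _ in range(n)]
    if depth > len(labels):
        raise ValueError("depth exceeds the number of available reads")
    rng = random.Random(seed)
    subsampled = {}
    for genus in rng.sample(labels, depth):
        subsampled[genus] = subsampled.get(genus, 0) + 1
    return subsampled


def rarefaction_curve(genus_counts, depths, min_reads=1, seed=0):
    """Number of genera observed at each subsampled depth."""
    curve = []
    for depth in depths:
        sub = rarefy_genus_counts(genus_counts, depth, seed)
        observed = sum(1 for n in sub.values() if n >= min_reads)
        curve.append((depth, observed))
    return curve


# Hypothetical sample dominated by one genus, with two rare genera.
toy_counts = {"Bacteroides": 50, "Clostridium": 5, "Macrococcus": 1}
print(rarefaction_curve(toy_counts, depths=[5, 20, 56]))
```

Rare genera tend to appear only at the larger depths, which is why observed diversity is compared at a common number of reads.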
Notably, between 2% and 4% (approximately 5,000,000–14,000,000) of reads per sample remained unclassified as either eukaryotic matrix or microbe (Supplementary Figure S3). However, the unclassified reads exhibited a GC (guanine plus cytosine) distribution similar to reads classified as microbial (Supplementary Figure S4), indicating these reads may represent microbial content that is absent or sufficiently divergent from existing references.
We calculated β-diversity to study inter-sample microbiome differences and to identify any potential outliers among the sample collection. The Aitchison distances46 of microbial relative abundances were calculated between samples (as recommended for compositional microbiome data31,32), and the samples were hierarchically clustered based on the resulting distances (Figure 2B). The two primary clades were mostly defined by supplier (except for MFMB-17). In Haiminen et al.,38 we reported that three of the HPP samples contained unexpected eukaryotic species. We hypothesized that the presence of these contaminating matrix components (beef identifiable as Bos taurus and pork identifiable as Sus scrofa) would alter the microbiome as compared to chicken (identifiable as Gallus gallus) alone. Clustering HPP samples using their microbiome membership placed the matrix-contaminated samples in a distinctly different group, supporting this hypothesis (Figure 2B). These observations indicate that samples can be discriminated based on their microbiome content for originating source and supplier, which is necessary for source tracking potential hazards in food.
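For readers unfamiliar with the Aitchison distance, a minimal Python sketch follows. It uses the standard formulation (Euclidean distance between centered log-ratio transformed abundance vectors). How zeros were handled in the study is not stated in this section, so the `pseudocount` parameter is an assumption, as are the function names and example values. The hierarchical clustering step is omitted; the resulting pairwise distances could be passed to any standard clustering routine.

```python
import math


def clr(abundances, pseudocount=0.0):
    """Centered log-ratio transform of one sample's abundance vector."""
    values = [a + pseudocount for a in abundances]
    if any(v <= 0 for v in values):
        raise ValueError("abundances must be positive (use a pseudocount for zeros)")
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def aitchison_distance(sample_x, sample_y, pseudocount=0.0):
    """Euclidean distance between the CLR-transformed samples."""
    if len(sample_x) != len(sample_y):
        raise ValueError("samples must list the same genera in the same order")
    cx, cy = clr(sample_x, pseudocount), clr(sample_y, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


# Hypothetical RPM values for three genera in three samples.
s1 = [150.0, 40.0, 10.0]
s2 = [1500.0, 400.0, 100.0]  # same composition as s1 at a 10x larger scale
s3 = [20.0, 90.0, 60.0]
print(aitchison_distance(s1, s2))  # close to 0: only the total differs
print(aitchison_distance(s1, s3))  # clearly larger: the composition differs
```

The distance depends only on ratios between genera, which is why it suits compositional data such as relative abundances.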
2.4 Comparative analysis of high protein powder microbiome membership and composition
We identified 65 genera present in all HPP samples (Figure 3A), whose combined abundance accounted for between 88% and 99% of the total abundances of detected genera per sample. Bacteroides, Clostridium, Lactococcus, Aeromonas, and Citrobacter were the five most abundant of these microbial genera. The identified microbial genera also included viruses, the most abundant of which was Gyrovirus (< 10 RPM per sample). Gyrovirus is a genus of non-enveloped DNA viruses responsible for chicken anemia, which is ubiquitous in poultry. While there were only 65 microbial genera identified in all 31 HPP samples, the α-diversity per sample was on average approximately two-fold greater, as previously indicated.
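The notion of genera shared by all samples, and the share of abundance they account for, can be expressed with simple set operations. The Python sketch below is illustrative only; it assumes presence is defined by RPM > 0.1 as elsewhere in the text, and the function names, sample names, and abundance values are hypothetical.

```python
def core_genera(samples, threshold=0.1):
    """Genera whose RPM exceeds the threshold in every sample."""
    present = [
        {genus for genus, rpm in genus_rpm.items() if rpm > threshold}
        for genus_rpm in samples.values()
    ]
    return set.intersection(*present) if present else set()


def core_fraction(genus_rpm, core, threshold=0.1):
    """Share of a sample's detected abundance contributed by the core genera."""
    detected = {g: rpm for g, rpm in genus_rpm.items() if rpm > threshold}
    total = sum(detected.values())
    if total == 0:
        raise ValueError("no genera detected above the threshold")
    return sum(rpm for g, rpm in detected.items() if g in core) / total


# Hypothetical RPM tables for two samples.
toy_samples = {
    "S1": {"Bacteroides": 150.0, "Clostridium": 40.0, "Macrococcus": 10.0},
    "S2": {"Bacteroides": 120.0, "Clostridium": 30.0, "Macrococcus": 0.05},
}
core = core_genera(toy_samples)
print(sorted(core))
print(core_fraction(toy_samples["S1"], core))
```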
Beyond the collection of 65 microbes observed in all samples, there were an additional 164 microbes present in various HPP samples. Together, we identified a total of 229 genera among the 31 HPP samples tested (Figures 3B and 4, Supplementary Table 4). In order to identify genera that were most variable between samples, we computed the median absolute deviation (MAD)47 using the normalized relative abundance of each microbe (Figure 5A). The abundance of Bacteroides was the most variable among samples (median = 148.1 RPM, MAD = 30.6) and showed increased abundance in almost all samples from Supplier A compared to Supplier B (abundance for the 10 most variable genera shown in Figure 5B). Clostridium (median = 37.4 RPM, MAD = 24.2), Lactococcus (median = 36.8 RPM, MAD = 18.2), and Lactobacillus (median = 24.2 RPM, MAD = 7.2) were also highly variable and 3–4 fold more abundant in samples MFMB-04 and MFMB-20 compared to other samples (Figure 5B). Pseudomonas (median = 11.1 RPM, MAD = 12.2) was markedly more abundant in MFMB-83 than any other sample (Figure 5B). These genera highlight variability between microbiomes from a single food source and may provide insights into other dissimilarities in these samples.
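A compact Python sketch of ranking genera by median absolute deviation is given below. It is an illustration under assumptions: the text does not state whether the MAD was scaled, so `scale` defaults to 1.0 (the plain median of absolute deviations), while some statistical packages, such as R's `mad`, multiply by 1.4826 by default. The function names and abundance values are hypothetical.

```python
from statistics import median


def median_absolute_deviation(values, scale=1.0):
    """Median of absolute deviations from the median, times an optional scale."""
    values = list(values)
    center = median(values)
    return scale * median(abs(v - center) for v in values)


def rank_by_mad(genus_abundances, scale=1.0):
    """Genera sorted from most to least variable across samples."""
    mads = {
        genus: median_absolute_deviation(values, scale)
        for genus, values in genus_abundances.items()
    }
    return sorted(mads.items(), key=lambda item: item[1], reverse=True)


# Hypothetical RPM values across five samples.
toy_abundances = {
    "Bacteroides": [120.0, 148.0, 150.0, 190.0, 210.0],
    "Pseudomonas": [9.0, 10.0, 11.0, 12.0, 300.0],
}
print(rank_by_mad(toy_abundances))
```

Because the MAD is based on medians, a single extreme sample (as in the hypothetical Pseudomonas row) has little effect on it, unlike the standard deviation.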
2.5 Microbiome shifts in response to changes in food matrix composition
We tested the hypothesis that the microbiome composition will shift in response to changes in the food matrix and can be a unique signal to indicate contamination or adulteration. In 28 of the 31 HPP samples, >99% of the matrix reads were determined in our related work38 to originate from poultry (Gallus gallus), which was the only ingredient expected based on ingredient specifications. However, three samples had higher pork and beef content than all other HPP samples: MFMB-04 (7.74% pork, 8.99% beef), MFMB-20 (0.53% pork, 1.00% beef), and MFMB-38 (0.92% pork, 0.29% beef). By comparison, the highest pork and beef contents among the other 28 HPP samples were 0.01% and 0.00%, respectively (Supplementary Data by Haiminen et al.38). The microbiomes of these matrix-contaminated samples also clustered into a separate sub-cluster (Figure 2B). This demonstrated that a shift in the food matrix composition was associated with an observable shift in the food microbiome.
MFMB-04 and MFMB-20 had the highest percentage of microbial reads compared to other samples (Supplementary Figure S3). They also exhibited an increase in Lactococcus, Lactobacillus, and Streptococcus relative abundances compared to other samples (Figure 5B), which was also reflected at the respective taxonomic levels above genus (Supplementary Figure S5).
There were 53 genera identified uniquely in MFMB-04 and/or MFMB-20, but not present in any other sample. (MFMB-38 had a very low microbial load and contributed no uniquely identified genera above the abundance threshold.) MFMB-04 contained 44 unique genera (Figure 4) with the most abundant being Macrococcus (35.8 RPM), Psychrobacter (23.8 RPM), and Brevibacterium (18.1 RPM). Additionally, Paenalcaligenes was present above the detection threshold only in MFMB-04 and MFMB-20, with an RPM of 6.4 and 0.3, respectively, compared to a median RPM of 0.004 among other samples. Notable differences in the matrix-contaminated samples' unique microbial community membership compared to other samples may provide microbial indicators associated with unanticipated pork or beef presence.
2.6 Genus level identification of foodborne microbes
We evaluated the ability of total RNA sequencing to identify genera of commonly known foodborne pathogens within the microbiome. We focused on fourteen pathogen-containing genera, namely Aeromonas, Bacillus, Campylobacter, Clostridium, Corynebacterium, Cronobacter, Escherichia, Helicobacter, Listeria, Salmonella, Shigella, Staphylococcus, Vibrio, and Yersinia, that were found to be present in the HPP samples with varying relative abundances. Of these genera, Aeromonas, Bacillus, Campylobacter, Clostridium, Corynebacterium, Escherichia, Salmonella, and Staphylococcus were detected in every HPP with median abundance values between 0.58–48.31 RPM (Figure 6A). This indicated that a baseline fraction of reads can be attributed to foodborne microbes when using NGS. Among the genera appearing in all samples, sample-to-sample variation in abundance was observed, with some genera exhibiting longer tails of high abundance, e.g. Staphylococcus and Salmonella, whereas others exhibited very low abundance barely above the threshold of detection, e.g. Bacillus and Yersinia (Figure 6A). None of the pathogen-containing genera showed consistently higher relative abundances attributable to differences in food matrix composition. Bacillus and Corynebacterium exhibited slightly higher relative abundances in sample MFMB-04, which contained 7.7% pork and 9.0% beef (Figure 6B). Yet while MFMB-04 contained higher cumulative levels of these foodborne microbes, the next highest sample was MFMB-93, which was not associated with altered matrix composition, and both MFMB-04 and MFMB-93 contained higher levels of Staphylococcus (Figure 6B). Thus, matrix composition alone did not explain variations of these pathogen-containing genera.
Interestingly, low to moderate levels of Salmonella were detected within all 31 HPP microbiomes (Figure 6A). The presence of Salmonella in HPP is expected, but the viability of Salmonella is an important indicator of safety and quality. Thus, we further sought to delineate Salmonella growth capability within these microbiomes by comparing culturability with multiple established bioinformatic NGS methods for Salmonella relative abundances in the samples.
2.7 Assessment of Salmonella culturability and total RNA sequencing
Total RNA sequencing of food microbiomes has the potential to provide additional sensitivity beyond standard culture-based food safety testing to confirm or reject the presence of potentially pathogenic microbes. In all of the examined HPP samples, some portion of the sequenced reads were classified as belonging to pathogen-containing genera (Figure 6); however, the presence of RNA transcripts does not necessarily indicate current growth of the organism itself. We further inspected one pathogen of interest, Salmonella, to determine the congruence between sequencing-based and culturability results. Of the 31 samples examined with total RNA sequencing, Salmonella culture testing was applied to 27 samples, of which four were culture-positive. Surprisingly, Salmonella culture-positive samples were not among those with the highest relative abundance of Salmonella from sequencing (Figure 7A). When ranking the samples by decreasing Salmonella abundance, the culture-positive samples were not enriched for higher ranks (p=0.86 from Wilcoxon rank sum test, indicating that the distributions are not significantly different, Table 2). To confirm that the microbiome analysis pipeline did not miss Salmonella reads present, we completed two orthogonal analyses on the same data set used in the microbial identification step. The reference genomes relevant to these additional analyses were publicly available, closed, high quality genomes from the sources indicated below.
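The rank-based comparison of culture-positive and culture-negative samples can be illustrated with an exact rank sum test. The Python sketch below is a generic stdlib implementation, not the analysis behind Table 2: it computes a one-sided exact p-value by enumerating all possible group assignments, and whether the reported p-values were one-sided or two-sided, exact or approximate, is not stated in this section. The function names and abundance values are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_exact_p(group_a, group_b):
    """One-sided exact p-value: P(rank sum of A >= observed) under the null."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    observed = sum(ranks[:n_a])
    total = extreme = 0
    # Enumerate every way of choosing which n_a samples form group A.
    for combo in combinations(ranks, n_a):
        total += 1
        if sum(combo) >= observed - 1e-9:
            extreme += 1
    return extreme / total


# Hypothetical Salmonella RPM values: 2 culture-positive vs 3 culture-negative.
print(rank_sum_exact_p([10.0, 11.0], [1.0, 2.0, 3.0]))  # positives rank highest
print(rank_sum_exact_p([1.0, 2.0], [3.0, 4.0, 5.0]))    # positives rank lowest
```

With 4 culture-positive samples out of 27, the enumeration covers 17,550 assignments, which remains fast to compute exactly.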
First, for a targeted analysis, we aligned the sequenced reads using a different tool, Bowtie 2,48 to an augmented Salmonella-only reference database. This reference comprised the 264 Salmonella genomes extracted from NCBI RefSeq Complete (used in our previous microbial identification step) as well as an additional 1,183 public Salmonella genomes which represent global diversity within the genus.49 The number of reads that aligned to the Salmonella-only reference was on average 370-fold higher than the number identified as Salmonella by Kraken using the multi-microbe NCBI RefSeq Complete. In this additional analysis, the culture-positive samples had overall higher ranks compared to culture-negative samples (p=0.06, Table 2), indicating that additional Salmonella genomic data in the reference improved discriminatory identification power, although the difference did not reach the conventional 0.05 significance level. Salmonella culture-positive samples were still not the most abundant (Figure 7B), but with an enriched database, sequencing positioned all four culturable samples within the top 10 ranking.
The second additional analysis examined alignment of the reads to a specific gene required50 for replication and protein production in actively dividing Salmonella: elongation factor Tu (ef-Tu). This was done by aligning the reads to 4,846 gene sequences for ef-Tu extracted from a larger corpus of Salmonella genomes from OMXWare.51 The relative abundances of this transcript in culture-positive samples were still comparable to culture-negative samples (Figure 7C). Culture-positive samples did not exhibit higher ranks compared to culture-negative samples (p=0.56, Table 2), indicating that ef-Tu relative abundance alone was not sufficient to resolve the lack of concordance between culturability and sequencing. These two orthogonal analyses demonstrated that results from carefully developed culture-based testing and those from current high-throughput sequencing technologies, whether assessed at overall reads aligned or specific gene abundances, were not conclusively in agreement when detecting active Salmonella in food samples (Figure 7 and Table 2). However, the use of a reference database enriched in whole genome sequences of the specific organism of interest was found appropriate for food safety applications.
Since microbes compete for available resources within an environmental niche and therefore impact one another,52 we investigated Salmonella culture results in conjunction with co-occurrence patterns of other microbes in the total RNA sequencing data (Figure 8). Point-biserial correlation coefficients (rpb) were calculated between Salmonella culturability results (presence or absence, which were available for 27 of the 31 samples) and microbiome relative abundance. We observed 31 genera that positively correlated with Salmonella presence (rpb > 0.5). Erysipelothrix, Lactobacillus, Anaerococcus, Brachyspira, and Jeotgalibaca exhibited the largest positive correlations. Gyrovirus was negatively correlated with Salmonella growth (rpb = -0.54). In three of the four Salmonella-positive samples (MFMB-04, MFMB-20, and MFMB-38), food matrix contamination was also observed (Supplementary Data in Haiminen et al.38). The concurrency of Salmonella growth and matrix contamination was affirmed by the microbial co-occurrence (specifically Erysipelothrix, Brachyspira, and Gyrovirus). This highlights the complex dynamic and community co-dependency of food microbiomes, yet shows that multiple dimensions of the data (microbiome composition, culture-based methods, and microbial load) will signal anomalies from typical samples when there is an issue in the supply chain.
333
3. DISCUSSION:
Accurate and appropriate tests for detecting potential hazards in the food supply chain are key to ensuring consumer safety and food quality. Monitoring and regular testing of raw ingredients can reveal fluctuations within the supply chain that may be an indicator of an ingredient’s quality or of a potential hazard. Such quality is assessed by standardized tests for chemical and microbial composition to meet legal requirements and specifications from government agencies throughout the world. For raw materials or finished products to meet these bounds of safety and quality, their composition must usually have a low microbiological load (except in fermented foods) and be chemically identical in macro-components such as carbohydrate, protein, and fat. Methods in this space must avoid false negative results, which could endanger consumers, while also minimizing false positives, which could lead to unnecessary recalls and food loss.
Existing microbial detection technologies used in food safety today, such as pulsed-field gel electrophoresis (PFGE) and whole genome sequencing (WGS), require microbial isolation. This provides biased outcomes as it removes microbes from their native environment where other biotic members also subsist, and selects microbes by culturability alone. Amplicon sequencing, while a low-cost alternative to metagenome or metatranscriptome sequencing for bacteria, also imparts PCR amplification bias and reduces detection sensitivity due to reliance on a single gene (16S ribosomal RNA).14,53,54 We therefore investigated the utility of total RNA sequencing of food microbiomes and demonstrated that this single test yields several pertinent results about food safety and quality.
For this evaluation, we developed a pipeline to characterize the microbiome of typical food ingredient samples and to detect potentially hazardous outliers. Special considerations for food samples were made as computational pipelines for human or other microbiome analyses are not sufficient for applications in food safety without modification. In food, the eukaryotic matrix needs to be confirmed, may be mixed, and, as we and others have shown, affects the identification accuracy of microbes that are present.35,36 By filtering food matrix sequence data properly, we avoid incorrect microbial identification and characterization of the microbiome36 while also increasing the computational efficiency for downstream processing. The addition of this filtering step in the pipeline removed approximately 90% of false positive genera and provided results at 99.96% specificity when evaluating simulated mixtures of food matrix and microbes (Table 1).
Through the analysis of 31 high protein powder total RNA sequencing samples, we demonstrated the pipeline’s ability to characterize food microbiomes and indicate outliers. In this sample collection, we identified a core catalog of 65 microbial genera found in all samples where Bacteroides, Clostridium, and Lactococcus were the most abundant (Supplementary Table 4). We also demonstrated that in these food microbiomes the overall diversity was 2-fold greater than the core microbe set. Fluctuations in the microbiome can indicate important differences between samples as observed here, as well as in the literature for grape berry6 and apple fruit microbiomes (pertaining to organic versus conventional farming),7 or indicate inherent variability between production batches or suppliers as observed here and during cheddar cheese manufacturing.8 Specifically, we observed a shift in the microbial composition (Figure 2B) and the microbial load (Supplementary Figure S3) in high protein powder samples (derived from poultry meal) where unexpected pork and beef were observed. Matrix-contaminated samples were marked by increased relative abundances of specific microbes including Lactococcus, Lactobacillus, and Streptococcus (Figure 5B). This work shows that the microbiome shifts with observed food matrix contamination from sources with similar macronutrient content, and thus the microbiome alone is a likely signal of compositional change in food.
Beyond shifts in the microbiome, we focused on a set of well-defined foodborne pathogen-containing genera and explored their relative abundances observed from total RNA sequencing. Of these genera, Aeromonas, Bacillus, Campylobacter, Clostridium, Corynebacterium, Escherichia, Salmonella, and Staphylococcus were detected in every HPP sample. This highlights that when using NGS there may be an observable baseline of sequences assigned to potentially pathogenic microbes. For this ingredient type, this result provides a range of normalcy for relative abundance generated by NGS. Further work is needed to establish a definitive and quantitative range of typical variation in samples of a particular food source and the degree of anomaly for a new sample or genus abundance. However, preliminary studies of this nature can inform the development of guidelines when working with increasingly sensitive shotgun metagenomic or metatranscriptomic analysis.
Furthermore, sequenced DNA or RNA alone does not imply microbial viability. Therefore, we investigated the relatedness of culture-based tests and total RNA sequencing for the pathogenic bacterium Salmonella in the high protein powder samples. As has been reported for human gut55 and deep sea56 microbiomes, we also did not detect a correlation between Salmonella read abundance and culturability (Figure 7 and Table 2). Sequence reads matching Salmonella references were observed for all samples (both culture-positive and culture-negative) as determined by multiple analysis techniques: microbiome classification, alignment to Salmonella genomes, and targeted growth gene analysis. When ranking the high protein powder samples based on Salmonella abundance from whole genome alignments, the culture-positive samples were enriched for higher ranks (p = 0.06). However, the culture-positive samples were still intermixed in ranking with culture-negative samples. This indicated that there was no clear minimum threshold of sequence data as evidence for culturability and that this analysis alone is not predictive of pathogen growth. One possible reason for this is that the culture-positive variant of Salmonella is missing from existing reference data sets. Another possibility is that Salmonella attained a nonculturable state wherein it was detected by sequencing techniques yet remained nonculturable from the HPP sources. Successful isolation of total RNA and DNA and gene expression analysis from experimentally known nonculturable bacteria has been demonstrated by Ganesan et al. in multiple studies in other genera.19,22 Physiological state should thus be taken under consideration when benchmarking sequencing technologies in comparison with culture-based methods. Overall, total RNA sequencing of food samples may identify shifts that standard food testing does not, but the incongruity between sequencing read data and culture-based results highlights the need to perform more benchmarking in food microbiome analysis for pathogen detection.
The characterization of HPP food microbiomes leveraged currently accepted public reference databases, yet it is known that these databases are still inadequate.1,2,11,57,58 Furthermore, when considering congruence between Salmonella culturability and NGS read mapping techniques, the genetic breadth and depth of multi-genome reference sequences is essential. For example, focusing on ef-Tu, a known marker gene for Salmonella growth, was not sufficient to mirror viability of in vitro culture tests. This highlights the limitations of single gene approaches for identification. When the sequenced reads were examined in the context of an augmented reference collection of Salmonella genomes, we observed improved ranking and read mapping rate for culture-positive samples (yet we did not achieve complete concordance). This improvement underlined the increased analytical robustness yielded from a multi-genome reference. We also recognize that the read mapping rate may be exaggerated as reads from non-Salmonella genomes could map to Salmonella in the absence of any other reference genomes. Overall, for robust analysis and applicability to food safety and quality, microbial references must be expanded to include more genetically diverse representatives of pathogenic and spoilage organisms. Description of food microbiomes will only improve as additional public sequence data is collected and leveraged.
In our sample collection, 2-4% (effectively 5 to 14 million) of reads remain unclassified. The GC content distribution of unclassified reads matched microbial GC content distribution (Supplementary Figure S4), suggesting that these reads may have been derived from microbes missing from the current reference database that have not yet been isolated or sequenced. By sequencing the microbiome, we sampled environmental niches in their native state in a culture-independent manner and therefore collected data from diverse and potentially never-before-seen microbes. Tracking unclassified reads will also be essential for monitoring food microbiomes. The inability to provide a name from existing references does not eliminate the possibility that the sequence is from an unwanted microbe or indicates a hazard. In addition to tracking known microbes, quantitative or qualitative shifts in the unclassified sequences might be used to detect when a sample is different from its peers.
We demonstrated the potential utility of analyzing food microbiomes for food safety using raw ingredients. This study resulted in the detection of shifts in the microbiome composition corresponding to unexpected matrix contaminants. This signifies that the microbiome is likely an important and effective hazard indicator in the food supply chain. While we have used total RNA sequencing for detection of microbiome membership, the technology has future applicability for detection of antimicrobial resistance, virulence, and biological function for multiple food sources, and for other sample types. Notably, while this pipeline was developed for food monitoring, with applicable modifications and identification of material-specific indicators, it can be applied to other microbiomes including human and environmental.

4. METHODS:
4.1 Sample Collection, Preparation, and Sequencing
High protein powder (HPP, 2.5 kg) samples were each collected from a train car in Reno, NV, USA between April 2015 and February 2016 in four batches from two suppliers and shipped to the Weimer lab at the University of California, Davis (Davis, CA). Each HPP sample was composed of five sub-samples from random locations within the train car prior to shipment. Sample preparation, total RNA extraction and integrity confirmation, cDNA construction, and library construction for these samples were previously described by Haiminen et al.38
Sequencing was performed by BGI@UC Davis (Sacramento, CA) using Illumina HiSeq 4000 (San Diego, CA) with 150 bp paired-end chemistry for each sample except the following: HiSeq 3000 with 150 bp paired-end chemistry was used for MFMB-04 and MFMB-17. All total RNA sequencing data are available via the 100K Pathogen Genome Project BioProject (PRJNA186441) at NCBI (Supplementary Table 1).
For evaluation of total RNA sequencing for microbial classification in paired processing steps, total RNA and total DNA were extracted from the same sample and denoted as MFMB-03 and MFMB-08, respectively. Total RNA was extracted and sequenced as described above. Total DNA was extracted and sequenced as described previously.10,59–64 The Illumina HiSeq 2000 with 100 bp paired-end chemistry was used for MFMB-03 and MFMB-08.

4.2 Sequence Data Quality Control
Illumina Universal adapters were removed and reads were trimmed using Trim Galore65 with a minimum read length parameter of 50 bp. The resulting reads were filtered using Kraken37, as described below in Section 4.3, with a custom database built from the PhiX genome (NCBI Reference Sequence: NC_001422.1). Removal of PhiX content is suggested as it is a common contaminant in Illumina sequencing data.66 Trimmed non-PhiX reads were used in subsequent matrix filtering and microbial identification steps.

4.3 Matrix Filtering Process and Validation
Kraken37 with a k-mer size of 31 bp (optimal size described in the Kraken reference publication) was used to identify and remove reads that matched a pre-determined list of 31 common food matrix and potential contaminant eukaryotic genomes (Supplementary Table 2). These food matrix organisms were chosen based on preliminary eukaryotic read alignment experiments of the HPP samples as well as high-volume food components in the supply chain. Due to the large size of eukaryotic genomes in the custom Kraken37 database, a random k-mer reduction was applied to reduce the size of the database by 58% using kraken-build with option --max-db-size, in order to fit the database in 188 GB for in-memory processing. A conservative Kraken score threshold of 0.1 was applied to avoid filtering microbial reads. The matrix filtering database includes low complexity and repeat regions of eukaryotic genomes to capture all possible matrix reads. This filtering database with the score threshold was also used in the matrix filtering in silico testing as described below.
Matrix filtering was validated by constructing synthetic paired-end reads (150 bp) using DWGSIM67 with mutations from reference sequences using the following parameters: base error rate (e) = 0.005, outer distance between the two ends of a read pair (d) = 500, rate of mutations (r) = 0.001, fraction of indels (R) = 0.15, probability an indel is extended (X) = 0.3. Reference sequences are detailed in Supplementary Table 3. We constructed two in silico mixtures of sequencing reads by randomly sampling reads from eukaryotic reference genomes. Simulated Food Mixture 1 comprised nine species with the following number of reads per genome: 2M cattle, 2M salmon, 1M goat, 1M lamb, 1M tilapia (transcriptome), 962K chicken (transcriptome), 10K duck, 1K horse, and 1K rat, totaling 7.974M matrix reads. Simulated Food Mixture 2 contained 5M soybean, 4M rice, 3M potato, 2M corn, 200K rat, and 10K drain fly reads, totaling 14.210M matrix reads. Both simulated food mixtures included 1,000 microbial sequence reads generated from each of 15 different microbial species for a total of 15K microbial sequence reads (Supplementary Table 3).

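The read totals of the two simulated mixtures can be checked with a short calculation. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, and it assumes that the 15K microbial reads (1,000 reads from each of 15 species) are added on top of the stated matrix totals.

```python
# Read counts per source genome, copied from the mixture descriptions above.
MIXTURE_1 = {
    "cattle": 2_000_000, "salmon": 2_000_000, "goat": 1_000_000,
    "lamb": 1_000_000, "tilapia": 1_000_000, "chicken": 962_000,
    "duck": 10_000, "horse": 1_000, "rat": 1_000,
}
MIXTURE_2 = {
    "soybean": 5_000_000, "rice": 4_000_000, "potato": 3_000_000,
    "corn": 2_000_000, "rat": 200_000, "drain fly": 10_000,
}
# Assumption: 1,000 reads from each of 15 microbial species per mixture.
MICROBIAL_READS = 15 * 1_000


def summarize(matrix_counts, microbial_reads=MICROBIAL_READS):
    """Return (total matrix reads, microbial fraction of all simulated reads)."""
    matrix_total = sum(matrix_counts.values())
    microbial_fraction = microbial_reads / (matrix_total + microbial_reads)
    return matrix_total, microbial_fraction


if __name__ == "__main__":
    for name, mixture in (("Mixture 1", MIXTURE_1), ("Mixture 2", MIXTURE_2)):
        total, fraction = summarize(mixture)
        print(f"{name}: {total:,} matrix reads, microbial fraction {fraction:.4%}")
```

Under these assumptions, microbial reads make up well under 1% of each simulated mixture, which is why a high-specificity matrix filter matters for the validation.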
4.4 Microbial Identification
Remaining reads after quality control and matrix filtering were classified using Kraken37 against a microbial database with a k-mer size of 31 bp to determine the microbial composition within each sample. NCBI RefSeq Complete68 genomes were obtained for bacterial, archaeal, viral, and eukaryotic microorganisms (~7,800 genomes retrieved April 2017). Low complexity regions of the genomes were masked using Dustmasker69 with default parameters. A threshold of 0.05 was applied to the Kraken score in an effort to maximize the F-score of the result (as demonstrated in Kraken’s operating manual).70 Taxa-specific sequence reads were used to calculate a relative abundance in reads per million (RPM; Equation 1) where $R_T$ represents the reads classified per microbial entity (e.g. the genus Salmonella) and $R_Q$ represents the number of sequenced reads remaining after quality control (trimming and PhiX removal) for an individual sample, including any reads classified as eukaryotic:

$$\mathrm{RPM} = \frac{R_T}{R_Q} \times 1{,}000{,}000 \qquad \text{(Equation 1)}$$

This value provides a relative abundance of the microbial entity of interest and was used in comparisons of taxa among samples. Genera with a conservative threshold of RPM > 0.1 were defined as present, as previously applied by others in the contexts of human infectious disease and gut microbiome studies.33,34 Pearson correlation of resulting microbial genus counts was computed.

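As an illustration of Equation 1 and the RPM > 0.1 presence threshold, a minimal Python sketch follows. It is not the pipeline code: the function names, the example genus "RareGenus", and all read counts are hypothetical and chosen only to show the arithmetic.

```python
PRESENCE_THRESHOLD_RPM = 0.1  # genera with RPM above this value count as present


def reads_per_million(taxon_reads, qc_reads):
    """Equation 1: reads assigned to a taxon per million quality-controlled reads."""
    if qc_reads <= 0:
        raise ValueError("qc_reads must be positive")
    return taxon_reads * 1_000_000 / qc_reads


def present_genera(genus_counts, qc_reads, threshold=PRESENCE_THRESHOLD_RPM):
    """Return {genus: RPM} for genera whose RPM exceeds the threshold."""
    result = {}
    for genus, count in genus_counts.items():
        rpm = reads_per_million(count, qc_reads)
        if rpm > threshold:
            result[genus] = rpm
    return result


# Hypothetical sample: 50 million reads remain after trimming and PhiX removal.
example_qc_reads = 50_000_000
example_counts = {"Bacteroides": 250_000, "Salmonella": 120, "RareGenus": 4}

if __name__ == "__main__":
    for genus, rpm in present_genera(example_counts, example_qc_reads).items():
        print(f"{genus}: {rpm:.2f} RPM")
```

In this hypothetical sample, four reads out of 50 million correspond to 0.08 RPM, so "RareGenus" falls below the threshold and is not reported as present.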
4.5 Community Ecology Analysis
Rarefaction analysis at multiple subsampled read depths $R_D$ was performed by multiplying the microbial genus read counts with $R_D/R_Q$ and rounding the results down to the nearest integer to represent observed read counts. Here $R_Q$ is the total number of reads in the sample after quality control (including microbial, matrix, and unclassified reads). Resulting α-diversity at read depth $R_D$ was computed as the number of genera with resulting RPM > 0.1 and plotted at five million read intervals: $R_D$ = 5M, 10M, 15M, …, $R_Q$. If, due to random sampling and rounding effects, the computed α-diversity was lower than the diversity computed at any previous depth, the previous higher α-diversity was used for plotting. The median elbow was calculated as previously described45 using the R package kneed.45
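The rarefaction procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: the function names and example counts are hypothetical, and it assumes that RPM at a subsampled depth is computed relative to that depth.

```python
def alpha_diversity(genus_counts, depth, total_reads, threshold=0.1):
    """Count genera with RPM > threshold after scaling counts to a read depth."""
    present = 0
    for count in genus_counts.values():
        # Integer floor of count * depth / total_reads (rounding down).
        scaled = count * depth // total_reads
        # Assumption: RPM at this depth uses the depth itself as denominator.
        rpm = scaled * 1_000_000 / depth
        if rpm > threshold:
            present += 1
    return present


def rarefaction_curve(genus_counts, total_reads, step=5_000_000):
    """Return [(depth, alpha-diversity)] at fixed steps up to the full depth.

    The running maximum keeps the curve non-decreasing, mirroring the rule
    that a previous higher diversity value is used for plotting.
    """
    depths = list(range(step, total_reads, step)) + [total_reads]
    curve, best = [], 0
    for depth in depths:
        best = max(best, alpha_diversity(genus_counts, depth, total_reads))
        curve.append((depth, best))
    return curve


# Hypothetical genus-level read counts for a sample with 8 million QC reads.
example_counts = {"GenusA": 400_000, "GenusB": 3, "GenusC": 1}

if __name__ == "__main__":
    for depth, diversity in rarefaction_curve(example_counts, 8_000_000):
        print(f"{depth:>10,} reads: {diversity} genera")
```

In this example the single read of "GenusC" is rounded down to zero at 5M reads and only appears at the full depth, which is the kind of low-abundance effect a rarefaction curve is meant to expose.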
In compositional data analysis,31 non-zero values are required when computing β-diversity based on Aitchison distance.46 Therefore, read counts assigned to each genus were pseudo-counted by adding one in advance of computation of RPM (Eq. 1) prior to calculating the Aitchison distance for the microbial table. β-diversity was calculated using the R package robCompositions71 and hierarchical clustering was performed using base R function hclust using the “ward.D2” method as recommended for compositional data analysis.31

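To make the pseudo-count and Aitchison distance steps concrete, a minimal Python sketch is given below. It is not the robCompositions implementation: it uses the standard definition of the Aitchison distance as the Euclidean distance between centered log-ratio (CLR) transformed vectors, and the function names and counts are hypothetical. Because the CLR transform is invariant to per-sample scaling, converting pseudo-counted reads to RPM before the transform does not change the distance, so the sketch works on pseudo-counted reads directly.

```python
import math


def clr(counts, pseudocount=1):
    """Centered log-ratio transform of one sample after adding a pseudo-count."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(sample_a, sample_b, pseudocount=1):
    """Euclidean distance between the CLR-transformed samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same genera in the same order")
    clr_a = clr(sample_a, pseudocount)
    clr_b = clr(sample_b, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr_a, clr_b)))


# Hypothetical genus read counts for two samples (same genus order in both).
sample_1 = [1200, 0, 35, 400]
sample_2 = [900, 15, 0, 650]

if __name__ == "__main__":
    print(f"Aitchison distance: {aitchison_distance(sample_1, sample_2):.3f}")
```

A pairwise matrix of such distances is the kind of input that a Ward-linkage hierarchical clustering step would then operate on.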
4.6 Unclassified Read Analysis
The GC percent distributions of matrix (from matrix filtering), microbial, and remaining unclassified reads per sample were computed using FastQC72 and collated across samples with MultiQC.73

4.7 Analysis of Salmonella Culturability
Growth of Salmonella was determined using a real-time quantitative PCR method for the confirmation of Salmonella isolates for presumptive generic identification of foodborne Salmonella. Testing was performed in full accordance with the Bacteriological Analytical Manual (BAM) for Salmonella,74,75 an approach that is also AOAC-approved. All samples with positive results for Salmonella were classified as containing actively growing Salmonella. To compare culture results with those from total RNA sequencing, Salmonella RPM values were parsed from the genus-level microbe table (described in Section 4.4).
Two additional approaches were employed to examine Salmonella read mapping with a more sensitive tool and broader reference databases. Quality controlled matrix-filtered reads were aligned using Bowtie248 with the --very-sensitive-local preset to (1) an expanded collection of whole Salmonella genomes and (2) a curated growth gene reference for elongation factor Tu (ef-Tu). For results from both complete genome and ef-Tu gene alignments, the relative abundance (RPM) was computed as shown in Equation 1.
For whole genome alignments, a reference was constructed from 1,183 recently published Salmonella genomes49 in addition to the 264 Salmonella genomes extracted from the aforementioned NCBI RefSeq Complete collection (see Methods Section 4.4).
To construct a curated growth gene (ef-Tu) reference, gene sequences annotated in Salmonella genomes as “elongation factor Tu”, “EF-Tu” or “eftu” (case insensitive) were retrieved from OMXWare51 using its Python package. This query yielded 4,846 unique gene sequences from a total of 36,242 Salmonella genomes which were assembled or retrieved from the NCBI Sequence Read Archive or RefSeq Complete Sequences as previously described.51 The retrieved ef-Tu gene sequences were subsequently used to build a custom Bowtie248 reference. Read alignment was completed with the --very-sensitive-local preset.
The read counts for each sample were ranked and a Wilcoxon rank sum test was computed between the rank vectors of the 4 Salmonella-positive and 23 Salmonella-negative samples. The 4 samples with unknown Salmonella status were excluded from the rankings.
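The rank comparison can be illustrated with a small exact rank sum test in Python. This sketch makes several assumptions that the text does not specify: it computes a one-sided exact p-value (culture-positive samples ranking higher) by enumerating all possible group assignments, whereas the original analysis may have used a two-sided or approximate test. The function names and the toy values are hypothetical and do not reproduce the reported p-value.

```python
from itertools import combinations


def midranks(values):
    """Rank values in ascending order (1 = smallest), averaging ranks of ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_exact(positive, negative):
    """Return (rank sum of the positive group, one-sided exact p-value).

    The p-value is the fraction of all equally sized subsets of ranks whose
    sum is at least the observed rank sum of the positive group.
    """
    ranks = midranks(list(positive) + list(negative))
    n_positive = len(positive)
    observed = sum(ranks[:n_positive])
    subset_sums = [sum(subset) for subset in combinations(ranks, n_positive)]
    p_value = sum(s >= observed for s in subset_sums) / len(subset_sums)
    return observed, p_value


if __name__ == "__main__":
    # Toy read abundances: two culture-positive and four culture-negative samples.
    observed, p_value = rank_sum_exact([9, 10], [1, 2, 3, 4])
    print(f"rank sum = {observed}, one-sided exact p = {p_value:.4f}")
```

With 4 positive and 23 negative samples there are only 17,550 possible subsets, so exact enumeration remains inexpensive at the scale of this study.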
Point-biserial correlation coefficients (rpb) were calculated between Salmonella growth indicated by culture results (+1 and -1 for presence and absence, respectively) and observed relative abundance from total RNA sequencing results using the R package ltm.76 The point-biserial correlation is a special case of the Pearson correlation that is better suited for a binary variable, e.g., when Salmonella is reported as present or absent (a sample’s Salmonella status).

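Because the point-biserial coefficient is the Pearson correlation applied to a binary variable, it can be sketched in a few lines of Python. This is an illustration, not the ltm implementation: the function name and the example values are hypothetical, with culture status coded +1 and -1 as described above.

```python
import math


def point_biserial(status, abundance):
    """Pearson correlation between a binary status vector and abundances."""
    if len(status) != len(abundance):
        raise ValueError("status and abundance must have the same length")
    if len(set(status)) != 2:
        raise ValueError("status must contain exactly two distinct values")
    n = len(status)
    mean_s = sum(status) / n
    mean_a = sum(abundance) / n
    covariance = sum((s - mean_s) * (a - mean_a) for s, a in zip(status, abundance))
    var_s = sum((s - mean_s) ** 2 for s in status)
    var_a = sum((a - mean_a) ** 2 for a in abundance)
    if var_a == 0:
        raise ValueError("abundance must not be constant")
    return covariance / math.sqrt(var_s * var_a)


# Hypothetical data: culture status (+1 present, -1 absent) and genus RPM.
example_status = [1, 1, -1, -1]
example_rpm = [10.0, 12.0, 2.0, 4.0]

if __name__ == "__main__":
    print(f"rpb = {point_biserial(example_status, example_rpm):.3f}")
```

The coefficient does not depend on how the two groups are coded, as long as presence maps to the larger value, so +1/-1 and 1/0 coding give the same result.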
Data Availability:
All high protein powder (HPP) poultry meal sequences are available through the 100K Pathogen Genome Project (PRJNA186441) in the NCBI BioProject (see Supplementary Table 1 for a complete list of accession numbers).

Code Availability:
The pipeline and microbial or matrix references were constructed from publicly available tools and reference sequences as described in the Methods section. Automated usability of this pipeline is available through membership in the Consortium for Sequencing the Food Supply Chain.

Acknowledgements:
We would like to acknowledge the IBM Research OMXWare team for their data management support and availability for the retrieval and processing of microbial genomes. This research project was financially supported by the Consortium for Sequencing the Food Supply Chain. Funding for the total RNA sequencing of high protein powder factory ingredients was provided by Mars, Incorporated to B.C.W. with specific interest in metagenomics of the food microbiome.

Contributions:
KLB and NH conceived of the experimental design, developed the approach, completed and oversaw the experiments, performed analyses, and wrote the paper; DC, SE, MK, BK, MD, RP, HK, ES developed the approach, analyzed data, and revised the manuscript; BCH completed nucleic acid extraction method development and sequencing library construction, and contributed to data analysis and writing; NK coordinated sample collection and processing, nucleic acid extraction and contributed to writing; RB and PM conceived of the experimental design, developed the approach, and reviewed the paper; BG contributed to the experimental design, developed the approach, and wrote the paper; GD, CHM, SP, AQ participated in the conception of the experimental design and in the review of the manuscript; LP conceived of the experiment, contributed to the data analysis, and wrote the paper; JHK conceived of the experiment, developed the approach, and wrote the paper; BCW conceived of the experimental design, developed the approach, oversaw the experiments, performed analyses, and wrote the paper.

Competing Interests:
The authors were employed by private or academic organizations as described in the author affiliations at the time this work was completed. IBM Corporation, Mars Incorporated, and Bio-Rad Laboratories are members of the Consortium for Sequencing the Food Supply Chain. The authors declare no other competing interests.

Supplementary information is available at npj Science of Food’s website
REFERENCES:
1. Kovac, J., Bakker, H. den, Carroll, L. M. & Wiedmann, M. Precision food safety: A systems approach to food safety facilitated by genomics tools. TrAC Trends Anal. Chem. (2017). doi:10.1016/j.trac.2017.06.001
2. Weimer, B. C. et al. Defining the food microbiome for authentication, safety, and process management. IBM J. Res. Dev. 60, 1 (2016).
3. Walsh, A. M. et al. Microbial Succession and Flavor Production in the Fermented Dairy Beverage Kefir. mSystems 1, (2016).
4. Walsh, A. M. et al. Species classifier choice is a key consideration when analysing low-complexity food microbiome data. Microbiome 6, 50 (2018).
5. Duru, I. C. et al. Metagenomic and metatranscriptomic analysis of the microbial community in Swiss-type Maasdam cheese during ripening. Int. J. Food Microbiol. 281, 10–22 (2018).
6. Martins, G. et al. Grape berry bacterial microbiota: Impact of the ripening process and the farming system. Int. J. Food Microbiol. 158, 93–100 (2012).
7. Abdelfattah, A., Wisniewski, M., Droby, S. & Schena, L. Spatial and compositional variation in the fungal communities of organic and conventionally grown apple fruit at the consumer point-of-purchase. Hortic. Res. 3, 16047 (2016).
8. Williams, A. G., Choi, S.-C. & Banks, J. M. Variability of the species and strain phenotype composition of the non-starter lactic acid bacterial population of cheddar cheese manufactured in a commercial creamery. Food Res. Int. 35, 483–493 (2002).
9. Weimer, B. C. 100K Pathogen Genome Project. Genome Announc. 5, e00594-17 (2017).
10. Emond-Rheault, J.-G. et al. A Syst-OMICS Approach to Ensuring Food Safety and Reducing the Economic Burden of Salmonellosis. Front. Microbiol. 8, 996 (2017).
11. Kaufman, J. H. et al. Insular microbiogeography. (2017).
12. Bashiardes, S., Zilberman-Schapira, G. & Elinav, E. Use of Metatranscriptomics in Microbiome Research. Bioinform. Biol. Insights 10, 19–25 (2016).
13. McGrath, K. C. et al. Isolation and analysis of mRNA from environmental microbial communities. J. Microbiol. Methods 75, 172–176 (2008).
14. Cottier, F. et al. Advantages of meta-total RNA sequencing (MeTRS) over shotgun metagenomics and amplicon-based sequencing in the profiling of complex microbial communities. npj Biofilms Microbiomes 4, 2 (2018).
15. Macklaim, J. M. et al. Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis. Microbiome 1, 12 (2013).
16. Haiminen, N. et al. Food authentication from shotgun sequencing reads with an application on high protein powders. npj Sci. Food 3, (2019).
17. Lakin, S. M. et al. MEGARes: an antimicrobial resistance database for high throughput sequencing.
18. Noyes, N. R. et al. Resistome diversity in cattle and the environment decreases during beef production. Elife 5, e13195 (2016).
19. Ganesan, B., Dobrowolski, P. & Weimer, B. C. Identification of the Leucine-to-2-Methylbutyric Acid Catabolic Pathway of Lactococcus lactis. Appl. Environ. Microbiol. 72, 4264–4273 (2006).
20. Ganesan, B., Seefeldt, K., Koka, R. C., Dias, B. & Weimer, B. C. Monocarboxylic acid production by lactococci and lactobacilli. Int. Dairy J. 14, 237–246 (2004).
21. Ganesan, B., Seefeldt, K. & Weimer, B. C. Fatty Acid Production from Amino Acids and α-Keto Acids by Brevibacterium linens BL2. Appl. Environ. Microbiol. 70, 6385–6393 (2004).
22. Ganesan, B., Stuart, M. R. & Weimer, B. C. Carbohydrate Starvation Causes a Metabolically Active but Nonculturable State in Lactococcus lactis. Appl. Environ. Microbiol. 73, 2498–2512 (2007).
23. Ganesan, B. et al. Probiotic bacteria survive in Cheddar cheese and modify populations of other lactic acid bacteria. J. Appl. Microbiol. 116, 1642–1656 (2014).
24. Ganesan, B. & Weimer, B. C. Cheese: chemistry, physics, and microbiology. (Elsevier, 2004).
25. Sheflin, A. M., Melby, C. L., Carbonero, F. & Weir, T. L. Linking dietary patterns with gut microbial composition and function. Gut Microbes 8, (2017).
26. McDonald, D. et al. American Gut: an Open Platform for Citizen Science Microbiome Research. mSystems 3, e00031-18 (2018).
27. Clemente, J. C., Ursell, L. K., Parfrey, L. W. & Knight, R. The Impact of the Gut Microbiota on Human Health: An Integrative View. Cell 148, 1258–1270 (2012).
28. Richards, J. L., Yap, Y. A., McLeod, K. H., Mackay, C. R. & Mariño, E. Dietary metabolites and the gut microbiota: an alternative approach to control inflammatory and autoimmune diseases. Clin Trans Immunol 5, e82 (2016).
29. Yang, X. et al. Use of Metagenomic Shotgun Sequencing Technology To Detect Foodborne Pathogens within the Microbiome of the Beef Production Chain. Appl Env. Microbiol 82, 2433–2443 (2016).
30. Hofacre, C. L. et al. Characterization of antibiotic-resistant bacteria in rendered animal products. Avian Dis. 45, 953–61
31. Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V. & Egozcue, J. J. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 8, 2224 (2017).
32. Gloor, G. B. & Reid, G. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can. J. Microbiol. 62, 692–703 (2016).
33. Langelier, C. et al. Integrating host response and unbiased microbe detection for lower respiratory tract infection diagnosis in critically ill adults. Proc. Natl. Acad. Sci. U. S. A. 115, E12353–E12362 (2018).
34. Ilott, N. E. et al. Defining the microbial transcriptional response to colitis through integrated host and microbiome profiling. ISME J. 10, 2389–2404 (2016).
35. Ripp, F. et al. All-Food-Seq (AFS): a quantifiable screen for species in biological samples by deep DNA sequencing. BMC Genomics 15, 639 (2014).
36. Lee, A. Y., Lee, C. S. & Gelder, R. N. Van. Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations. BMC Bioinformatics 17, 292 (2016).
37. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
38. Haiminen, N. et al. Food authentication from shotgun sequencing reads with an application on high protein powders. npj Sci. Food in press (2019).
39. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
40. Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060 (2009).
41. Kyrpides, N. C. et al. Genomic Encyclopedia of Bacteria and Archaea: Sequencing a Myriad of Type Strains. PLoS Biol. 12, e1001920 (2014).
42. Kyrpides, N. C., Eloe-Fadrosh, E. A. & Ivanova, N. N. Microbiome Data Science: Understanding Our Microbial Planet. Trends Microbiol. 24, 425–427 (2016).
43. Thompson, L. R. et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551, 457 (2017).
44. Nayfach, S. & Pollard, K. S. Toward Accurate and Quantitative Comparative Metagenomics. Cell 166, 1103–1116 (2016).
45. Satopää, V., Albrecht, J., Irwin, D. & Raghavan, B. Finding a ‘Kneedle’ in a Haystack: Detecting Knee Points in System Behavior.
46. Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J. A. & Pawlowsky-Glahn, V. Logratio Analysis and Compositional Distance. Math. Geol. 32, 271–275 (2000).
47. Di Palma, M. A. & Gallo, M. A co-median approach to detect compositional outliers. J. Appl. Stat. 43, 2348–2362 (2016).
48. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–9 (2012).
49. Kong, N. et al. Draft Genome Sequences of 1,183 Salmonella Strains from the 100K Pathogen Genome Project. Genome Announc. 5, (2017).
50. Tubulekas, I. & Hughes, D. A Single Amino Acid Substitution in Elongation Factor Tu Disrupts Interaction between the Ternary Complex and the Ribosome. J. Bacteriol. 240–250 (1993).
51. Seabolt, E. E. et al. OMXWare, A Cloud-Based Platform for Studying Microbial Life at Scale. arXiv 1911.02095, (2019).
52. Zelezniak, A. et al. Metabolic dependencies drive species co-occurrence in diverse microbial communities. Proc. Natl. Acad. Sci. 112, 6449–6454 (2015).
53. Jones, M. B. et al. Library preparation methodology can influence genomic and functional predictions in human microbiome research. Proc Natl Acad Sci U S A (2015). doi:10.1073/pnas.1519288112
54. Pollock, J., Glendinning, L., Wisedchanwet, T. & Watson, M. The madness of microbiome: Attempting to find consensus ‘best practice’ for 16S microbiome studies. Appl. Environ. Microbiol. AEM.02627-17 (2018). doi:10.1128/AEM.02627-17
55. Browne, H. P. et al. Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation. Nature 533, 543–546 (2016).
56. Eilers, H., Pernthaler, J., Glöckner, F. O. & Amann, R. Culturability and In situ abundance of pelagic bacteria from the North Sea. Appl. Environ. Microbiol. 66, 3044–51 (2000).
57. Hinchliff, C. E. et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc Natl Acad Sci U S A 112, 12764–12769 (2015).
58. Knight, R. et al. Best practices for analysing microbiomes. Nat. Rev. Microbiol. 16, 410–422 (2018).
59. Weis, A. M. et al. Genomic Comparison of Campylobacter spp. and Their Potential for Zoonotic Transmission between Birds, Primates, and Livestock. Appl. Environ. Microbiol. 82, 7165–7175 (2016).
60. Miller, B. et al. A novel, single-tube enzymatic fragmentation and library construction method enables fast turnaround times and improved data quality for microbial whole-genome sequencing. Kapa Biosyst. Appl. Note 1–8 (2015). doi:10.13140/RG.2.1.4534.3440
61. Lüdeke, C. H. M., Kong, N., Weimer, B. C., Fischer, M. & Jones, J. L. Complete genome sequences of a clinical isolate and an environmental isolate of Vibrio parahaemolyticus. Genome Announc. 3, e00216-15 (2015).
62. Jeannotte, R. et al. High-Throughput Analysis of Foodborne Bacterial Genomic DNA Using Agilent 2200 TapeStation and Genomic DNA ScreenTape System. Agil. Appl. Note 1–8 (2015). doi:10.6084/m9.figshare.1372504
63. Arabyan, N. et al. Salmonella Degrades the Host Glycocalyx Leading to Altered Infection and Glycan Remodeling. Sci. Rep. 6, 1–11 (2016).
64. Kong, N. et al. Draft Genome Sequences of 1,183 Salmonella Strains from the 100K Pathogen Genome Project. Genome Announc. 5, e00518-17 (2017).
65. Krueger, F. TrimGalore. (2018).
66. Mukherjee, S., Huntemann, M., Ivanova, N., Kyrpides, N. C. & Pati, A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Stand. Genomic Sci. 10, 18 (2015).
67. Homer, N. DWGSIM. (2011).
68. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
69. Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 13, 1028–1040 (2006).
70. Wood, D. Kraken’s operating manual.
71. Templ, M., Hron, K. & Filzmoser, P. robCompositions: An R-package for Robust Statistical Analysis of Compositional Data. in Compositional Data Analysis 341–355 (John Wiley & Sons, Ltd, 2011). doi:10.1002/9781119976462.ch25
72. Andrews, S. FastQC.
73. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
74. Andrews, W. H., Wang, H., Jacobson, A. & Hammack, T. Bacteriological Analytical Manual (BAM) Chapter 5: Salmonella. in Bacteriological Analytical Manual (U.S. Food and Drug Administration, 2018).
75. Grim, C. J. et al. High-Resolution Microbiome Profiling for Detection and Tracking of Salmonella enterica. Front. Microbiol. 8, 1587 (2017).
76. Rizopoulos, D. ltm: An R Package for Latent Variable Modeling and Item Response Theory Analyses. J. Stat. Softw. 17, 1–25 (2006).
FIGURE and TABLE LEGENDS: (corresponding to their order at end of merged document)

Figure 1: Pipeline description of bioinformatic steps applied to high protein powder metatranscriptome samples. Black arrows indicate data flow and blue boxes describe outputs from the pipeline.

Table 1: Accuracy of microbial identification using in silico constructed Simulated Food Mixtures with expected food matrix and microbial sequences.

Figure 2A: Alpha diversity (number of genera) for all (n = 31) high protein powder metatranscriptomes is compared to total number of sequenced reads for a range of in silico subsampled sequencing depths. The dashed vertical line indicates the median elbow (at approximately 67 million reads).
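
As a rough illustration of how an elbow in a saturation curve can be located, the sketch below uses a simplified form of the difference-curve idea behind Kneedle (ref. 45): both axes are rescaled to [0, 1] and the elbow is taken where the curve lies farthest above the diagonal. It omits Kneedle's smoothing and threshold steps, assumes an increasing, non-flat curve, and uses invented depths and genus counts; the function name `knee_point` is hypothetical and this is not presented as the authors' implementation.

```python
from statistics import median

def knee_point(depths, genera_counts):
    """Return the depth at which the rescaled curve is farthest above the diagonal.

    Simplified difference-curve heuristic (no smoothing, no threshold).
    Assumes the curve is increasing and not flat, so both ranges are non-zero.
    """
    x_min, x_max = min(depths), max(depths)
    y_min, y_max = min(genera_counts), max(genera_counts)
    best_depth, best_gap = depths[0], float("-inf")
    for x, y in zip(depths, genera_counts):
        # Gap between the normalized curve and the diagonal y = x.
        gap = (y - y_min) / (y_max - y_min) - (x - x_min) / (x_max - x_min)
        if gap > best_gap:
            best_depth, best_gap = x, gap
    return best_depth

# Hypothetical subsampling depths (millions of reads) and genus counts per sample.
depths = [1, 2, 4, 8, 16, 32, 64, 128]
curves = {
    "sample_a": [10, 18, 30, 44, 55, 60, 62, 63],
    "sample_b": [12, 20, 29, 40, 52, 61, 64, 65],
    "sample_c": [8, 15, 25, 38, 50, 58, 61, 62],
}

# One elbow per sample, then the median elbow across samples.
elbows = {name: knee_point(depths, counts) for name, counts in curves.items()}
median_elbow = median(elbows.values())
```

Taking the median across per-sample elbows mirrors the "median elbow" wording of the legend; the choice of heuristic for each individual elbow is the assumption here.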

Figure 2B: Hierarchical clustering of Aitchison distance values of poultry meal samples based on microbial composition. Samples were received from Supplier A (blue and red) and Supplier B (green). Matrix-contaminated samples are additionally marked in red.
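
For readers unfamiliar with the Aitchison distance (ref. 46), it is the Euclidean distance between centered log-ratio (CLR) transformed compositions. The sketch below is a minimal pure-Python version under stated assumptions: the function names are hypothetical, the example abundances are invented, and zeros are handled with a small pseudocount, which is only one of several possible zero-replacement strategies and may differ from what was used for this figure.

```python
import math

def clr(values, pseudocount=0.5):
    """Centered log-ratio transform of one sample's genus abundances.

    The pseudocount (an assumption of this sketch) avoids log(0).
    """
    logs = [math.log(v + pseudocount) for v in values]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_distance(sample_a, sample_b, pseudocount=0.5):
    """Euclidean distance between the CLR-transformed samples."""
    clr_a = clr(sample_a, pseudocount)
    clr_b = clr(sample_b, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr_a, clr_b)))

# Hypothetical genus-level abundances for three samples (same genus order).
samples = {
    "S1": [120.0, 30.0, 0.0, 5.0],
    "S2": [100.0, 35.0, 1.0, 4.0],
    "S3": [10.0, 200.0, 50.0, 0.0],
}

# Pairwise distance matrix; a matrix like this is the input to hierarchical clustering.
names = sorted(samples)
distances = {
    (p, q): aitchison_distance(samples[p], samples[q])
    for p in names for q in names
}
```

Because the CLR transform removes the overall scale of each sample, two samples with the same proportions but different sequencing depths have a distance of zero when no pseudocount is applied.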

Figure 3A: Phylogram of the 65 microbial genera present in all samples with RPM > 0.1.

Figure 3B: Phylogram of all microbes observed in any sample. Log of the median RPM value across samples is indicated. Grey indicates a median RPM value of 0.

Figure 4: Heatmap (log10-scale) of high protein powder microbial composition and relative abundance (RPM) where absence (RPM < 0.1) is indicated in grey. Genera are ordered by summed abundance across samples. Samples were received from Supplier A (blue) and Supplier B (green). Red stars indicate matrix-contaminated samples (from Supplier A).

Figure 5A: All identified microbial genera are plotted with median value and median absolute deviation (MAD) of RPM abundance. Genera with MAD > 5 are labeled with the genus name.
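
The median and MAD summary used in this legend can be illustrated with a short sketch. The genus names and RPM values below are invented, the helper name is hypothetical, and the MAD is computed as the raw median of absolute deviations from the median (no consistency scaling factor), which is an assumption about the convention used.

```python
from statistics import median

def median_and_mad(values):
    """Return (median, median absolute deviation) for one genus across samples."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med, mad

# Hypothetical RPM values per genus across five samples.
rpm_by_genus = {
    "GenusA": [0.5, 12.0, 30.0, 4.0, 18.0],
    "GenusB": [1.0, 1.2, 0.9, 1.1, 1.0],
}

summary = {genus: median_and_mad(rpm) for genus, rpm in rpm_by_genus.items()}

# Genera that would be labeled under the MAD > 5 rule from the legend.
labeled = [genus for genus, (_, mad) in summary.items() if mad > 5]
```

With these invented values, GenusA has a median of 12 and a MAD of 8 and would be labeled, while GenusB varies too little to cross the threshold.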

Figure 5B: Heatmap (log10-scale) of ten microbial genera with the largest median absolute deviation (MAD) across samples. Genera are ordered by decreasing MAD from top to bottom. Samples were received from Supplier A (blue) and Supplier B (green). Red stars indicate matrix-contaminated samples (from Supplier A).

Figure 6A: Relative abundance of microbes with high relevance to food safety and quality from high protein powder total RNA sequenced microbiomes. Width of violin plot indicates density of samples with relative abundance at that value. Observation threshold of RPM = 0.1 is indicated with the horizontal black line.

Figure 6B: Foodborne microbe relative abundances are shown across samples of high protein powder total RNA sequenced samples.

Figure 7: Salmonella culturability status and high-throughput sequencing read abundance (RPM) from k-mer classification to NCBI Microbial RefSeq Complete (A), from alignments to 1,447 Salmonella genomes (B), and from alignments to 4,846 EF-Tu gene sequences (C). Salmonella presence (red) indicates a culture-positive result, absence (green) indicates a culture-negative result, and no record (black) indicates samples for which no culture test was completed.

Table 2: The ranks for Salmonella-positive samples and the associated p-values from the Wilcoxon rank sum test are shown for high-throughput sequencing read abundance (RPM) for multiple analyses: k-mer classification to NCBI Microbial RefSeq Complete (left), alignments to 1,447 Salmonella genomes (middle), and alignments to 4,846 EF-Tu gene sequences (right). The corresponding Salmonella relative abundances are shown in Figure 7A–C.

Figure 8: Salmonella status correlations with genus relative abundances. Only those genera with absolute value of the correlation coefficient > 0.5 are shown. Positive and negative correlations are indicated in grey and blue, respectively.
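
One common way to correlate a binary status with a continuous abundance is the point-biserial correlation, which equals the Pearson correlation when the status is coded as 0/1. The sketch below applies it with the |r| > 0.5 filter from the legend. Whether this is the exact coefficient used for Figure 8 is an assumption, and the status vector, genus names, and abundances are invented for illustration.

```python
import math

def point_biserial(status, abundance):
    """Pearson correlation between a 0/1 status vector and abundances."""
    n = len(status)
    mean_s = sum(status) / n
    mean_a = sum(abundance) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(status, abundance))
    var_s = sum((s - mean_s) ** 2 for s in status)
    var_a = sum((a - mean_a) ** 2 for a in abundance)
    return cov / math.sqrt(var_s * var_a)

# Hypothetical culture status (1 = positive, 0 = negative) for four samples.
salmonella_status = [0, 0, 1, 1]

# Hypothetical relative abundances (RPM) per genus, in the same sample order.
abundance_by_genus = {
    "GenusX": [1.0, 2.0, 3.0, 4.0],
    "GenusY": [4.0, 3.0, 2.0, 1.0],
    "GenusZ": [1.0, 3.0, 2.0, 2.0],
}

correlations = {
    genus: point_biserial(salmonella_status, rpm)
    for genus, rpm in abundance_by_genus.items()
}

# Keep only genera whose absolute correlation exceeds 0.5, as in the legend.
shown = {genus: r for genus, r in correlations.items() if abs(r) > 0.5}
```

In this toy example GenusX is positively correlated and GenusY negatively correlated with status, while GenusZ has a correlation of zero and is filtered out.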

SUPPLEMENTAL INFORMATION:
Supplemental Figures (.pdf) - Supplemental Figures S1–S5
Supplemental Table 1 (.xlsx) - Sample Descriptions
Supplemental Table 2 (.xlsx) - Matrix Filtering Genomes
Supplemental Table 3 (.xlsx) - Simulated Food Mixtures
Supplemental Table 4 (.xlsx) - Microbial Genera

[Figure 1 (image). Box labels recovered from the flowchart: RNA sequencing Reads (FASTQ); Sequence Quality Control (TrimGalore, PhiX Filtering); Matrix Filtering (Classification with Common Food Ingredient Database); Food Matrix Reads; Microbial Identification (Classification with microbial RefSeq Complete); Unclassified Reads; Microbial Quantification (Genus-level relative abundance, Eq. 1); Relative Abundance (RPM); Comparative Statistics (Intrasample, Intersample, Relative change).]
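
Table 1: Microbial Identification Accuracy from Simulated Food Microbiome Mixtures

For each simulated mixture, results are given with and without matrix filtering, as the number of genera (# Genera) and the number of genus-level reads (Genus Reads).

| | Mixture 1, With Matrix Filtering, # Genera | Mixture 1, With Matrix Filtering, Genus Reads | Mixture 1, No Matrix Filtering, # Genera | Mixture 1, No Matrix Filtering, Genus Reads | Mixture 2, With Matrix Filtering, # Genera | Mixture 2, With Matrix Filtering, Genus Reads | Mixture 2, No Matrix Filtering, # Genera | Mixture 2, No Matrix Filtering, Genus Reads |
|---|---|---|---|---|---|---|---|---|
| Bacteria in Simulated Mixture (Expected Content) | 14 | 15,000 | 14 | 15,000 | 14 | 15,000 | 14 | 15,000 |
| Observed Microbial Content: Bacteria | 18 | 13,517 | 34 | 13,700 | 15 | 13,551 | 33 | 13,999 |
| Observed Microbial Content: Viruses | 0 | 0 | 9 | 563 | 0 | 0 | 4 | 328 |
| Observed Microbial Content: Archaea | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 3 |
| Observed Microbial Content: Eukaryota | 0 | 0 | 4 | 104 | 0 | 0 | 4 | 799 |
| Total Observed | 18 | 13,517 | 48 | 14,368 | 15 | 13,551 | 42 | 15,129 |
| True Positives (as a % of total observed) | 14 (78%) | 13,511 (99.96%) | 14 (29%) | 13,571 (94.45%) | 14 (93%) | 13,548 (99.98%) | 14 (33%) | 13,623 (90.05%) |
| False Positives (as a % of total observed) | 4 (22%) | 6 (0.04%) | 34 (71%) | 797 (5.55%) | 1 (7%) | 3 (0.02%) | 28 (67%) | 1,506 (9.95%) |

False Positives Removed with Matrix Filtering (as a % of false positives without filtering):
Simulated Mixture 1: 30 genera (88.2%), 791 genus reads (99.2%)
Simulated Mixture 2: 27 genera (96.4%), 1,503 genus reads (99.8%)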
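
The percentages in Table 1 follow directly from the counts, and the short check below recomputes a few of them for Simulated Mixture 1. The counts are copied from the table; the helper `pct` and the variable names are introduced here only for illustration.

```python
def pct(part, whole, digits=2):
    """Percentage of `part` in `whole`, rounded to `digits` decimal places."""
    return round(100.0 * part / whole, digits)

# Genus-level read counts for Simulated Mixture 1 (from Table 1).
total_reads_filtered, tp_reads_filtered, fp_reads_filtered = 13517, 13511, 6
total_reads_unfiltered, tp_reads_unfiltered, fp_reads_unfiltered = 14368, 13571, 797

# True and false positives as a percentage of total observed reads.
tp_pct_filtered = pct(tp_reads_filtered, total_reads_filtered)        # 99.96
fp_pct_unfiltered = pct(fp_reads_unfiltered, total_reads_unfiltered)  # 5.55

# False positives removed by matrix filtering, relative to the unfiltered run.
fp_genera_unfiltered, fp_genera_filtered = 34, 4
genera_removed = fp_genera_unfiltered - fp_genera_filtered            # 30
reads_removed = fp_reads_unfiltered - fp_reads_filtered               # 791
genera_removed_pct = pct(genera_removed, fp_genera_unfiltered, 1)     # 88.2
reads_removed_pct = pct(reads_removed, fp_reads_unfiltered, 1)        # 99.2
```

The same arithmetic applied to Simulated Mixture 2 (28 versus 1 false-positive genera, 1,506 versus 3 false-positive reads) gives 96.4% and 99.8%.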

| Salmonella-positive sample | k-mer Classification | Whole Genome Alignment | EF-Tu Alignment |
|---|---|---|---|
| MFMB-04 | 8th | 10th | 1st |
| MFMB-20 | 9th | 9th | 4th |
| MFMB-38 | 20th | 3rd | 21st |
| MFMB-41 | 30th | 6th | 28th |
| Rank sum test p-value | p=0.86 | p=0.06 | p=0.56 |