biology
Article
Literature on Applied Machine Learning in Metagenomic Classification: A Scoping Review
Petar Tonkovic 1,*, Slobodan Kalajdziski 1, Eftim Zdravevski 1, Petre Lameski 1, Roberto Corizzo 2, Ivan Miguel Pires 3,4,5, Nuno M. Garcia 3, Tatjana Loncar-Turukalo 6 and Vladimir Trajkovik 1
1 Faculty of Computer Science and Engineering, Saints Cyril and Methodius University, 1000 Skopje, Macedonia; slobodan.kalajdziski@finki.ukim.mk (S.K.); eftim.zdravevski@finki.ukim.mk (E.Z.); petre.lameski@finki.ukim.mk (P.L.); vladimir.trajkovik@finki.ukim.mk (V.T.)
2 Department of Computer Science, American University, Washington, DC 20016, USA; rcorizzo@american.edu
3 Instituto de Telecomunicações, Universidade da Beira Interior, 6200-001 Covilhã, Portugal; impires@it.ubi.pt (I.M.P.); ngarcia@di.ubi.pt (N.M.G.)
4 Computer Science Department, Polytechnic Institute of Viseu, 3504-510 Viseu, Portugal
5 Health Sciences Research Unit: Nursing, School of Health, Polytechnic Institute of Viseu, 3504-510 Viseu, Portugal
6 Faculty of Technical Sciences, University of Novi Sad, 21102 Novi Sad, Serbia; turukalo@uns.ac.rs
* Correspondence: petar.tonkovikj@students.finki.ukim.mk
Received: 25 October 2020; Accepted: 3 December 2020; Published: 9 December 2020


Simple Summary: Technological advancements have led to modern DNA sequencing methods, capable of generating large amounts of data describing the microorganisms that live in samples taken from the environment. Metagenomics, the field that studies the different genomes within these samples, is becoming increasingly popular, as it has many real-world applications, such as the discovery of new antibiotics, personalized medicine, forensics, and many more. From a computer science point of view, it is interesting to see how these large volumes of data can be processed efficiently to accurately identify (classify) the microorganisms from the input DNA data. This scoping review aims to give an insight into the existing state-of-the-art computational methods for processing metagenomic data through the prism of machine learning, data science, and big data. We provide an overview of the state-of-the-art metagenomic classification methods, as well as the challenges researchers face when tackling this complex problem. The end goal of this review is to help researchers stay up to date with current trends and identify opportunities for further research and improvement.
Abstract:
Applied machine learning in bioinformatics is growing as computer science slowly invades
all research spheres. With the arrival of modern next-generation DNA sequencing algorithms,
metagenomics is becoming an increasingly interesting research field as it finds countless practical
applications exploiting the vast amounts of generated data. This study aims to scope the scientific
literature in the field of metagenomic classification in the time interval 2008–2019 and provide
an evolutionary timeline of data processing and machine learning in this field. This study follows the
scoping review methodology and PRISMA guidelines to identify and process the available literature.
Natural Language Processing (NLP) is deployed to ensure efficient and exhaustive search of the
literary corpus of three large digital libraries: IEEE, PubMed, and Springer. The search is based on
keywords and properties looked up using the digital libraries’ search engines. The scoping review
results reveal an increasing number of research papers related to metagenomic classification over the
past decade. The research is mainly focused on metagenomic classifiers, identifying scope specific
metrics for model evaluation, data set sanitization, and dimensionality reduction. Out of all of these
subproblems, data preprocessing is the least researched with considerable potential for improvement.
Biology 2020,9, 453; doi:10.3390/biology9120453 www.mdpi.com/journal/biology
Keywords: metagenomics; scoping review; classification; data preprocessing
1. Introduction
Metagenomics is becoming an increasingly popular field in bioinformatics since with the evolution
of technology and machine learning models, we are able to create increasingly more competent models
to tackle the problems of DNA sequencing and genome classification. The genome is defined as the full
genetic information of an organism, and genomics deals with obtaining the genome from a cultivated
sample of a given organism. In contrast, metagenomics deals with samples from the environment
that likely contain many organisms. The goal in this case is to analyze the different genomes within
this environmental sample. Over the years, many Bacterial Artificial Chromosome (BAC) libraries have been gathered, even ones that can be used to sequence the entire human genome [1]; however,
we were only able to tackle one genome at a time. With technology becoming more sophisticated,
new more precise DNA sequencing techniques have been developed, and the computation power of
modern computers has greatly increased. As a result of the latter, we can now process much larger
quantities of data and train more complex machine learning models that were previously not feasible
due to hardware limitations. This opens the gates for metagenomics to be one of the most trending
topics in Big Data, as it can be used extensively in medicine. Such exemplary applications are in the
identification of novel biocatalysts and the discovery of new antibiotics [2], as well as personalized medicine [3–5], bioremediation of industrial, agricultural, and domestic wastes [6,7], resulting in a reduction of environmental pollution, as well as forensics [8].
One example of a recent initiative in the field of forensics is the Critical Assessment of Massive Data
Analysis (CAMDA) [9] Metagenomic Forensics Challenge, which encourages researchers to try and construct multi-source microbiome fingerprints and predict the geographical origin of mystery samples, given a large data set of city microbiome profiles in a global context (for example, the MetaSUB data set [10]). An efficient and reliable model that can accurately determine the source of a microbiome
fingerprint can be a valuable tool in forensics as it can allow us to answer many questions, for example:
Where has a given individual been in the past six months given the microbiological sample obtained
from the skin? However, to do this, we would have to deal with challenges, each of which offers much
room for research on different parts of the problem:
1. Data science: How do we preprocess the data and deal with large quantities of unknown samples in the training set (erroneous data, unclassified DNA sequences, etc.)?
2. Big Data: How do we design a data pipeline to efficiently process exorbitant amounts of data samples, while not suffering extreme performance losses (model complexity vs. performance tradeoff)?
3. Machine learning: Which are the state-of-the-art models that can be used for metagenomic classification, and how can they be specialized towards the metagenomics domain?
These are all trending topics in the computer science research world, which is why it is to be
expected that there is much related work out there. The goal of this study is to conduct a literature
scoping review and discover work that can give potential answers to the questions presented above
in terms of the CAMDA challenges. Ideally, we would like to analyze the trends from the past five
years in data processing and metagenomic classification models to identify the current state-of-the-art
approaches and discover their drawbacks. The ultimate goal is to narrow down the interesting research
topics in this field for the foreseeable future. We would like to answer the following questions:
- What are the research trends about metagenomics in general, and what is being researched in terms of the CAMDA MetaSUB challenges?
- How do we efficiently deal with a large number of unknown samples within the data set from a data processing viewpoint?
- What are the current machine learning models that are suitable to be applied for the annual CAMDA MetaSUB challenges?
Based on these questions, we manually analyzed 11,764 articles, where Europe is the region with
the most studies in the field of metagenomic sequencing with major collaborations with the United
States and China. According to the different years of publication, “genome sequencing evaluation”
is the most relevant search keyword.
The remainder of the review is organized as follows: Section 2 presents the methodology. The results are presented in Section 3 and discussed in Section 4. Section 5 concludes the study.
2. Methods
This study adopts a scoping review method to identify and process the available literature.
To this end, a Natural Language Processing (NLP) toolkit [11] is used to search the literature corpus. The toolkit follows the “Preferred Reporting Items for Systematic reviews and Meta-Analyses” (PRISMA) methodological framework [12]. The goal of this methodology is to identify and gather
relevant articles based on certain criteria using some search keywords, sanitize the results by removing
duplicates and other irrelevant or incomplete articles from the result set, and pick the articles
that should undergo thorough screening after performing a qualitative analysis on the sanitized
result set. The NLP toolkit automates this process and additionally provides a visual summary to
report the results. This allows us to follow a methodological framework for conducting a scoping
study [13] consisting of five stages: identification of the research question, identification of relevant
studies, study selection, charting the data, and collating, summarizing, and reporting the results
(Supplementary Materials Data S1).
2.1. Identification of Relevant Studies
In this stage, we specify the search parameters for the scoping study: Which digital libraries does
our literature corpus consist of; what are the keywords we are looking for in an article; what is the
publication time interval we are interested in, as well as other parameters of the NLP toolkit [11,14]?
Currently, the toolkit indexes the following digital libraries: IEEE Xplore, Springer, and PubMed.
All PubMed articles that match the given search criteria (i.e., a keyword) are analyzed. IEEE Xplore
results include the top 2000 articles that match the given criteria, sorted by relevance determined
by IEEE Xplore. For the Springer digital library, the search for each keyword separately is limited
to 1000 articles or 50 pages with results (whichever comes first) sorted by relevance determined by
Springer. The parameters that the toolkit requests as the input are the following:
Keywords are search terms or phrases that are used to query a digital library. Duplicates that
might occur in the results are removed in a later phase.
Properties are words or phrases that are searched for in the title, abstract, or keywords section of the articles identified with the keywords.
Property groups are thematically, semantically, or otherwise grouped properties for a more
comprehensive presentation of the results.
Start year indicates the starting year of publishing (inclusive) for the papers to be included in
the study.
End year is the last year of publishing (inclusive) to be considered in the study.
Minimum relevant properties is a number denoting the minimum number of properties that an
article has to contain to be considered as relevant.
For this scoping review, we would like to give a summary of the research work done in the field of
metagenomic classification and data processing in the past ten years. The input parameters provided
to the toolkit in our scoping review are shown in Table 1.
Table 1. NLP toolkit input parameters for the metagenomic sequencing scoping review.
Keywords: CAMDA, MetaSUB, metagenomic classification, preprocessing genome data, deep learning, lightweight model, fast genome comparison, k-mers, Burrows–Wheeler transform, genome sequencing evaluation, model benchmarking
Property groups (properties):
- metagenomics (metagenomic sequence/DNA sequence, species/organism/bacteria, metagenomic subsequence/k-mer, metagenomic classification/metagenomic sequence classification/metagenomic sequencing, genome/genetic material/DNA genome, nucleotides/monomer, DNA/deoxyribonucleic acid, CAMDA/CAMDA Pub, MetaSUB/forensics challenge)
- machine learning (classification/sequencing/categorization, deep learning/deep model/DL, neural network/deep neural network/DNN, deep forest, Kraken)
- data preprocessing (unknown data/unknown sequence/incomplete data, noise/white noise/error, data reduction, features/feature extraction, Big Data, data cleaning)
- model evaluation (performance, F1 measure, false positive/false-positive, accuracy, benchmarking, model validation, T-test/student test/statistic test)
Start year: 2008
End year: 2019
Minimum relevant properties: 3
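For concreteness, the Table 1 parameters could be captured in a configuration structure like the following sketch. The dictionary layout and the abridged property lists are illustrative only; the paper does not specify the toolkit's actual configuration format.

```python
# Hypothetical configuration mirroring Table 1; not the NLP toolkit's
# actual input format. Property lists are abridged for brevity.
SEARCH_CONFIG = {
    "keywords": [
        "CAMDA", "MetaSUB", "metagenomic classification",
        "preprocessing genome data", "deep learning", "lightweight model",
        "fast genome comparison", "k-mers", "Burrows-Wheeler transform",
        "genome sequencing evaluation", "model benchmarking",
    ],
    "property_groups": {
        "metagenomics": ["metagenomic sequence", "species", "k-mer",
                         "genome", "nucleotides", "DNA", "CAMDA", "MetaSUB"],
        "machine learning": ["classification", "deep learning",
                             "neural network", "deep forest", "Kraken"],
        "data preprocessing": ["unknown data", "noise", "data reduction",
                               "features", "Big Data", "data cleaning"],
        "model evaluation": ["performance", "F1 measure", "false positive",
                             "accuracy", "benchmarking", "model validation",
                             "T-test"],
    },
    "start_year": 2008,
    "end_year": 2019,
    "min_relevant_properties": 3,
}
```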
2.2. Study Selection and Eligibility Criteria
After collecting the initial set of relevant studies, further filtering is done using the PRISMA
methodology [12]. The workflow of the methodology is illustrated in Figure 1. The identification phase
was already described above in Section 2.1. It is followed by the screening phase, which removes the
duplicate articles, articles not published in the specified time period, and articles for which the title or
abstract could not be analyzed due to parsing errors or unavailability. In the eligibility phase, the NLP
toolkit applies natural language processing techniques to further filter the articles. In short, the toolkit
takes the titles and abstracts of the articles and performs tokenization of sentences, English stop word
removal, and stemming and lemmatization of the words [15]. Next, the stemmed and lemmatized
words are matched against the set of properties that are given as input. If the article contains the
required amount of relevant properties (in our case, 3, as shown in Table 1), the article is marked as
relevant. The toolkit then automatically generates a BibTeX file containing the citations to the relevant articles and an Excel file containing the Digital Object Identifier (DOI), link, title, authors, publication date, publication year, number of citations, abstract, keyword, source, publication title, affiliations, number of different affiliations, countries, number of different countries, number of authors, BibTeX cite key, number of found property groups, and number of found properties. This reduced set of
relevant articles can then be analyzed further manually for potential inclusion in the qualitative and
quantitative synthesis. While the manual part of the review cannot be avoided, the toolkit helps by
reducing the domain of potentially interesting articles, making it easier for the researcher to find
interesting articles relevant for the research topic, while reducing the number of irrelevant articles the
researcher has to read through in the process.
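The eligibility step described above can be sketched as follows. This is a simplified stand-in with hypothetical function names: a naive lowercase/plural normalization replaces the toolkit's actual stemming and lemmatization [15], and a multi-word property is counted as matched when all of its words appear in the title or abstract.

```python
import re

def normalize(text):
    """Crude stand-in for tokenization, stop word removal, and stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t.rstrip("s") for t in tokens}  # strip plural 's' as fake stemming

def count_matched_properties(title, abstract, properties):
    """Count how many input properties occur in the title or abstract."""
    words = normalize(title + " " + abstract)
    matched = 0
    for prop in properties:
        # a multi-word property matches only if all of its words appear
        if normalize(prop) <= words:
            matched += 1
    return matched

def is_relevant(title, abstract, properties, min_relevant=3):
    """Mark an article relevant if it contains enough properties (3 in Table 1)."""
    return count_matched_properties(title, abstract, properties) >= min_relevant
```

For example, an article titled "A k-mer based metagenomic classification benchmark" whose abstract mentions accuracy would match three of the properties `["metagenomic classification", "accuracy", "k-mer", "noise"]` and be marked relevant.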
Figure 1. PRISMA statement workflow with total number of articles for the current survey.
2.3. Charting the Data
To visualize the results, the relevant articles are aggregated according to several criteria:
- source (digital library) and relevance selection criteria
- publication year
- digital library and publication year
- search keyword and digital library
- search keyword and year
- property group and year
- property and year, generating separate charts for each property group
- number of countries, distinct affiliations, and authors, aiming to simplify the identification of collaboration patterns (e.g., written by multiple authors with different affiliations).
These aggregated metrics are available in the form of CSV files and charts. The plotting of the aggregate results was integrated and streamlined using the Matplotlib library [16] and NetworkX [17].
The NLP toolkit enables graph visualization of the results, where the nodes are the properties,
and the edges have weights determined by the number of articles that contain the two properties the
edge connects. Articles that do not contain at least two properties and properties that were not present
in at least two articles were excluded.
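The graph construction described above can be sketched with NetworkX as follows. The article/property sets are toy values, and the toolkit's exact filtering logic may differ; here articles with fewer than two properties are skipped and properties present in fewer than two articles are dropped from the graph.

```python
import itertools
from collections import Counter

import networkx as nx

# Toy property sets per article; illustrative only.
article_properties = [
    {"classification", "genome", "accuracy", "Kraken"},
    {"classification", "genome"},
    {"accuracy", "benchmarking"},
    {"benchmarking", "classification"},
    {"noise"},  # fewer than two properties -> article excluded
]

G = nx.Graph()
for props in article_properties:
    if len(props) < 2:
        continue
    for u, v in itertools.combinations(sorted(props), 2):
        # increment the co-occurrence weight for this property pair
        w = G.get_edge_data(u, v, default={"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

# drop properties that were not present in at least two articles
prop_counts = Counter(p for props in article_properties for p in props)
G.remove_nodes_from([n for n in list(G) if prop_counts[n] < 2])
```

In this toy data, "classification" and "genome" co-occur in two articles, so their edge weight is 2, while "Kraken" appears in only one article and is removed.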
A similar graph for the countries of affiliations was generated. The top 50 countries by the number
of collaborations were considered for this graph. Countries and an edge between them were shown if
the number of bilateral or multilateral collaborations was in the top 10% (above 90th percentile) within
those 50 countries.
3. Results
For this scoping review, it can be seen in Figure 1 that initially, 30,831 articles were identified in the database search phase. After the removal of duplicates, this number dropped to 18,040, which is almost half. Further screening reduced this number to 12,871. Applying the eligibility criteria described
above left us with 11,764 articles to manually analyze, which was slightly over one-third of the initial
articles discovered in the identification phase. We provide all the data generated by the NLP toolkit for
reproduction purposes as supplementary material available at [18].
The numbers of collected articles, duplicates, articles with invalid time span or incomplete data,
and relevant articles for each digital library are shown in Figure 2. The majority of articles considered
are from PubMed and Springer with IEEE Xplore having negligible representation. Springer has the
largest amount of initial articles; however, most of them are duplicates. In the end, the highest amount
of relevant articles were drawn from PubMed. If we look at the number of relevant articles per year
from each source presented in Figure 3, it can be seen that PubMed is consistently the prevalent digital
library source when it comes to metagenomic sequencing; however, Springer was narrowing this gap
in 2019. One additional thing to note in Figure 3 is that the data were collected on 8 November 2019,
and as a result, the data for 2019 are incomplete. Even though only publications in the last two months
of 2019 were not considered, due to the publication and indexing delays, this difference might be
actually larger, and more articles published in 2019 would be excluded from the analysis. The same
observation also applies to all other figures in the continuation of this section.
Figure 2. Number of articles collected and discarded in each PRISMA phase from each digital library.
Figure 3. Number of relevant articles per year from 2008 to November 2019, aggregated by digital library source.
It is also interesting to see whether the total amount of relevant articles from all sources is
increasing or decreasing over the years, which gives an idea of whether the research topic is becoming
more or less trendy. The number of collected and relevant articles between 2008 and 2019 is shown
in Figure 4. It can be seen that metagenomic sequencing has been a popular research topic in the past
decade and continues to stay relevant going forward.
Figure 4. Number of collected and analyzed articles vs. number of relevant articles per year from 2008 to November 2019.
3.1. Geographical Distribution and Collaboration Evidence
Another point of interest is which countries are producing the most relevant papers in the
field of metagenomic sequencing and whether collaboration between countries exists. The toolkit
presents this information in a weighted property graph where the node properties include the country
name and number of published articles discovered, while the edge attribute (weight) is the number
of joint articles with authors from the two countries. For clarity, only the pairs with a number of
collaborations greater than the 90th percentile are illustrated in Figure 5. The graph covers 32 countries
(nodes) and 65 collaborations (edges). Collaborations are color-coded, with stronger ones in violet and weaker ones pale. The same holds for the nodes and
the number of articles. We can easily distinguish several hubs: the United States, China, Germany,
the United Kingdom, Japan, France, and Italy with a large number of articles, as well as many
collaborations with the rest of the hubs. An interesting observation is that while the United States
is the biggest hub, if the European Union countries are aggregated together, they amount to a total
number of 3262 articles, which outnumbers the U.S., making Europe the leader in research in the field
of metagenomic sequencing. It is also interesting to notice that the U.S. is collaborating with China the
most, while the EU countries collaborate more with the U.S. and China than among themselves.
3.2. Keyword Statistics
As already mentioned in Section 2.1, the articles were discovered using keywords (properties)
to query the digital libraries. Metagenomic sequencing is a very broad field, and as a result, it is
important to be able to identify the most trendy topics within the field. The distribution of relevant
articles with respect to the publication year is shown in Figure 6. Note that the results for 2019 are
truncated due to the analysis being done until November. In addition, the internals of their search
engines are not known, meaning that the libraries might differ in the way they look for the keywords:
only in the title, keywords section, abstract, or the whole article. It can be seen that the interest in
genome sequencing is slightly increasing over the years, with a shift in interest from fast genome
comparison to data preprocessing and model evaluation. This is in line with the rise of the popularity
of the CAMDA challenges in the past two years, as well as the MetaSUB data set used in the CAMDA
forensics challenge, where dealing with a large volume of data and adequate preprocessing is crucial
to obtain good results.
Figure 5. Number of research papers per country and collaboration links with the annotated number of joint articles.
On the other hand, the distribution of relevant articles with respect to the digital library of
publishing is shown in Figure 7. We have already seen that IEEE had an insignificant amount of relevant
articles; however, it is interesting that the relevant articles from PubMed focus almost exclusively
on model evaluation for genome sequencing, while Springer is more focused on the anatomy of the
models, the classification process, as well as the data preprocessing techniques. In short, PubMed is
more focused on the state-of-the-art approaches for constructing the data sets, while Springer is more
focused on processing these data sets and constructing classifiers that work with them.
Figure 6. Distribution of the number of relevant articles for each keyword with respect to the publication year from 2008 to November 2019.
Figure 7. Distribution of the number of relevant articles for each keyword with respect to the digital library.
3.3. Property Statistics
Before going into deeper discussions for each property group, it is worth observing the annual
distributions of relevant articles across property groups, shown in Figure 8. A slight increase in
relevant articles can be noticed over the years; however, the relative difference between the number
of relevant articles for the different property groups remains the same. With machine learning lately
being at the center of attention in the computer science research world, it is a positive surprise that the
metagenomics property group takes the lead with the most relevant articles consistently every year,
indicating that bioinformatics is a greatly trending topic in computer science even outside of the
machine learning domain. It is then followed by the property groups: machine learning, model
evaluation, and data preprocessing, respectively. The low amount of relevant articles related to data
preprocessing may indicate that there is much opportunity for novel ideas, explaining the motivation
behind the CAMDA challenge where one of the biggest goals is to efficiently deal with large amounts
of uncategorized data.
Furthermore, Figure 9 illustrates a weighted property graph with properties as nodes along with
the number of relevant articles discovered containing each property and the number of co-occurrences
in relevant articles between properties as edge weights. For clarity reasons, only the pairs with
a number of co-occurrences greater than the 75th percentile are illustrated. Genome classification is
clearly the most popular combination. It is worth noting that model benchmarking also appears to be
a popular topic and is strongly linked with statistical tests.
Figure 8. Annual distribution of the number of relevant articles for each property group.
Figure 9. Number of relevant articles per property and number of co-occurrences of properties in relevant articles.
4. Discussion
The following sections provide an in-depth analysis of the most recent research related to the
property groups (metagenomics, machine learning, model evaluation, and data preprocessing). Each of
the property groups has a dedicated section for discussing the latest trends, tools, and inventions that
relate to our primary research topic: metagenomic classification.
4.1. Metagenomics
The literature, as summarized in Figure 10, shows that metagenomics as a field of study
is becoming an increasingly popular research topic. The amount of relevant articles in the field is
steadily increasing over the years as metagenomic analysis finds applications in many fields, including
medicine [3], waste management [6], and forensics [8]. Computer science is not an exception, as indicated
by the high amount of relevant articles on metagenomic classification and DNA sequencing.
Figure 10. Number of relevant articles containing each property grouped into the metagenomics domain.
4.1.1. Metagenomics in the Real World
The real-world applications of metagenomics are abundant. According to the work done
in [19], research trends between 1996 and 2016 show that metagenomics is applied in the following
fields (sorted in increasing order according to the number of publications per field): neuroscience,
pharmacology, toxicology, chemistry and chemical engineering, mathematics, computer science,
environmental science, agricultural and biological sciences, immunology and microbiology, medicine,
biochemistry, genetics, and molecular biology. Most of the documented work (69.95%) is published
as articles, followed by reviews and conference papers.
It is no surprise that medicine and biochemistry are the top fields, given that the roots
of metagenomics hail from biology. There are applications of metagenomics in the human gut
microbiome [20], where next-generation sequencing technology is used to study intestinal microbiome
diversity and dysbiosis, leading to the identification of new functional genes, microbial pathways,
and antibiotic resistance genes. However, the work also mentions that there are still some limitations
including difficulties identifying microbial expression and the need for higher sequence coverage than
the one provided by the 16S rDNA sequence analysis [21]. International projects studying the diversity
of the human gut microbiome include the European project MetaHIT [22] and the American Human Microbiome Project [23].
Next-generation sequencing technology is also used for pathogen detection [24], as nearly
all infectious agents contain DNA or RNA genomes. This raises the need for an optimized
sequencing methodology that will allow the simultaneous and independent sequencing of billions
of DNA fragments. Metagenomic Next-Generation Sequencing (mNGS) can be targeted towards
microbial culture samples or untargeted. The untargeted approaches use shotgun sequencing [25] of clinical samples, whereas targeted approaches are based on singleplex or multiplex Polymerase Chain Reaction (PCR) [26], primer extension [27], or the more modern bait probe enrichment methods [28]
to restrict detection to a list of targets. In summary, the study claims that mNGS has reduced the
cost of high-throughput sequencing by several orders of magnitude since 2004. In addition, the work
in [29] shows that mNGS approaches can be effectively used to assist the diagnosis of bloodstream infections using pathogen detection, while [30] shows that mNGS can be used to predict antibiotic and
antiviral resistance.
On top of antibiotic resistance, metagenomics is also applied in the field of pharmacy [31], where
different techniques are used to understand the effect of antibiotics on microbial communities in
order to synthesize new antibiotics that are highly effective against a target pathogen. In the absence
of cultured microorganisms, metagenomics provides a strong alternative to research microbes and
potentially finds their weaknesses. Techniques applied include descriptive metagenomics, where the
goal is to describe the structure of the microbial populations, and functional metagenomics, where new
antimicrobials can be found by analyzing the absence of certain pathogens in different environments.
4.1.2. Metagenomics in the Context of CAMDA and MetaSUB
In addition to the clinical applications of metagenomics, the field is extensively researched in the Critical Assessment of Massive Data Analysis (CAMDA) [9] conferences. These conferences provide annual challenges in the field of metagenomics based on the MetaSUB data set of microorganism samples collected from subways all over the world. Work has been done to unravel bacterial fingerprints of city subways from microbiome 16S gene profiles [32–34] and to show that bacterial composition differs significantly across cities. This is of crucial importance, as it potentially allows us to deduce the location of a given sample, which has many applications. For instance, we could train a model to identify where a person has been based on bacterial samples from his/her skin left over from the subways he/she was using.
One of the biggest challenges of the CAMDA conferences is dealing with extremely large quantities of data. Abundance-based machine learning techniques [35] have proven effective on the MetaSUB data set in the attempt to classify unknown samples. In this work, two approaches are proposed: the first is a read-based taxonomic profiling of each sample, and the second is a reduced-representation assembly-based method. It is interesting to note that, out of the various machine learning techniques tested, random forests in particular have shown very good results, with an accuracy of 91% (95% confidence interval between 80% and 93%) on the read-based taxonomic profiling and 90% accuracy on the assembly-based model.
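Abundance-based methods of this kind start from per-sample relative-abundance feature vectors. As a minimal sketch (the taxa and reads below are invented, and real pipelines derive per-read assignments from a taxonomic profiler rather than receiving them directly):

```python
from collections import Counter

def abundance_profile(read_taxa, taxa_order):
    """Turn per-read taxonomic assignments into a relative-abundance
    feature vector, with one entry per taxon in a fixed order."""
    counts = Counter(read_taxa)
    total = sum(counts.values()) or 1  # avoid division by zero for empty samples
    return [counts.get(t, 0) / total for t in taxa_order]

# A toy sample: four reads, three taxa tracked as features.
taxa = ["E. coli", "S. aureus", "B. subtilis"]
reads = ["E. coli", "E. coli", "S. aureus", "E. coli"]
print(abundance_profile(reads, taxa))  # [0.75, 0.25, 0.0]
```

Vectors like these can then be fed to any standard classifier, such as the random forests mentioned above.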
Another challenge is the large amount of unlabeled data in the data set. One solution is the MetaBinG2 model [36]. This model creates a reference database for caching and populates it with mappings between the unknown samples from the data set and their estimated "real" labels. The label estimation for a sequence of length k is obtained using a kth-order Markov model with transition probabilities to all other possible sequences (4^(k+1) in total, given there are four nucleotides) based on sequence (k-mer substring) overlap. This model has been tested on GPUs for performance gains, and results show that a million 100 bp Illumina sequences can be classified in about 1 min on a computer with one GPU card.
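The core of such a Markov-model scorer can be sketched as follows (a heavily simplified illustration, not the MetaBinG2 implementation): transition probabilities P(next base | preceding k-mer) are estimated from a reference genome, and a read is scored by its log-probability under the model.

```python
import math
from collections import defaultdict

def train_markov(genome, k):
    """Estimate k-th order transition probabilities P(next base | preceding k-mer)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - k):
        counts[genome[i:i + k]][genome[i + k]] += 1
    return {ctx: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for ctx, nxt in counts.items()}

def log_score(seq, probs, k, floor=1e-6):
    """Log-probability of a read under the model; unseen transitions
    receive a small floor probability instead of zero."""
    return sum(math.log(probs.get(seq[i:i + k], {}).get(seq[i + k], floor))
               for i in range(len(seq) - k))

model = train_markov("ACGTACGTACGT", k=2)
# A read matching the reference scores higher than an unrelated one.
print(log_score("ACGTAC", model, 2) > log_score("TTTTTT", model, 2))  # True
```

In a classifier, a read would be scored against one model per candidate genome and assigned the label of the highest-scoring model.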
While these are all significant findings, it is interesting to explore what else can be done with
modern machine learning approaches in the scope of the CAMDA MetaSUB challenges. In the
following sections, we investigate more deeply what can be done with various models in metagenomics,
how to properly evaluate the models, and what can be done to clean the data before model training.
4.2. Machine Learning Techniques and Metagenomics
Machine learning is the current buzzword in computer science, especially after the boom of deep learning and convolutional neural networks. The number of relevant articles for each property within the machine learning property group is shown in Figure 11. Classification is the main task in supervised machine learning, and it continues to be an interesting field of research. Notable is the increase in the popularity of deep learning and neural networks over the past two years, a trend that is likely to continue. Next, we discuss the current machine learning trends in metagenomic classification that were picked up by this scoping review.
Figure 11. Number of relevant articles containing each property grouped into the machine learning domain.
4.2.1. K-mer-Based Approaches
The simplest approach to classifying metagenomic sequences is the Kraken classifier [37]. The idea is simple and revolves around building a database that stores the taxonomy tree of genomes. The database has records that map every k-mer (a subsequence of length k) to its Lowest Common Ancestor (LCA) in the taxonomy tree. To classify a sequence, a result set of LCAs is constructed from each k-mer in the sequence, after which a label is determined according to the number of hits. If no k-mers have a mapping in the database, the sequence is not assigned a class. This is a very basic lazy learning approach relying on a pre-built database, and it is used in many other works as a baseline for performance evaluation. It is worth noting that there have been several attempts to optimize the basic Kraken classifier, including KrakenUniq [38] (an adaptation of Kraken that contains an algorithm for assessing the coverage of unique k-mers found in each species in a data set) and LiveKraken [39] (a real-time classifier based on Kraken).
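The idea can be illustrated with a toy sketch (the real Kraken precomputes k-mer-to-LCA mappings over the full taxonomy and uses a highly compact database; the taxonomy and k-mer database below are invented):

```python
from collections import Counter

# Toy taxonomy: child -> parent ("root" is its own parent).
PARENT = {"root": "root", "Bacteria": "root",
          "E. coli": "Bacteria", "S. aureus": "Bacteria"}

def ancestors(node):
    path = [node]
    while PARENT[node] != node:
        node = PARENT[node]
        path.append(node)
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)

def classify(seq, k, kmer_db):
    """Kraken-style lazy classification: count each k-mer's database hit
    and return the most frequent taxon, or None if nothing matched."""
    hits = Counter(kmer_db[seq[i:i + k]]
                   for i in range(len(seq) - k + 1) if seq[i:i + k] in kmer_db)
    return hits.most_common(1)[0][0] if hits else None

kmer_db = {"ACG": "E. coli", "CGT": "E. coli", "TTT": "S. aureus"}
print(classify("ACGT", 3, kmer_db))   # E. coli
print(classify("GGGG", 3, kmer_db))   # None
print(lca("E. coli", "S. aureus"))    # Bacteria
```

The `lca` helper shows how a k-mer shared by multiple genomes would be mapped to their common ancestor when the database is built.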
In the previous section, it was mentioned that ensemble models, in particular random forests, have proven quite effective in the CAMDA MetaSUB challenges. This is not a novel idea, as random forest-based classifiers for metagenomic sequences [40,41] date back to as early as 2013. These approaches apply binning techniques to fragments obtained through Whole Genome Shotgun (WGS) sequencing. However, these methods are now outdated and outperformed even by the simple Kraken classifier. A more modern solution uses deep forests based on the phylogenetic tree [42]. Namely, this approach extends a standard deep forest model by embedding phylogenetic tree information to obtain the so-called cascade deep forest. Each cascade consists of two parts: random forests and completely random forests. In completely random forests, the features chosen for each split are picked at random until all features are used, as opposed to random forests, where splits are chosen to maximize information gain using some metric (entropy, Gini index, etc.). Each of
the forests produces a class vector, which, together with the original feature vector, is forwarded as input to the next cascade. By connecting multiple cascades, we obtain the deep random forest. We can add as many cascades as we see fit considering the complexity of the task at hand, and in this way, we define a structure similar in concept to Deep Neural Networks (DNNs). Furthermore, the cascades can be flattened out to reduce the number of decision trees per cascade and increase the depth. The authors claimed that the performance of this model is competitive with DNNs, with the following advantages: (1) it can adaptively determine the depth of the model; and (2) there are fewer parameters to tune.
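The cascade plumbing can be sketched abstractly, stubbing each "forest" as any function from a feature vector to a class-probability vector (a structural illustration only, not the authors' implementation):

```python
def cascade_layer(forests, X):
    """One cascade level: every forest maps a feature vector to a
    class-probability vector, and the concatenated outputs are appended
    to the original features before being passed to the next cascade."""
    return [x + [p for f in forests for p in f(x)] for x in X]

# Two stub "forests" for a 2-class problem (real ones would be trained
# random forests and completely random forests).
forest_a = lambda x: [0.7, 0.3]
forest_b = lambda x: [0.4, 0.6]

X = [[1.0, 2.0]]
layer1 = cascade_layer([forest_a, forest_b], X)
print(layer1)  # [[1.0, 2.0, 0.7, 0.3, 0.4, 0.6]]
```

Stacking calls to `cascade_layer` reproduces the deep-forest structure: each level sees the raw features plus the class estimates of the previous level.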
Another approach is the MetaNN model [43], based on deep neural networks. This model is implemented in the following pipeline:
1. Extract the raw data set;
2. Filter out microbes that appear in less than 10% of the total samples for each data set (preprocessing);
3. Create an augmented data set by fitting a Negative Binomial (NB) distribution to the training data and then sampling the fitted distribution;
4. Train a DNN (multilayer perceptron or convolutional neural network) on the augmented training set.
When using a Multilayer Perceptron (MLP), the authors used two to three hidden layers to avoid over-fitting the microbial data. When using a Convolutional Neural Network (CNN), the authors arranged the bacterial species based on their taxonomic annotation, ordered them alphabetically, and concatenated their taxonomies (phylum, class, order, family, and genus). As a result, the CNN is able to extract the evolutionary relationships encoded in the phylogenetic sorting. The authors claimed that their model outperforms several other popular classification models, including Support Vector Machines (SVM), random forest, gradient boosting, logistic regression, and multinomial naive Bayes.
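The NB-based augmentation step of the pipeline can be sketched in pure Python by drawing negative binomial counts as a Gamma–Poisson mixture (a hedged illustration: the dispersion value and helper functions are our own assumptions, not MetaNN's actual parametrization):

```python
import random, math

def sample_poisson(lam):
    """Poisson draw via Knuth's method; adequate for the small means used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def sample_negative_binomial(mean, dispersion):
    """NB draw as a Gamma-Poisson mixture: lam ~ Gamma(r, mean/r), count ~ Poisson(lam)."""
    lam = random.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(lam)

def augment(counts, n_new, dispersion=2.0):
    """Generate synthetic abundance vectors scattered around an observed sample."""
    return [[sample_negative_binomial(max(c, 1e-3), dispersion) for c in counts]
            for _ in range(n_new)]

random.seed(0)
synthetic = augment([5, 0, 10], n_new=3)
```

Each synthetic row mimics the over-dispersed count structure of microbial abundance data, giving the DNN more training examples per class.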
Alternative classification approaches adopt one-class classification methods such as isolation forest, Local Outlier Factor (LOF), and mixed ensembles [44,45]. These methods appear particularly useful in challenging domains with extremely imbalanced data sets, such as those consisting of almost only negative data sequences.
4.2.2. Non-K-mer-Based Approaches
One of the biggest weaknesses of k-mer-based classifiers is their weak scalability to large amounts of data [46], as the size of the data set influences both the training time and the accuracy of the models. To combat this, some approaches utilize feature extraction to simplify the data set and reduce dimensionality. One such approach is the use of Non-negative Matrix Factorization (NMF) [47].
The main idea is to represent the data points as a linear combination of non-negative features that can be computed from the data. Namely, a non-negative p × n matrix X can be approximated by TW, where T is a non-negative p × k matrix called the type (feature) matrix and W is a non-negative k × n weight matrix. If k is chosen such that (p + n) × k << np, the dimensionality of the data is significantly reduced. In the case of metagenomics, the X matrix counts the occurrences of genes in microbes, i.e., X_ij is the number of observations of gene i in sample j. This approach provides dimensionality reduction similar to Principal Component Analysis (PCA). The weight matrix W can then be calculated via non-negative Poisson regression of each sample in X on T. The weight matrix can then be used in a supervised classifier; however, it can also be applied in unsupervised learning approaches.
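To make the factorization concrete, here is a minimal pure-Python sketch of NMF using Lee–Seung multiplicative updates (an illustrative assumption: the cited work fits W via non-negative Poisson regression rather than these updates):

```python
import random

def nmf(X, k, iters=300, seed=0, eps=1e-9):
    """Approximate a non-negative p x n matrix X as T (p x k) times W (k x n)
    using Lee-Seung multiplicative updates for squared error."""
    rng = random.Random(seed)
    p, n = len(X), len(X[0])

    def matmul(A, B):
        return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def transpose(A):
        return [list(row) for row in zip(*A)]

    T = [[rng.random() for _ in range(k)] for _ in range(p)]
    W = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        Tt = transpose(T)
        num, den = matmul(Tt, X), matmul(matmul(Tt, T), W)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        Wt = transpose(W)
        num, den = matmul(X, Wt), matmul(T, matmul(W, Wt))
        T = [[T[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(p)]
    return T, W

# A rank-1 "gene x sample" count matrix is recovered almost exactly with k = 1.
X = [[2, 4], [1, 2], [3, 6]]
T, W = nmf(X, k=1)
```

The rows of W (one weight vector per sample) are the reduced representation that can be fed to a downstream classifier.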
Another approach is a lightweight metagenomic classifier using an extension of the Burrows–Wheeler Transform (eBWT) [48]. Similar to the previous approach, this model takes advantage of the combinatorial properties of the eBWT to minimize the internal memory required for the analysis of unknown sequences. The approach is alignment- and assembly-free (it does not perform subsequence aligning) and compares each unknown sequence in the sample to all known genomes in a given collection. As a result, the approach is very lightweight compared to k-mer approaches while providing competitive classification results.
Other non-k-mer-based approaches [49–51] perform link prediction for Gene Regulatory Network (GRN) reconstruction on homogeneous networks, where the known existing gene interactions are considered positive labeled links and all the possible remaining pairs of genes unlabeled links. These methods exploit predictive models to classify unknown gene regulations. On the other hand, reference [52] performed link prediction on heterogeneous networks with the aim of detecting associations between ncRNAs and diseases.
4.3. Model Evaluation
Finding the right classification model is just part of the job. The model needs to be evaluated
in a realistic manner, given the scope of the data. This section focuses on modern model evaluation
techniques in the metagenomics field. The number of relevant articles for each property within
the model evaluation property group is shown in Figure 12. It can be seen that model evaluation
and benchmarking are very hot topics, while statistical tests are a very popular tool to conduct
model evaluation.
Figure 12. Number of relevant articles containing each property grouped into the model evaluation domain.
4.3.1. Model Validation Metrics
To begin model evaluation, we first need a metric of model quality. Popular metrics include:
- Accuracy: the percentage of correct classifications on the testing set (ACC = (TP + TN) / (P + N));
- Precision (positive predictive value): the percentage of true positives among all positively classified samples (PPV = TP / (TP + FP));
- Recall (sensitivity or true positive rate): the percentage of true positives among all positive samples (TPR = TP / (TP + FN));
- Specificity (selectivity or true negative rate): the percentage of true negatives among all negative samples (TNR = TN / (TN + FP));
- F1 score: a metric taking both precision and recall into account (F1 = 2 · PPV · TPR / (PPV + TPR)).
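The formulas above translate directly into code; a minimal sketch computing all five metrics from confusion-matrix counts:

```python
def metrics(tp, tn, fp, fn):
    """Standard classification metrics computed from confusion-matrix counts."""
    ppv = tp / (tp + fp)                   # precision
    tpr = tp / (tp + fn)                   # recall / sensitivity
    tnr = tn / (tn + fp)                   # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy (P + N = all samples)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return {"accuracy": acc, "precision": ppv, "recall": tpr,
            "specificity": tnr, "f1": f1}

m = metrics(tp=8, tn=5, fp=2, fn=1)
print(m["precision"])  # 0.8
```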
These metrics can be calculated multiple times with different training and test partitions of the data set. Usually, the best approach is stratified k-fold cross-validation, where the data set is split into k partitions with equal class distributions, and the training is done k times with a different fold used for testing each time. This way, we obtain k values for each performance metric and can calculate their means, after which we can use the t-test to check whether the obtained mean is statistically significant or not.
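The stratified splitting step can be sketched as a simple round-robin assignment per class (library implementations typically also shuffle the samples first):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds while preserving class
    proportions: the samples of each class are dealt round-robin across folds."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for idxs in by_class.values():
        for j, idx in enumerate(idxs):
            folds[j % k].append(idx)
    return folds

print(stratified_folds(["a", "a", "a", "a", "b", "b"], 2))  # [[0, 2, 4], [1, 3, 5]]
```

Each fold then serves once as the test set while the remaining folds are used for training.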
While these metrics are very useful for model evaluation, they are very generic (they can be used to assess any machine learning model). Ideally, they should be complemented by additional domain-specific metrics for the problem at hand. In the case of metagenomics, where we are looking at DNA sequences, bit alignment scores are the most popular. In the case of the FunGAP genome annotation pipeline [53], several bit alignment scores are aggregated and used for evaluating the model: Pfam [54], Benchmarking Universal Single-Copy Orthologs (BUSCO) [55], and BLAST [56]. These are combined to calculate the so-called evidence score using the following formula:

Evidence score = (BLAST score × coverage) + BUSCO score + Pfam scores (1)
BLAST is the oldest of the three and generates a score based on bit overlap between the query sequence and one or more databases of known sequences. It is used as the basis of other, more sophisticated alignment tools such as MEGAN [57] and the already mentioned Kraken classifier. Pfam is more sophisticated and uses seed alignments to construct the database of protein domain families. It is designed with incremental updating in mind, meaning that when new sequences are released/discovered, it is very easy to add them to the database. BUSCO is an open-source tool that provides a quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. It is capable of picking up complete or partial (fragmented) matches between input query sequences and the database of known sequences.
4.3.2. Benchmarks
After defining the metrics that can be used to measure model performance, we need something to compare the values against. As already mentioned, Kraken is one simple model that serves as a good baseline. Alexa B. R. McIntyre, Rachid Ounit et al. provided a benchmark for performance evaluation of 11 metagenomic classifiers in their work "Comprehensive benchmarking and ensemble approaches for metagenomic classifiers" [58]. The tests were done on a large data set containing the data of 846 species. The classifiers in question follow different classification strategies:
1. K-mer-based (CLARK [59], CLARK-S [60], Kraken, Kraken_filtered, LMAT [61], naive Bayes classifier);
2. Alignment-based (BlastMegan_filtered, BlastMegan_filtered_liberal, DiamondMegan_filtered, MetaFlow [62]);
3. Marker-based (GOTTCHA [63], METAPhlAn [64], PhyloSift [65], PhyloSift_filtered).
The models are evaluated on classification at the genus, species, and subspecies levels. The metrics used to evaluate the extent of the problems caused by false positives are the Area Under the Precision-Recall curve (AUPR) and the F1 score. According to the mean AUPR (mAUPR), all tools perform best at the genus level (45.1% ≤ mAUPR ≤ 86.6%), with small decreases in performance at the species level (40.1% ≤ mAUPR ≤ 84.1%). Calls at the subspecies (strain) level show a more marked decrease in all measures for the subset of 12 data sets that include complete strain information (17.3% ≤ mAUPR ≤ 62.5%). It is interesting to note that for k-mer-based tools, adding an abundance threshold increases the precision and F1 score, making them competitive with the marker-based tools.
The study also showed that the tools can be combined pairwise into an ensemble to further increase precision in taxonomic classification. In this manner, the pair of GOTTCHA and Diamond-MEGAN, as well as BLAST-MEGAN paired with either Diamond-MEGAN, the naive Bayes classifier, or GOTTCHA, achieve precisions over 95%, while the other 24 pairs achieve precisions over 90%. In addition, memory consumption is also benchmarked by measuring maximum memory usage and the time to load files into memory with respect to file size. The results show that CLARK, CLARK-S, and Kraken seem to be the most memory-intensive, while PhyloSift and METAPhlAn seem to be the most memory-efficient.
All things considered, the authors provided us with the decision tree shown in Figure 13, which can help us choose a suitable model given the problem we want to solve and the constraints we have. We follow the tree by deciding whether our priority is decreasing the false-positive rate, decreasing the false-negative rate, memory efficiency when using large databases, using a single model or an ensemble of pairs, faster or slower processing time, etc.
Figure 13. Algorithm for tool selection. The decision tree first branches on the main priority (limiting false positives, limiting false negatives, or using the largest database), then on speed requirements and on whether a single tool or multiple tools may be used. The recommended options include DiamondMegan + GOTTCHA or a DiamondEnsemble majority vote; BlastMegan + LMAT or a BlastEnsemble majority vote; GOTTCHA, MetaPhlAn, or Kraken filtered/LMAT with an abundance filter; BlastMegan filtered; CLARK-S, Kraken, CLARK, or LMAT; a community predictor with abundance-based certainties; and DiamondMegan or BlastMegan.
These benchmarks can be used to evaluate any model, as they give us performance metrics for comparison. While they take various kinds of models based on different approaches into consideration and expose their strong and weak points, they do not consider one important thing: data quality. The next section explores ways to clean the input data.
4.4. Data Preprocessing
In practice, it has been shown many times that the difference between a clean data set and a dirty one can affect the final results more than a bad choice of classifier. While the optimal situation is to have clean data with the minimal relevant features and the best model for the task at hand, many would argue that the former is even more important than the latter. Hence, fields like data mining and data science are very important in the machine learning world. These disciplines put much emphasis on the so-called data preprocessing step before we even start looking into choosing the right model for the job [66]. Data preprocessing steps include, but are not limited to:
- Dealing with missing data in the form of missing features or class labels (removing records with missing data or replacing missing data by best-effort speculation about the missing value);
- Removing redundant features (features that contribute little to the class variance and/or features that are strongly correlated or derived mathematically from other features) [67];
- Data discretization (converting continuous data into discrete values);
- Removing records with outlier values for certain features;
- Data sampling if the data set is too big to process [68].
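As an illustration of the second step (removing strongly correlated features), a greedy filter based on Pearson correlation might look like this (the threshold and the greedy keep-first order are illustrative choices, not a prescribed method):

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(columns, threshold=0.95):
    """Greedily keep a feature column only if it is not highly correlated
    with an already-kept column; returns the indices of kept columns."""
    kept = []
    for i, col in enumerate(columns):
        if all(abs(pearson(col, columns[j])) < threshold for j in kept):
            kept.append(i)
    return kept

# Column 1 is 2x column 0 and column 2 is column 0 reversed, so both are dropped.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 0, 1, 0]]
print(drop_correlated(columns))  # [0, 3]
```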
Data curation is no simple task, and entire research projects are dedicated to cleaning data and extracting relevant data from the data set in order to reduce model training time and complexity [68]. The situation is no different in the field of metagenomics. The number of relevant articles for each property within the data preprocessing property group is shown in Figure 14. It can be seen that
feature extraction, dealing with unknown data, and data reduction are all popular topics, with feature
extraction becoming increasingly relevant since 2018.
Figure 14. Number of relevant articles containing each property grouped into the data preprocessing domain.
4.4.1. Dealing with Unknown (Unidentified) Sequences
One of the biggest issues with the large MetaSUB data set is that it contains many unknown (unidentified) sequences. Such discrepancies happen for various reasons, the biggest one being imprecise measurements and faulty measuring equipment. In the case of metagenomics, this problem arises from erroneous DNA sequencing when applying NGS techniques. One attempt to combat this problem is the Sequencing Error Correction in RNA-sequences (SEECER) [69] tool, which is based on Hidden Markov Models (HMMs). This method is capable of performing error correction of RNA-Seq data without the need for a reference genome. It can handle non-uniform coverage and alternative splicing and has been shown to outperform other similar error correction methods on diverse human RNA-Seq data sets.
Another tool for dealing with unmapped sequences is DecontaMiner [70]. Compared with SEECER, this tool does not actually correct the errors, and it needs a reference genome to work; however, it can help discover contaminating sequences that might be causing the sequencing errors leading to unknown sequences. Contaminating sequences can come from bacteria, fungi, and viruses. DecontaMiner runs on the data set and produces a visualization of the summary statistics and plots using D3 JavaScript libraries. This tool can easily be integrated into a data cleaning pipeline as a step to find potential contamination in the data, after which measures can be taken to reduce the sources of contamination. It would be interesting to see whether any of these tools can be used on the MetaSUB data set to reduce the number of unknown sequences and improve classification performance.
4.4.2. Feature Extraction and Data Reduction
In addition to cleaning the data set of erroneous data, it is also very useful to reduce the data set by removing data that are redundant, i.e., data that do not provide meaningful information to the classifier [66]. This can be a double-edged sword: if the data set is too big, it can increase model training time without providing better testing performance; however, by removing too much data, we can end up with a data set that is too small and not representative (bias has been introduced due to poor data reduction). In metagenomics, the data set consists of the reads obtained from NGS. One way to reduce the data is to filter the data set to include only sequences from a given list of species. This can be done with
the tool MetaObtainer [71]. The authors claimed that their tool works well on short reads, which is the biggest shortcoming of other similar tools. In addition, the list of sequences we want to filter does not necessarily have to contain known species; the tool can also find unknown species using reference genomes of species similar to the query sequence.
On top of reducing the data set, we can also reduce the dimensionality of the data, which can drastically speed up training for some models. This can safely be done for redundant features, i.e., features that do not give meaningful information about the class, are correlated, or are derived from other features in the data set. Amani Al-Ajlan and Achraf El Allali proposed a methodology [72] for feature selection using maximum Relevance Minimum Redundancy (mRMR) to find the most relevant features. The feature selection algorithm has shown good results in improving classification results from Support Vector Machine (SVM)-based models. Another feature extraction algorithm [73] was proposed for efficient metagenomic fragment binning. Binning is the process of grouping random fragments obtained from WGS data into groups. The algorithm uses sub-sequence blocks extracted from organism protein domains as features. Binning predictions are then made using a classifier, such as a naive Bayes classifier or a random forest. Besides feature extraction, dimensionality can be reduced via other techniques such as the popular eigenvector-based Principal Component Analysis (PCA) and the above-mentioned NMF.
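As an illustration of the last point, the first principal component can be computed with a simple power iteration on the covariance matrix (a minimal sketch; production code would use an optimized linear-algebra library and extract several components):

```python
def first_pc(X, iters=100):
    """First principal component of data X (rows = samples) via power
    iteration on the covariance matrix of the mean-centered data."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    C = [[row[j] - means[j] for j in range(d)] for row in X]  # centered data
    # Covariance matrix (d x d).
    cov = [[sum(C[i][a] * C[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Almost all variance lies along the first axis, so the first PC is ~(1, 0).
v = first_pc([[0, 0], [1, 0.1], [2, -0.1], [3, 0]])
```

Projecting each sample onto the leading components yields the reduced feature vectors used for faster training.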
To conclude, although it will not improve model training time or reduce dimensionality, compressing the data set may be useful for storing large databases when disk space is limited. One tool that performs CRAM-based compression in parallel, while also providing the user with taxonomic and assembly information generated during compression, is MetaCRAM [74]. The tool provides reference-based, lossless compression of metagenomic data while boasting two- to four-fold compression ratio improvements compared to gzip. The authors claim that the compressed file sizes are 2–13% of the original raw files.
5. Conclusions
This paper presents an overview of the state-of-the-art machine learning tools and techniques
for metagenomic classification from the last decade. Topics covered include general novelties in the
field of metagenomics, metagenomic sequencing, applied metagenomics in the CAMDA/MetaSUB
challenges, efficient machine learning models, model evaluation, and data preprocessing techniques.
In a way, this study provides an evolution timeline of metagenomics-related computer science research.
Our methodology applies the NLP toolkit to search three digital libraries (PubMed, IEEE, and Springer) for relevant papers in the research domain. The toolkit takes keywords organized as properties and property groups as input and is configurable with respect to how many properties need to be present in a publication for it to be considered relevant. The tool then uses the PRISMA
methodology to filter the results down to the core relevant research work. The property groups used
for this scoping review were the following (sorted in descending order according to the number of
relevant articles found): metagenomics, machine learning, model evaluation, and data preprocessing.
After a manual examination of the relevant articles, it can safely be said that insightful information
was obtained related to all the property groups.
This study confirms that metagenomics is a very hot topic in bioinformatics. More and more people are joining in on the CAMDA-organized challenges for metagenomic data processing.
The field is very broad, and all property groups represent valid subfields with a lot of potential for
further research, whether it is trying out different new classifiers on the data to get better performance
or solving the problems from a Big Data point of view by sanitizing the database, figuring out how to
handle metagenomic unknown sequences in the data set, or reducing the dimensionality of the data.
In all of these fields, we have baseline research that can be used as a starting point and then improved
upon incrementally, and the community seems to be very active and involved in continuing the trend
of organizing further machine learning challenges related to metagenomics.
One obvious shortcoming of this study is that the scoping is limited to three digital libraries.
Even though these libraries are very popular and rich with content, potentially interesting research
material may be located elsewhere. Furthermore, the technicalities of the search engines of the three
libraries remain unknown (amount and format of search results returned may differ). As a result,
the search queries used to obtain the results are the same for all platforms (no customization was
done to try to optimize the search results). All in all, the scoping review yielded a large number of
relevant articles, many of which provided us with truly insightful information, which is the main point
of the review.
Supplementary Materials:
Data S1: All data used for generating these figures are available online at https://zenodo.org/record/4289228#.X71rls1KguV, so that any interested researchers can use them to generate higher-quality charts.
Author Contributions:
Conceptualization, P.T., V.T., E.Z. and S.K.; methodology and software, V.T. and E.Z.;
validation, V.T., E.Z. and S.K.; formal analysis, investigation, resources, data curation, and writing, original
draft preparation, P.T.; writing, review and editing, P.T., V.T., E.Z., R.C., S.K., I.M.P., N.M.G., P.L. and T.L.-T.;
visualization, E.Z. and R.C.; supervision, S.K.; project administration, S.K. and V.T. All authors read and agreed to
the published version of the manuscript.
Funding:
This work was partially funded by FCT/MEC through national funds and co-funded by the
FEDER–PT2020 partnership agreement under the project UIDB/50008/2020 (Este trabalho é financiado pela
FCT/MEC através de fundos nacionais e cofinanciado pelo FEDER, no âmbito do Acordo de Parceria PT2020
no âmbito do projeto UIDB/50008/2020). This work was partially funded by National Funds through the
FCT-Foundation for Science and Technology, I.P., within the scope of the project UIDB/00742/2020.
Acknowledgments:
This article is based on work from COST Action IC1303–AAPELE–Architectures, Algorithms and Protocols for Enhanced Living Environments and COST Action CA16226–SHELD-ON–Indoor living space improvement: Smart Habitat for the Elderly, supported by COST (European Cooperation in Science and Technology). More information at www.cost.eu. Furthermore, we would like to thank the Politécnico de Viseu for their support.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
AUPR Area Under the Precision-Recall Curve
BAC Bacterial Artificial Chromosome
BUSCO Benchmarking Universal Single-Copy Orthologs
CAMDA Critical Assessment of Massive Data Analysis
CNN Convolutional Neural Network
DNN Deep Neural Network
DOI Digital Object Identifier
eBWT extension of Burrows–Wheeler Transform
HMM Hidden Markov Model
LCA Lowest Common Ancestor
MLP Multilayer Perceptron
mRMR Maximum Relevance Minimum Redundancy
NB Negative Binomial
NGS Next-generation Sequencing
NLP Natural Language Processing
NMF Non-negative Matrix Factorization
PCA Principal Component Analysis
PCR Polymerase Chain Reaction
PRISMA Preferred Reporting Items for Systematic Reviews and Meta-analyses
SEECER Sequencing Error Correction in RNA-sequences
SVM Support Vector Machine
WGS Whole Genome Shotgun
References
1. Asakawa, S.; Abe, I.; Kudoh, Y.; Kishi, N.; Wang, Y.; Kubota, R.; Kudoh, J.; Kawasaki, K.; Minoshima, S.; Shimizu, N. Human BAC library: Construction and rapid screening. Gene 1997, 191, 69–79. [CrossRef]
2. Steele, H.L.; Jaeger, K.E.; Daniel, R.; Streit, W.R. Advances in recovery of novel biocatalysts from metagenomes. J. Mol. Microbiol. Biotechnol. 2009, 16, 25–37. [CrossRef]
3. Virgin, H.W.; Todd, J.A. Metagenomics and personalized medicine. Cell 2011, 147, 44–56. [CrossRef]
4. Pires, I.M.; Marques, G.; Garcia, N.M.; Flórez-Revuelta, F.; Ponciano, V.; Oniani, S. A Research on the Classification and Applicability of the Mobile Health Applications. J. Pers. Med. 2020, 10, 11. [CrossRef] [PubMed]
5. Villasana, M.V.; Pires, I.M.; Sá, J.; Garcia, N.M.; Zdravevski, E.; Chorbev, I.; Lameski, P.; Flórez-Revuelta, F. Promotion of Healthy Nutrition and Physical Activity Lifestyles for Teenagers: A Systematic Literature Review of The Current Methodologies. J. Pers. Med. 2020, 10, 12. [CrossRef]
6. Mani, D.; Kumar, C. Biotechnological advances in bioremediation of heavy metals contaminated ecosystems: An overview with special reference to phytoremediation. Int. J. Environ. Sci. Technol. 2014, 11, 843–872. [CrossRef]
7. Pires, I.; Souza, G.; Junior, J. An Analysis of the Relation between Garbage Pickers and Women’s Health Risk. Acta Sci. Agric. 2020, 4, 12–16.
8. Pechal, J.L.; Crippen, T.L.; Benbow, M.E.; Tarone, A.M.; Dowd, S.; Tomberlin, J.K. The potential use of bacterial community succession in forensics as described by high throughput metagenomic sequencing. Int. J. Leg. Med. 2014, 128, 193–205. [CrossRef] [PubMed]
9. Kreil, D.P.; Hu, L. Proceedings of the Critical Assessment of Massive Data Analysis conferences: CAMDA 2011 (Vienna, Austria) and CAMDA 2012 (Long Beach, CA, USA). Syst. Biomed. 2013, 1. [CrossRef]
10. The MetaSUB International Consortium. The metagenomics and metadesign of the subways and urban biomes (MetaSUB) international consortium inaugural meeting report. Microbiome 2016, 4, 24.
11. Zdravevski, E.; Lameski, P.; Trajkovik, V.; Chorbev, I.; Goleva, R.; Pombo, N.; Garcia, N.M. Automation in systematic, scoping and rapid reviews by an NLP toolkit: A case study in enhanced living environments. In Enhanced Living Environments; Springer: Berlin, Germany, 2019; pp. 1–18.
12. Moher, D.; Shamseer, L.; Clarke, M.; Ghersi, D.; Liberati, A.; Petticrew, M.; Shekelle, P.; Stewart, L.A.; PRISMA-P Group. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Syst. Rev. 2015, 4, 1. [CrossRef] [PubMed]
13. Levac, D.; Colquhoun, H.; O’Brien, K.K. Scoping studies: Advancing the methodology. Implement. Sci. 2010, 5, 69. [CrossRef]
14. Loncar-Turukalo, T.; Zdravevski, E.; Machado da Silva, J.; Chouvarda, I.; Trajkovik, V. Literature on Wearable Technology for Connected Health: Scoping Review of Research Trends, Advances, and Barriers. J. Med. Internet Res. 2019, 21, e14017. [CrossRef] [PubMed]
15. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 22–27 June 2014; pp. 55–60.
16. Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95. [CrossRef]
17. Hagberg, A.; Swart, P.; Schult, D. Exploring Network Structure, Dynamics, and Function Using NetworkX; Technical Report; Los Alamos National Laboratory (LANL): Los Alamos, NM, USA, 2008.
18. Tonkovic, P.; Zdravevski, E.; Trajkovik, V. Metagenomic classification scoping review results. Zenodo 2020. [CrossRef]
19. Garrido-Cardenas, J.A.; Manzano-Agugliaro, F. The metagenomics worldwide research. Curr. Genet. 2017, 63, 819–829. [CrossRef]
20. Wang, W.L.; Xu, S.Y.; Ren, Z.G.; Tao, L.; Jiang, J.W.; Zheng, S.S. Application of metagenomics in the human gut microbiome. World J. Gastroenterol. 2015, 21, 803. [CrossRef]
21. Hold, G.L.; Pryde, S.E.; Russell, V.J.; Furrie, E.; Flint, H.J. Assessment of microbial diversity in human colonic samples by 16S rDNA sequence analysis. FEMS Microbiol. Ecol. 2002, 39, 33–39. [CrossRef]
22. Ehrlich, S.D.; MetaHIT Consortium. MetaHIT: The European Union Project on metagenomics of the human intestinal tract. In Metagenomics of the Human Body; Springer: Berlin, Germany, 2011; pp. 307–316.
23. Turnbaugh, P.J.; Ley, R.E.; Hamady, M.; Fraser-Liggett, C.M.; Knight, R.; Gordon, J.I. The human microbiome project. Nature 2007, 449, 804–810. [CrossRef]
24. Gu, W.; Miller, S.; Chiu, C.Y. Clinical metagenomic next-generation sequencing for pathogen detection. Annu. Rev. Pathol. Mech. Dis. 2019, 14, 319–338. [CrossRef]
25. Venter, J.C.; Adams, M.D.; Sutton, G.G.; Kerlavage, A.R.; Smith, H.O.; Hunkapiller, M. Shotgun sequencing of the human genome. Science 1998, 280, 1540–1542. [CrossRef]
26. Saiki, R.; Gyllenstein, U.; Erlich, H. Polymerase chain reaction. Science 1988, 239, 487. [CrossRef] [PubMed]
27. Goelet, P.; Knapp, M.R.; Anderson, S. Method for Determining Nucleotide Identity through Primer Extension. U.S. Patent 5,888,819, 30 March 1999.
28. Bossert, S.; Danforth, B.N. On the universality of target-enrichment baits for phylogenomic research. Methods Ecol. Evol. 2018, 9, 1453–1460. [CrossRef]
29. Greninger, A.L.; Naccache, S.N. Metagenomics to assist in the diagnosis of bloodstream infection. J. Appl. Lab. Med. 2019, 3, 643–653. [CrossRef] [PubMed]
30. Chiu, C.Y.; Miller, S.A. Clinical metagenomics. Nat. Rev. Genet. 2019, 20, 341. [CrossRef] [PubMed]
31. Garmendia, L.; Hernandez, A.; Sanchez, M.; Martinez, J. Metagenomics and antibiotics. Clin. Microbiol. Infect. 2012, 18, 27–31. [CrossRef]
32. Walker, A.R.; Grimes, T.L.; Datta, S.; Datta, S. Unraveling bacterial fingerprints of city subways from microbiome 16S gene profiles. Biol. Direct 2018, 13, 10. [CrossRef]
33. Ryan, F.J. Application of machine learning techniques for creating urban microbial fingerprints. Biol. Direct 2019, 14, 13. [CrossRef]
34. Zhu, C.; Miller, M.; Lusskin, N.; Mahlich, Y.; Wang, Y.; Zeng, Z.; Bromberg, Y. Fingerprinting cities: Differentiating subway microbiome functionality. Biol. Direct 2019, 14, 19. [CrossRef]
35. Harris, Z.N.; Dhungel, E.; Mosior, M.; Ahn, T.H. Massive metagenomic data analysis using abundance-based machine learning. Biol. Direct 2019, 14, 12. [CrossRef]
36. Qiao, Y.; Jia, B.; Hu, Z.; Sun, C.; Xiang, Y.; Wei, C. MetaBinG2: A fast and accurate metagenomic sequence classification system for samples with many unknown organisms. Biol. Direct 2018, 13, 1–21. [CrossRef] [PubMed]
37. Wood, D.E.; Salzberg, S.L. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014, 15, 1–12. [CrossRef] [PubMed]
38. Breitwieser, F.; Baker, D.; Salzberg, S.L. KrakenUniq: Confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018, 19, 1–10. [CrossRef] [PubMed]
39. Tausch, S.H.; Strauch, B.; Andrusch, A.; Loka, T.P.; Lindner, M.S.; Nitsche, A.; Renard, B.Y. LiveKraken—real-time metagenomic classification of Illumina data. Bioinformatics 2018, 34, 3750–3752. [CrossRef] [PubMed]
40. Saghir, H.; Megherbi, D.B. A random-forest-based efficient comparative machine learning predictive DNA-codon metagenomics binning technique for WMD events & applications. In Proceedings of the 2013 IEEE International Conference on Technologies for Homeland Security (HST), Waltham, MA, USA, 12–14 November 2013; pp. 171–177.
41. Saghir, H.; Megherbi, D.B. An efficient comparative machine learning-based metagenomics binning technique via using Random forest. In Proceedings of the 2013 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), Milan, Italy, 15–17 July 2013; pp. 191–196.
42. Zhu, Q.; Zhu, Q.; Pan, M.; Jiang, X.; Hu, X.; He, T. The phylogenetic tree based deep forest for metagenomic data classification. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 3–6 December 2018; pp. 279–282.
43. Lo, C.; Marculescu, R. MetaNN: Accurate classification of host phenotypes from metagenomic data using neural networks. BMC Bioinform. 2019, 20, 314. [CrossRef] [PubMed]
44. Kaufmann, J.; Asalone, K.; Corizzo, R.; Saldanha, C.; Bracht, J.; Japkowicz, N. One-Class Ensembles for Rare Genomic Sequences Identification. In International Conference on Discovery Science; Springer: Berlin, Germany, 2020; pp. 340–354.
45. Ceci, M.; Corizzo, R.; Japkowicz, N.; Mignone, P.; Pio, G. ECHAD: Embedding-Based Change Detection From Multivariate Time Series in Smart Grids. IEEE Access 2020, 8, 156053–156066. [CrossRef]
46. Nasko, D.J.; Koren, S.; Phillippy, A.M.; Treangen, T.J. RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification. Genome Biol. 2018, 19, 1–10. [CrossRef]
47. Cai, Y.; Gu, H.; Kenney, T. Learning microbial community structures with supervised and unsupervised non-negative matrix factorization. Microbiome 2017, 5, 110. [CrossRef]
48. Guerrini, V.; Rosone, G. Lightweight metagenomic classification via eBWT. In International Conference on Algorithms for Computational Biology; Springer: Berlin, Germany, 2019; pp. 112–124.
49. Cerulo, L.; Elkan, C.; Ceccarelli, M. Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinform. 2010, 11, 228. [CrossRef]
50. Mignone, P.; Pio, G. Positive unlabeled link prediction via transfer learning for gene network reconstruction. In Proceedings of the 24th International Symposium on Methodologies for Intelligent Systems, Limassol, Cyprus, 29–31 October 2018; Springer: Berlin, Germany, 2018; pp. 13–23.
51. Mignone, P.; Pio, G.; D’Elia, D.; Ceci, M. Exploiting transfer learning for the reconstruction of the human gene regulatory network. Bioinformatics 2020, 36, 1553–1561. [CrossRef]
52. Barracchia, E.P.; Pio, G.; D’Elia, D.; Ceci, M. Prediction of new associations between ncRNAs and diseases exploiting multi-type hierarchical clustering. BMC Bioinform. 2020, 21, 1–24. [CrossRef] [PubMed]
53. Min, B.; Grigoriev, I.V.; Choi, I.G. FunGAP: Fungal Genome Annotation Pipeline using evidence-based gene model evaluation. Bioinformatics 2017, 33, 2936–2937. [CrossRef] [PubMed]
54. Sonnhammer, E.L.; Eddy, S.R.; Durbin, R. Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins Struct. Funct. Bioinform. 1997, 28, 405–420. [CrossRef]
55. Seppey, M.; Manni, M.; Zdobnov, E.M. BUSCO: Assessing genome assembly and annotation completeness. In Gene Prediction; Springer: Berlin, Germany, 2019; pp. 227–245.
56. Korf, I.; Yandell, M.; Bedell, J. BLAST; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2003.
57. Huson, D.H.; Auch, A.F.; Qi, J.; Schuster, S.C. MEGAN analysis of metagenomic data. Genome Res. 2007, 17, 377–386. [CrossRef]
58. McIntyre, A.B.; Ounit, R.; Afshinnekoo, E.; Prill, R.J.; Hénaff, E.; Alexander, N.; Minot, S.S.; Danko, D.; Foox, J.; Ahsanuddin, S.; et al. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biol. 2017, 18, 182. [CrossRef]
59. Ounit, R.; Wanamaker, S.; Close, T.J.; Lonardi, S. CLARK: Fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genom. 2015, 16, 236. [CrossRef]
60. Ounit, R.; Lonardi, S. Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics 2016, 32, 3823–3825. [CrossRef]
61. Ames, S.K.; Hysom, D.A.; Gardner, S.N.; Lloyd, G.S.; Gokhale, M.B.; Allen, J.E. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 2013, 29, 2253–2260. [CrossRef]
62. Sobih, A.; Tomescu, A.I.; Mäkinen, V. MetaFlow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows. In Proceedings of the International Conference on Research in Computational Molecular Biology, Philadelphia, PA, USA, 22–23 August 2016; Springer: Berlin, Germany, 2016; pp. 111–121.
63. Freitas, T.; Chain, P.; Lo, C.C.; Li, P.E. GOTTCHA Database, Version 1; Technical Report; Los Alamos National Laboratory: Los Alamos, NM, USA, 2015.
64. Segata, N.; Waldron, L.; Ballarini, A.; Narasimhan, V.; Jousson, O.; Huttenhower, C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 2012, 9, 811–814. [CrossRef]
65. Darling, A.E.; Jospin, G.; Lowe, E.; Matsen, F.A., IV; Bik, H.M.; Eisen, J.A. PhyloSift: Phylogenetic analysis of genomes and metagenomes. PeerJ 2014, 2, e243. [CrossRef] [PubMed]
66. Zdravevski, E.; Lameski, P.; Trajkovik, V.; Kulakov, A.; Chorbev, I.; Goleva, R.; Pombo, N.; Garcia, N. Improving Activity Recognition Accuracy in Ambient-Assisted Living Systems by Automated Feature Engineering. IEEE Access 2017, 5, 5262–5280. [CrossRef]
67. Zdravevski, E.; Lameski, P.; Kulakov, A.; Jakimovski, B.; Filiposka, S.; Trajanov, D. Feature Ranking Based on Information Gain for Large Classification Problems with MapReduce. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20–22 August 2015; Volume 2, pp. 186–191. [CrossRef]
68. Zdravevski, E.; Lameski, P.; Apanowicz, C.; Slezak, D. From Big Data to business analytics: The case study of churn prediction. Appl. Soft Comput. 2020, 90, 106164. [CrossRef]
69. Le, H.S.; Schulz, M.H.; McCauley, B.M.; Hinman, V.F.; Bar-Joseph, Z. Probabilistic error correction for RNA sequencing. Nucleic Acids Res. 2013, 41, e109. [CrossRef] [PubMed]
70. Sangiovanni, M.; Granata, I.; Thind, A.S.; Guarracino, M.R. From trash to treasure: Detecting unexpected contamination in unmapped NGS data. BMC Bioinform. 2019, 20, 1–12. [CrossRef]
71. Pan, W.; Chen, B.; Xu, Y. MetaObtainer: A Tool for Obtaining Specified Species from Metagenomic Reads of Next-generation Sequencing. Interdiscip. Sci. Comput. Life Sci. 2015, 7, 405–413. [CrossRef]
72. Al-Ajlan, A.; El Allali, A. Feature selection for gene prediction in metagenomic fragments. BioData Min. 2018, 11, 9. [CrossRef] [PubMed]
73. Saghir, H.; Megherbi, D.B. Big data biology-based predictive models via DNA-metagenomics binning for WMD events applications. In Proceedings of the 2015 IEEE International Symposium on Technologies for Homeland Security (HST), Waltham, MA, USA, 14–16 April 2015; pp. 1–6.
74. Kim, M.; Zhang, X.; Ligo, J.G.; Farnoud, F.; Veeravalli, V.V.; Milenkovic, O. MetaCRAM: An integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinform. 2016, 17, 94. [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
... Contigs are longer sequences generated by assembly tools and are often 50 used to discover genes, that make up more complex reference catalogs. Sequenced reads 51 can be short (75 to 150 bp) or long (today with an average between 10 to 30k bp)( [14]) 52 -both having their advantages and drawbacks. 53 It is important here to remember the concept of k-mer. ...
... Methods like Predomics [50] allow discovering highly predictive and 209 very simple models that generalize well, while providing clear focus on the importance 210 of the features involved. Some interesting reviews of these methods have been done 211 by [51] and [52]. 212 Finally, in the specific context of metagenomics, ML faces different problems, 213 including the high-dimensional nature of the data compared to the number of samples, 214 the vast sparsity in the data and their compositionality nature. ...
Preprint
Full-text available
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analyzing metagenomic data remains challenging due to several factors, including reference catalogs, sparsity, and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification, and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews deep learning approaches in metagenomics, including convolutional networks (CNNs), autoencoders, and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome’s key role in our health. Author summary In our study, we look at the vast world of research in metagenomics, the study of genetic material from environmental samples, spurred by the increasing affordability of sequencing technologies. Our particular focus is the human gut microbiome, an environment teeming with microscopic life forms that plays a central role in our health and well-being. However, navigating through the vast amounts of data generated is not an easy task. Traditional methods hit roadblocks due to the unique nature of metagenomic data. That’s where deep learning (DL), a today well known branch of artificial intelligence, comes in. DL-based techniques complement existing methods and open up new avenues in microbiome research. 
They’re capable of tackling a wide range of tasks, from identifying unknown pathogens to predicting disease based on a patient’s unique microbiome. In our article, we provide a very comprehensive review of different DL strategies for metagenomics, including convolutional networks, autoencoders, and attention-based models. We are convinced that these techniques significantly enhance the field of metagenomic analysis in its entirety, paving the way for more accurate data analysis and, ultimately, better patient care. The PRISMA augmented diagram of our review is illustrated in Fig 1 .
... The complexity of these models obscures the logic driving their decision-making process, underlining the significance of 'interpretability' in the field of DL [69]. Some interesting reviews of these methods have already been published [70,71]. ...
Article
Full-text available
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome’s key role in our health.
... These span from the need for substantially larger data sets to improve predictions in clinically relevant tasks, to more precisely pinpointing microbiological aspects linked with relevant host characteristics and to the development and adoption of advanced deep learning approaches that are still suffering the high-dimensionality and low sample size of many microbiological applications 117 . The lack of precise and comprehensive metadata annotation of microbiological samples and their frequently very partial public availability are other factors currently limiting machine learning usage in this field, due to a combination of practical and ethical reasons 118 . Updated policies favouring open data sharing as well as supporting machine learning approaches such as semi-supervised learning can mitigate these issues in the future. ...
Article
Machine learning is increasingly important in microbiology where it is used for tasks such as predicting antibiotic resistance and associating human microbiome features with complex host diseases. The applications in microbiology are quickly expanding and the machine learning tools frequently used in basic and clinical research range from classification and regression to clustering and dimensionality reduction. In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists. We provide the minimal toolbox for a microbiologist to be able to understand, interpret and use machine learning in their experimental and translational activities.
... This process is repeated, forming a Stacked Convolutional Autoencoder (SCAE) that effectively captures hierarchical features in the data. The encoder and decoder structures are symmetric, allowing for the extraction of low-dimensional hierarchical features [38]. ...
Article
Full-text available
The application of deep learning for taxonomic categorization of DNA sequences is investigated in this study. Two deep learning architectures, namely the Stacked Convolutional Autoencoder (SCAE) with Multilabel Extreme Learning Machine (MLELM) and the Variational Convolutional Autoencoder (VCAE) with MLELM, have been proposed. These designs provide precise feature maps for individual and inter-label interactions within DNA sequences, capturing their spatial and temporal properties. The collected features are subsequently fed into MLELM networks, which yield soft classification scores and hard labels. The proposed algorithms underwent thorough training and testing on unsupervised data, whereby one or more labels were concurrently taken into account. The introduction of the clade label resulted in improved accuracy for both models compared to the class or genus labels, probably owing to the occurrence of large clusters of similar nucleotides inside a DNA strand. In all circumstances, the VCAE-MLELM model consistently outperformed the SCAE-MLELM model. The best accuracy attained by the VCAE-MLELM model when the clade and family labels were combined was 94%. However, accuracy ratings for single-label categorization using either approach were less than 65%. The approach’s effectiveness is based on MLELM networks, which record connected patterns across classes for accurate label categorization. This study advances deep learning in biological taxonomy by emphasizing the significance of combining numerous labels for increased classification accuracy.
... Some could benefit from Short Term Scientific Mission (STMS) grants (16 in total) to work with research teams in different countries on ML4Microbiome-related projects with the view to publish the results of their activities in peer-reviewed journals. 8 In terms of publication output, to date ML4Microbiome members have published work on specific ML applications for particular diseases, such as Cancer Diagnostics and Therapeutics (Cekikj et al., 2022), classification of patients with Celiac Disease (Arcila-Galvis et al., 2022), Coronary Artery Disease Risk Prediction (Vilne et al., 2022), novel paradigms in human gut microbiome metabolism , Parkinson's disease (Rosario et al., 2021), Type 2 Diabetes (Ruuskanen et al., 2022), oral and related gut diseases (Di Stefano et al., 2023), along with systematic or scoping reviews on ML applications on microbiome data (Tonkovic et al., 2020;Marcos-Zambrano et al., 2021) and its challenges and solutions (Moreno-Indias et al., 2021) of which all are available from the complete list of the Action's publications on the ML4Microbiome website. ...
Article
Full-text available
The rapid development of machine learning (ML) techniques has opened up the data-dense field of microbiome research for novel therapeutic, diagnostic, and prognostic applications targeting a wide range of disorders, which could substantially improve healthcare practices in the era of precision medicine. However, several challenges must be addressed to exploit the benefits of ML in this field fully. In particular, there is a need to establish “gold standard” protocols for conducting ML analysis experiments and improve interactions between microbiome researchers and ML experts. The Machine Learning Techniques in Human Microbiome Studies (ML4Microbiome) COST Action CA18131 is a European network established in 2019 to promote collaboration between discovery-oriented microbiome researchers and data-driven ML experts to optimize and standardize ML approaches for microbiome analysis. This perspective paper presents the key achievements of ML4Microbiome, which include identifying predictive and discriminatory ‘omics’ features, improving repeatability and comparability, developing automation procedures, and defining priority areas for the novel development of ML methods targeting the microbiome. The insights gained from ML4Microbiome will help to maximize the potential of ML in microbiome research and pave the way for new and improved healthcare practices.
... The application of AI/ML/DL/EL/SSL methods has already shown remarkable performance in other fields of women health care [30][31][32], using different types of data such as clinical data, computed tomography (CT), cardiotocography (CTG), electromyography (EMG), genomic, metabolomic, biophysical, and biochemical data [33][34][35][36]. ...
Article
Full-text available
Background: IntraUterine Growth Restriction (IUGR) is a global public health concern and has major implications for neonatal health. The early diagnosis of this condition is crucial for obtaining positive outcomes for the newborn. In recent years Artificial intelligence (AI) and machine learning (ML) techniques are being used to identify risk factors and provide early prediction of IUGR. We performed a systematic review (SR) and meta-analysis (MA) aimed to evaluate the use and performance of AI/ML models in detecting fetuses at risk of IUGR. Methods: We conducted a systematic review according to the PRISMA checklist. We searched for studies in all the principal medical databases (MEDLINE, EMBASE, CINAHL, Scopus, Web of Science, and Cochrane). To assess the quality of the studies we used the JBI and CASP tools. We performed a meta-analysis of the diagnostic test accuracy, along with the calculation of the pooled principal measures. Results: We included 20 studies reporting the use of AI/ML models for the prediction of IUGR. Out of these, 10 studies were used for the quantitative meta-analysis. The most common input variable to predict IUGR was the fetal heart rate variability (n = 8, 40%), followed by the biochemical or biological markers (n = 5, 25%), DNA profiling data (n = 2, 10%), Doppler indices (n = 3, 15%), MRI data (n = 1, 5%), and physiological, clinical, or socioeconomic data (n = 1, 5%). Overall, we found that AI/ML techniques could be effective in predicting and identifying fetuses at risk for IUGR during pregnancy with the following pooled overall diagnostic performance: sensitivity = 0.84 (95% CI 0.80-0.88), specificity = 0.87 (95% CI 0.83-0.90), positive predictive value = 0.78 (95% CI 0.68-0.86), negative predictive value = 0.91 (95% CI 0.86-0.94) and diagnostic odds ratio = 30.97 (95% CI 19.34-49.59). 
In detail, the RF-SVM (Random Forest-Support Vector Machine) model (with 97% accuracy) showed the best results in predicting IUGR from FHR parameters derived from CTG. Conclusions: our findings showed that AI/ML could be part of a more accurate and cost-effective screening method for IUGR and be of help in optimizing pregnancy outcomes. However, before the introduction into clinical daily practice, an appropriate algorithmic improvement and refinement is needed, and the importance of quality assessment and uniform diagnostic criteria should be further emphasized.
... More specifically, we are developing a risk prediction model based on regression algorithms. Existing prediction studies focusing on binary classification via one-class learning, deep learning, and ensemble learning have shown great promise in classifying biological data [12][13][14][15][16]. However, we are interested in finding out whether it is possible to quantify an individual's risk of malaria based on SNP genotype data for facilitating personalized prevention and treatment. ...
Article
Full-text available
In recent malaria research, the complexity of the disease has been explored using machine learning models via blood smear images, environmental, and even RNA-Seq data. However, a machine learning model based on genetic variation data is still required to fully explore individual malaria risk. Furthermore, many Genome-Wide Associations Studies (GWAS) have associated specific genetic markers, i.e., single nucleotide polymorphisms (SNPs), with malaria. Thus, the present study improves the current state-of-the-art genetic risk score by incorporating SNPs mutation location on large-scale genetic variation data obtained from GWAS. Nevertheless, it becomes computationally expensive for hyperparameter optimization on large-scale datasets. Therefore, this study proposes a machine learning model that incorporates mutation location as well as a Genetic Algorithm (GA) to optimize hyperparameters. Besides that, a deep learning model is also proposed to predict individual malaria risk as an alternative approach. The analysis is performed on the Malaria Genomic Epidemiology Network (MalariaGEN) dataset comprising 20,817 individuals from 11 populations. The findings of this study demonstrated that the proposed GA could overcome the curse of dimensionality and improve resource efficiency compared to commonly used methods. In addition, incorporating the mutation location significantly improved the machine learning models in predicting the individual malaria risk; a Mean Absolute Error (MAE) score of 8.00E−06. Moreover, the deep learning model obtained almost similar MAE scores to the machine learning models, indicating an alternative approach. Thus, this study provides relevant knowledge of genetic and technical deliberations that can improve the state-of-the-art methods for predicting individual malaria risk.
... Machine learning (ML) has emerged to play a vital role in bioinformatics due to its ability to handle exponentially increasing amount of data [18]. Many researchers have applied ML to DNA data [19][20][21][22]. In particular, several applications of ML in epigenomics have assisted medical professionals and researchers to perform human disease-related tasks such as disease detection, subtype classification, prognosis, and treatment response prediction [23][24][25][26][27][28]. ...
Article
Full-text available
DNA methylation modification plays a vital role in the pathophysiology of high blood pressure (BP). Herein, we applied three machine learning (ML) algorithms including deep learning (DL), support vector machine, and random forest for detecting high BP using DNA methylome data. Peripheral blood samples of 50 elderly individuals were collected three times at three visits for DNA methylome profiling. Participants who had a history of hypertension and/or current high BP measure were considered to have high BP. The whole dataset was randomly divided to conduct a nested five-group cross-validation for prediction performance. Data in each outer training set were independently normalized using a min–max scaler, reduced dimensionality using principal component analysis, then fed into three predictive algorithms. Of the three ML algorithms, DL achieved the best performance (AUPRC = 0.65, AUROC = 0.73, accuracy = 0.69, and F1-score = 0.73). To confirm the reliability of using DNA methylome as a biomarker for high BP, we constructed mixed-effects models and found that 61,694 methylation sites located in 15,523 intragenic regions and 16,754 intergenic regions were significantly associated with BP measures. Our proposed models pioneered the methodology of applying ML and DNA methylome data for early detection of high BP in clinical practices.
Thesis
Full-text available
Deep learning is one of the most prominent machine learning approaches today because of its capacity to autonomously extract features from massive volumes of data and automatically learn meaningful representations from them. Image and speech recognition, as well as robotics, are some of the domains where it is used. Deep learning has found usage in the biology area as a result of the recent increase of biological 'omics' data, including applications in early cancer detection and protein-protein interactions. In this research, we have used one-hot encoding on the DNA strands to convert the text into numbers and unique color platelets for each 4-mer in a sequence. Additionally, each DNA sequence number is tagged with labels such as; genus, family, order, class, phylum, and clade. The labels dataset, along with the one-hot encoding or image dataset, is then fed into the deep learning algorithms to classify the taxonomic labels of the DNA strand. The deep learning architectures proposed in this research are Stacked Convolutional Autoencoder (SCAE) with Multi-label Extreme Learning Machine (MLELM) and Variational Convolutional Autoencoder (VCAE) with MLELM. SCAE and VCAE generate the detailed feature map for individuals and between taxonomic labels of a DNA sequence from the one hot encoding of the DNA sequence input data by identifying the spatial and temporal salient qualities. The feature vector is then fed to the first MLELM network to produce a soft classification score for each data point. based on which the second MLELM network would generate hard labels. The suggested methods were excessively trained and tested on unsupervised data by considering one or more labels at a time. The model is also able to classify the DNA sequence characteristics based on the Phylogenetic tree. 
Through experimentation, it was found that both models achieve a better accuracy score when classifying the host of a DNA sequence using the clade label rather than the class or genus label, owing to the presence of large, similar groups of nucleotides within a DNA strand. Moreover, VCAE-MLELM performs much better than SCAE-MLELM under all circumstances because of its neural network structure. The highest accuracy obtained by the VCAE-MLELM model when classifying DNA sequences using the clade and family labels together is 94%, whereas SCAE-MLELM reaches 78% with the same clade-family labels. Single-label classification with either algorithm yields accuracy scores lower than 65%. It is the MLELM networks that make it possible to classify labels based on linked patterns between classes.
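The one-hot encoding of DNA strands that this thesis builds on can be illustrated with a short sketch (names are illustrative; the color-palette image encoding of 4-mers is only hinted at via k-mer extraction):

```python
# One nucleotide -> one 4-element indicator vector over the bases A, C, G, T.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator vectors."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def kmers(seq, k=4):
    """Enumerate the overlapping k-mers (here 4-mers) of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

encoded = one_hot("ACGT")
fourmers = kmers("ACGTA")
# Each encoded position has exactly one 1, marking the observed base;
# each 4-mer would then be mapped to its own color in the image encoding.
```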
Conference Paper
Surgical risk assessments are central to the decision-making process utilized by medical professionals in the planning and execution of surgical procedures. In this regard, several tools are currently relied upon by surgeons to evaluate potential outcomes and associated risk. A popular and widely used assessment tool is the American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) Universal Surgical Risk Calculator (ACS-SRC). One important limitation of this approach lies in its inability to exploit complex relationships among data variables, which may limit its classification accuracy. This paper addresses this gap by proposing a viable, low-resource, and generalized decision support model using machine learning techniques that emulates the same general risk assessment functionality as the ACS-SRC, but utilizes more complex models such as gradient-boosted trees and deep neural networks. The experimental results show that our approach performs competitively with respect to state-of-the-art tools.
Preprint
Full-text available
High-throughput sequencing (HTS) of metagenomes is proving essential in understanding the environment and diseases. State-of-the-art methods for discovering the species and their abundances in an HTS metagenomic sample are based on genome-specific markers, which can lead to skewed results, especially at species level. We present MetaFlow, the first method based on coverage analysis across entire genomes that also scales to HTS samples. We formulated this problem as an NP-hard matching problem in a bipartite graph, which we solved in practice by min-cost flows. On synthetic data sets of varying complexity and similarity, MetaFlow is more precise and sensitive than popular tools such as MetaPhlAn, mOTU, GSMer and BLAST, and its abundance estimations at species level are two to four times better in terms of ℓ1-norm. On a real human stool data set, MetaFlow identifies B. uniformis as most predominant, in line with previous human gut studies, whereas marker-based methods report it as rare. MetaFlow is freely available at http://cs.helsinki.fi/gsa/metaflow
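The ℓ1-norm criterion used above to compare abundance estimates is simply the sum of absolute per-species errors between the true and estimated abundance profiles. A toy sketch (species names and values are invented for illustration):

```python
def l1_abundance_error(true_abund, est_abund):
    """Sum of absolute differences between true and estimated relative
    abundances, taken over the union of species reported by either side."""
    species = set(true_abund) | set(est_abund)
    return sum(abs(true_abund.get(s, 0.0) - est_abund.get(s, 0.0))
               for s in species)

truth = {"B. uniformis": 0.5, "E. coli": 0.5}
estimate = {"B. uniformis": 0.3, "E. coli": 0.5, "S. aureus": 0.2}
err = l1_abundance_error(truth, estimate)  # 0.2 + 0.0 + 0.2
```

A falsely reported species (here S. aureus) and an under-estimated one both count toward the error, which is why the metric is sensitive to the skew that marker-based tools can introduce.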
Article
Full-text available
Smart grids are power grids where clients may actively participate in energy production, storage and distribution. Smart grid management raises several challenges, including the possible changes and evolutions in terms of energy consumption and production, that must be taken into account in order to properly regulate the energy distribution. In this context, machine learning methods can be fruitfully adopted to support the analysis and to predict the behavior of smart grids, by exploiting the large amount of streaming data generated by sensor networks. In this paper, we propose a novel change detection method, called ECHAD (Embedding-based CHAnge Detection), that leverages embedding techniques, one-class learning, and a dynamic detection approach that incrementally updates the learned model to reflect the new data distribution. Our experiments show that ECHAD achieves optimal performance on synthetic data representing challenging scenarios. Moreover, a qualitative analysis of the results obtained on data from a real power grid confirms the quality of ECHAD's change detection. Specifically, a comparison with state-of-the-art approaches shows the ability of ECHAD to identify additional relevant changes, not detected by competitors, while avoiding false positive detections.
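ECHAD itself relies on learned embeddings and one-class models; as a much-simplified illustration of the general idea of change detection with incremental model updates, here is a toy windowed mean-shift detector (this is not the ECHAD algorithm, and all names and thresholds are invented):

```python
def detect_change(stream, window=3, threshold=2.0):
    """Flag indices where the mean of the current window drifts from the
    reference window by more than `threshold`. After a detection, the
    reference is updated incrementally to reflect the new distribution,
    mirroring (in spirit) ECHAD's dynamic model update."""
    changes = []
    ref_mean = sum(stream[:window]) / window
    for i in range(window, len(stream) - window + 1):
        cur_mean = sum(stream[i:i + window]) / window
        if abs(cur_mean - ref_mean) > threshold:
            changes.append(i)
            ref_mean = cur_mean  # adopt the new regime as reference
    return changes

readings = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9]
points = detect_change(readings)
```

The incremental update is the important part: without it, every window after the shift would keep firing against the stale reference.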
Conference Paper
Full-text available
Waste influences women's health, and the quantity of waste influences the quality of the environment. Different cooperatives produce different types and amounts of garbage, as well as viruses, bacteria, and fungi that affect the quality of life. Several forms of protection must be used, such as masks, gloves, and boots. Solid waste is a common health problem, and the analysis of different health problems is essential to verify the consequences of its inadequate management and final disposition. This paper analyses three different cooperatives regarding the prevalence of various diseases. The risk of chronic obstructive pulmonary disease was associated with sex and smoking, and was verified to be higher in smokers. Finally, the consequences of inadequate management and final disposition can be reflected in the health of the population.
Article
Full-text available
Amid obesity problems in the young population and apparent trends of spending a significant amount of time in a stationary position, promoting healthy nutrition and physical activities to teenagers is becoming increasingly important. It can rely on different methodologies, including a paper diary and mobile applications. However, the widespread use of mobile applications by teenagers suggests that they could be a more suitable tool for this purpose. This paper reviews the methodologies for promoting physical activities to healthy teenagers explored in different studies, excluding the analysis of different diseases. We found only nine studies working with teenagers and mobile applications to promote active lifestyles, focusing on nutrition and physical activity. Studies report using different techniques to engage teenagers, including questionnaires and gamification. We identified the common features used in different studies, which are: paper diary, diet diary, exercise diary, notifications, diet plan, physical activity registration, gamification, smoking cessation, pictures, game, and SMS, among others.
Article
Full-text available
Mobile health applications are applied for different purposes. Healthcare professionals and other users can use this type of mobile application for specific tasks, such as diagnosis, information, prevention, treatment, and communication. This paper presents an analysis of mobile health applications used by healthcare professionals and their patients. A secondary objective of this article is to evaluate the scientific validation of these mobile health applications and to verify whether the results provided by these applications have an underlying sound scientific foundation. This study also analyzed literature references and the use of mobile health applications available in online application stores. In general, many of these mobile health applications provide information about scientific validation. However, some mobile health applications are not validated. Therefore, the main contribution of this paper is to provide a comprehensive analysis of the usability and user-perceived quality of mobile health applications and the challenges related to scientific validation of these mobile applications.
Article
Full-text available
Background. The study of functional associations between ncRNAs and human diseases is a pivotal task of modern research to develop new and more effective therapeutic approaches. Nevertheless, it is not a trivial task since it involves entities of different types, such as microRNAs, lncRNAs or target genes whose expression also depends on endogenous or exogenous factors. Such a complexity can be faced by representing the involved biological entities and their relationships as a network and by exploiting network-based computational approaches able to identify new associations. However, existing methods are limited to homogeneous networks (i.e., consisting of only one type of objects and relationships) or can exploit only a small subset of the features of biological entities, such as the presence of a particular binding domain, enzymatic properties or their involvement in specific diseases. Results. To overcome the limitations of existing approaches, we propose the system LP-HCLUS, which exploits a multi-type hierarchical clustering method to predict possibly unknown ncRNA-disease relationships. In particular, LP-HCLUS analyzes heterogeneous networks consisting of several types of objects and relationships, each possibly described by a set of features, and extracts multi-type clusters that are subsequently exploited to predict new ncRNA-disease associations. The extracted clusters are overlapping, hierarchically organized, involve entities of different types, and allow LP-HCLUS to catch multiple roles of ncRNAs in diseases at different levels of granularity. Our experimental evaluation, performed on heterogeneous attributed networks consisting of microRNAs, lncRNAs, diseases, genes and their known relationships, shows that LP-HCLUS is able to obtain better results with respect to existing approaches. 
The biological relevance of the obtained results was evaluated according to both quantitative (i.e., TPR@k, Areas Under the TPR@k, ROC and Precision-Recall curves) and qualitative (i.e., according to the consultation of the existing literature) criteria. Conclusions. The obtained results prove the utility of LP-HCLUS to conduct robust predictive studies on the biological role of ncRNAs in human diseases. The produced predictions can therefore be reliably considered as new, previously unknown, relationships among ncRNAs and diseases.
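The TPR@k criterion mentioned above measures how many of the known positive associations are recovered among the k top-ranked predictions. A minimal sketch (the example ncRNA-disease pairs and scores are invented):

```python
def tpr_at_k(scored_pairs, known_positives, k):
    """Fraction of the known positive associations found among the
    k top-scored predictions (TPR@k)."""
    top_k = sorted(scored_pairs, key=lambda p: p[1], reverse=True)[:k]
    hits = sum(1 for pair, _ in top_k if pair in known_positives)
    return hits / len(known_positives)

preds = [(("miR-21", "cancer"), 0.9),
         (("lnc-X", "diabetes"), 0.7),
         (("miR-7", "asthma"), 0.4)]
positives = {("miR-21", "cancer"), ("miR-7", "asthma")}
score = tpr_at_k(preds, positives, k=2)  # one of the two positives ranks in the top 2
```

Sweeping k and accumulating the area under the resulting curve gives the "Area Under the TPR@k" summary the abstract refers to.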
Article
Full-text available
The success of companies hugely depends on how well they can analyze the available data and extract meaningful knowledge. The Extract-Transform-Load (ETL) process is instrumental in accomplishing these goals, but requires significant effort, especially for Big Data. Previous works have failed to formalize, integrate, and evaluate the ETL process for Big Data problems in a scalable and cost-effective way. In this paper, we propose a cloud-based ETL framework for data fusion and aggregation from a variety of sources. Next, we define three scenarios regarding data aggregation during ETL: (i) ETL with no aggregation; (ii) aggregation based on predefined columns or time intervals; and (iii) aggregation within single user sessions spanning over arbitrary time intervals. The third scenario is very valuable in the context of feature engineering, making it possible to define features as “the time since the last occurrence of event X”. The scalability was evaluated on Amazon AWS Hadoop clusters by processing user logs collected with Kinesis streams with datasets ranging from 30 GB to 2.6 TB. The business value of the architecture was demonstrated with applications in churn prediction, service-outage prediction, fraud detection, and, more generally, decision support and recommendation systems. In the churn prediction case, we showed that over 98% of churners could be detected, while identifying the individual reason. This allowed support and sales teams to perform targeted retention measures.
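The third aggregation scenario, features such as "the time since the last occurrence of event X", can be sketched as a single pass over a session log (a toy illustration with invented names, not the framework's implementation):

```python
def time_since_last_event(log, event_name):
    """For each log entry (timestamp, event), compute the time elapsed
    since the previous occurrence of `event_name`; None before its
    first occurrence."""
    features, last_seen = [], None
    for timestamp, event in log:
        features.append(timestamp - last_seen if last_seen is not None else None)
        if event == event_name:
            last_seen = timestamp
    return features

# A single user session: (seconds since session start, event type).
log = [(0, "login"), (5, "error"), (12, "click"), (20, "error")]
feats = time_since_last_event(log, "error")  # [None, None, 7, 15]
```

Because such a feature depends on an arbitrarily distant past event, it cannot be computed from fixed time buckets, which is exactly why the session-spanning aggregation scenario is singled out.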
Article
Full-text available
Background: Accumulating evidence suggests that the human microbiome impacts individual and public health. City subway systems are human-dense environments, where passengers often exchange microbes. The MetaSUB project participants collected samples from subway surfaces in different cities and performed metagenomic sequencing. Previous studies focused on the taxonomic composition of these microbiomes, and no explicit functional analysis had been done until now. Results: As a part of the 2018 CAMDA challenge, we functionally profiled the available ~400 subway metagenomes and built a predictor of city origin. In cross-validation, our model reached 81% accuracy when only the top-ranked city assignment was considered and 95% accuracy if the second city was taken into account as well. Notably, this performance was only achievable if the distribution of cities in the training and testing sets was similar. To ensure that our methods are applicable without such biased assumptions, we balanced our training data to account for all represented cities equally well. After balancing, the performance of our method was slightly lower (76/94%, respectively, for one or two top-ranked cities), but still consistently high, with the added benefit of independence from the training set's city representation. In testing, our unbalanced model thus reached (an over-estimated) performance of 90/97%, while our balanced model was at a more reliable 63/90% accuracy. While, by definition, our model was not able to predict microbiome origins from previously unseen cities, our balanced model correctly judged them to be NOT-from-training-cities over 80% of the time. Our function-based outlook on microbiomes also allowed us to note similarities between both regionally close and far-away cities. Curiously, we identified the depletion of mycobacterial functions as a signature of cities in New Zealand, while photosynthesis-related functions fingerprinted New York, Porto and Tokyo.
Conclusions: We demonstrated the power of our high-speed function annotation method, mi-faser, by analysing ~400 shotgun metagenomes in 2 days, with the results recapitulating functional signals of different city subway microbiomes. We also showed the importance of balanced data in avoiding over-estimated performance. Our results revealed similarities between both geographically close (Ofa and Ilorin) and distant (Boston and Porto, Lisbon and New York) city subway microbiomes. The photosynthesis-related functional signatures of NYC were previously unseen in taxonomy studies, highlighting the strength of functional analysis.
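The balancing step the authors emphasize, downsampling every city to the size of the rarest one so that no city dominates training, can be sketched as follows (a generic illustration with invented names, not the paper's code):

```python
import random

def balance_by_downsampling(samples, seed=0):
    """Downsample every city to the size of the rarest city so that the
    training set represents all cities equally."""
    by_city = {}
    for features, city in samples:
        by_city.setdefault(city, []).append((features, city))
    n = min(len(group) for group in by_city.values())
    rng = random.Random(seed)  # seeded for reproducible subsampling
    balanced = []
    for group in by_city.values():
        balanced.extend(rng.sample(group, n))
    return balanced

data = [("s1", "NYC"), ("s2", "NYC"), ("s3", "NYC"), ("s4", "Porto")]
balanced = balance_by_downsampling(data)  # one NYC sample, one Porto sample
```

Balancing trades a few accuracy points in cross-validation for an estimate that no longer depends on the city mix of the training set, which is exactly the effect reported above (76/94% balanced vs. an over-estimated 90/97% unbalanced).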
Chapter
The next-generation sequencing revolution has impacted biological research by allowing the collection and analysis of very large datasets. However, despite the large availability of data, current computational methods used by biologists present some limitations in challenging domains, such as extremely imbalanced datasets characterized by almost only negative examples. In this paper, we address the problem of identifying sequences from the zebra finch (songbird) germline-restricted chromosome (GRC), which is present only in reproductive tissues and missing from all other cells. Since the germline contains the GRC in addition to other chromosomes, sequencing germline DNA must be followed by separation into GRC or non-GRC sequences. The complexity of this task depends on the limited availability of known GRC sequences. In this paper, we propose a one-class ensemble learning method to solve this problem, and we compare its performance with state-of-the-art methods for one-class classification. Our results show that the proposed method is able to identify positive sequences with high accuracy, having been trained only with negative sequences, and tuned with a limited number of positive sequences. Moreover, a biological analysis revealed that positive sequences from a verified GRC gene were ranked in the top third of all the sequences, showing that our method is successful in demarcating GRC from non-GRC sequences. Our method thus represents a valuable tool for biologists, since model predictions can allow them to focus their limited resources towards the experimental validation of a subset of higher confidence sequences.
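A centroid-distance scorer gives a minimal picture of the one-class setting used above: the model is fit only on the known (non-GRC) class and ranks new sequences by how far their feature vectors fall from it. This is a toy stand-in with invented names, not the paper's ensemble method:

```python
def fit_centroid(known_class):
    """Model the known class by its per-feature centroid."""
    cols = list(zip(*known_class))
    return [sum(c) / len(c) for c in cols]

def anomaly_score(x, centroid):
    """Euclidean distance to the centroid: higher means the sequence is
    less like the known class (e.g., a candidate GRC sequence)."""
    return sum((a - b) ** 2 for a, b in zip(x, centroid)) ** 0.5

# Train only on the known class, then rank new feature vectors by score.
negatives = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
centroid = fit_centroid(negatives)
near = anomaly_score([0.1, 0.1], centroid)
far = anomaly_score([3.0, 3.0], centroid)
```

Ranking by such a score, rather than thresholding it, matches the paper's use case: biologists can spend their limited validation budget on the top-ranked candidates.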
Article
Motivation: The reconstruction of Gene Regulatory Networks (GRNs) from gene expression data has received increasing attention in recent years, due to its usefulness in the understanding of regulatory mechanisms involved in human diseases. Most of the existing methods reconstruct the network through machine learning approaches, by analyzing known examples of interactions. However, i) they often produce poor results when the amount of labeled examples is limited, or when no negative example is available and ii) they are not able to exploit information extracted from GRNs of other (better studied) related organisms, when this information is available. Results: In this paper we propose a novel machine learning method which overcomes these limitations, by exploiting the knowledge about the GRN of a source organism for the reconstruction of the GRN of the target organism, by means of a novel transfer learning technique. Moreover, the proposed method is natively able to work in the Positive-Unlabeled setting, where no negative example is available, by fruitfully exploiting a (possibly large) set of unlabeled examples. In our experiments we reconstructed the human GRN, by exploiting the knowledge of the GRN of M. musculus. Results showed that the proposed method outperforms state-of-the-art approaches and identifies previously unknown functional relationships among the analyzed genes. Availability: http://www.di.uniba.it/~mignone/systems/biosfer/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.