IIUM Engineering Journal, Vol. 20, No. 1, 2019 Parman et al.
1Bioprocess & Molecular Engineering Research Unit (BPMERU),
Department of Biotechnology Engineering, Kulliyyah of Engineering,
International Islamic University Malaysia,
2Malaysia Genome Institute, Jalan Bangi, 43000 Kajang, Selangor, Malaysia
3International Institute for Halal Research and Training (INHART),
International Islamic University Malaysia,
P.O. Box 10, 50728 Kuala Lumpur, Malaysia.
*Corresponding authors:,
(Received: 9th March 2018; Accepted: 25th Jan 2019; Published on-line: 1st June 2019)
ABSTRACT: The substantial cost reduction and massive production of next-generation
sequencing (NGS) data have contributed to the rapid growth of metagenomics. However,
the massive amount of data produced by NGS has exposed the challenges of handling it
with the existing metagenomics bioinformatics tools. Therefore, in this research we
analyzed the same set of DNA metagenomics data from a palm oil mill effluent (POME)
sample using three free web-based bioinformatics pipelines, namely the metagenomics
RAST server (MG-RAST), Integrated Microbial Genomes with Microbiome Samples (IMG/M),
and European Bioinformatics Institute (EBI) Metagenomics, in terms of taxonomic
assignment and functional analysis. We found that MG-RAST is the quickest of the three
pipelines. However, in terms of the analysis results, IMG/M provides a greater variety
of phyla with a wider range of percent identities for taxonomic assignment. IMG/M also
provides the highest functional annotation counts for carbohydrate, amino acid, lipid,
and coenzyme transport and metabolism, as well as the highest total number of glycoside
hydrolase enzymes. For identifying the conserved domains and families involved, EBI
Metagenomics is more appropriate. All three bioinformatics pipelines have their own
specialties and can be used alternately or at the same time, depending on the user’s
functional preference.
ABSTRAK: Large-scale cost reduction and the massive production of next-generation
sequencing (NGS) data have contributed to the rapid growth of metagenomics. However,
the large-scale production of data by NGS has raised challenges in handling the
existing bioinformatics tools related to metagenomics. Therefore, in this study we
investigated the same set of DNA metagenomics data from a palm oil mill effluent
sample using three free bioinformatics websites, namely the metagenomics RAST server
(MG-RAST), Integrated Microbial Genomes with Microbiome Samples (IMG/M), and European
Bioinformatics Institute (EBI) Metagenomics, in terms of taxonomy and functional
analysis. We found that MG-RAST is the fastest of the three pipelines, but according
to the analysis results, IMG/M yields more diverse phylum information with a wider
range of percent identities than the others for taxonomic assignment, and IMG/M also
has the highest readings in almost all functional annotations of carbohydrate, amino
acid, lipid, and coenzyme transport and metabolism, as well as the highest total
number of glycoside hydrolase enzymes. For identifying the conserved domains and
families involved, EBI Metagenomics is more suitable. All three bioinformatics
pipelines have their own specialties and can be used interchangeably or at the same
time based on functional preference.
KEYWORDS: bioinformatics pipeline; metagenomics analysis; MG-RAST; IMG/M; EBI;
metagenomics; palm oil mill effluent
In the last few decades, metagenomics has become one of the crucial tools for mining
the hidden microbial treasure without the use of conventional laboratory culture
techniques. Metagenomics involves the study of genetic material extracted from the
diverse microbial populations of environmental samples. Early genomics relied on
standard laboratory cultivation methods, which are insufficient to identify the entire
microbial population. Furthermore, recent developments in biotechnology, such as
inexpensive next-generation sequencing (NGS) technologies, high-throughput screening
techniques for metagenomic libraries, and advances in bioinformatics tools, have had a
huge impact on the field of metagenomics [1].
Illumina is the most widely used NGS platform in metagenomics studies. The
Illumina system has advantages over other NGS platforms in terms of its high-throughput
sequencing at an economical price with highly accurate (> 99%) reads [2]. The Illumina
platform could initially produce only short reads, but read length has gradually
improved, which has made it more popular than the other NGS platforms [2]. This fast
evolution of NGS technologies allows researchers to obtain a greater variety of data
with highly detailed sequencing results. NGS has also been developed continuously and
rapidly since its launch in 2006, resulting in the accumulation of massive amounts of
sequences [2]. Hence, several bioinformatics tools for metagenomics annotation are
needed to analyze the enormous amount of data accurately.
Palm oil mill effluent (POME) is a colloidal suspension produced as the final-stage
effluent of palm oil production. POME is composed of 95-96% water, 4-5% total solids,
and 0.6-0.7% oil [3]. In addition, raw POME contains significant concentrations of
carbohydrates, proteins, nitrogenous compounds, lipids, and minerals, which enable this
effluent to be used in various biotechnological applications such as fermentation
media and the production of antibiotics, bio-insecticides, polyhydroxyalkanoates
(PHA), organic acids, enzymes, and hydrogen [3]. The present study attempts a
comparative bioinformatics analysis of a POME metagenome using three automated
bioinformatics pipelines, MG-RAST, IMG/MER, and EBI Metagenomics, to evaluate their
accuracy in taxonomic assignment and functional annotation for future research
directed toward industrial applications.
Metagenome Rapid Annotation using Subsystem Technology (MG-RAST) is a free
web-based server with a fully automated system that provides sequence alignment, gene
prediction, structural and functional annotation, comparative metagenomics, and
archiving services [4]. It was launched at Argonne National Laboratory in 2007 to
address the computational needs of analyzing huge volumes of metagenomics data [5].
This bioinformatics pipeline is used by researchers around the world, with over
250,000 datasets comprising 100 tera-basepairs of DNA successfully analyzed to date
[6]. It also has a graphical user interface (GUI) that allows researchers to study the
composition of microbial communities together with their specific functions [6].
MG-RAST accepts raw sequence data in the fastq, fasta, and sff formats, which are then
normalized and processed by several integrated bioinformatics tools until annotation is
complete [6].
The Integrated Microbial Genomes with Microbiomes (IMG/M) system is quite similar to
MG-RAST in that it can also examine the taxonomy and the functional or metabolic
potential of microbiomes [7]. IMG/M is a metagenomics data management system supported
by the DOE-JGI Metagenome Annotation Pipeline (MAP v.4), which allowed the submission
of assembled and unassembled 454, Illumina, and PacBio nucleotide sequences in fasta or
fastq format [8]. Since early 2016, unassembled reads have no longer been accepted, and
sequence data generated outside JGI have been limited to assembled data in fasta
format. IMG/M still supports the external submission of assembled sequences only, on
the condition that the metagenome submission and its metadata are registered with the
Genomes Online Database (GOLD) version 5 [9].
European Bioinformatics Institute (EBI) Metagenomics is an expanding
metagenomics analysis and archiving resource that uses the European Nucleotide Archive
(ENA) data scheme developed by the European Molecular Biology Laboratory (EMBL).
ENA is required for the initial submission and for long-term archiving of the data for
future reuse [10]. EBI Metagenomics is also a free web-based server that enables users
to analyze large-scale metagenomic sequence data from the Ion Torrent, Roche 454, and
Illumina platforms [11]. Like MG-RAST and IMG/M, EBI Metagenomics has an established,
standardized analysis pipeline that includes a variety of analytical and visualization
tools for generating taxonomic and functional analyses of user-submitted sequences [12].
A comparison of MG-RAST with Quantitative Insights into Microbial Ecology (QIIME)
based on the 16S rRNA method found QIIME to be more accurate than MG-RAST in terms of
taxonomic assignment [13]. When MG-RAST was compared with QIIME and with MOTHUR as an
additional bioinformatics tool, QIIME was the fastest of the three [14]. QIIME was the
tool used by EBI Metagenomics Version 3.0 for taxonomic annotation, and it has been
replaced by MAPseq in EBI Metagenomics Version 4.0. Even though QIIME produced better
results in previous research, it lacks the facilities to manage, store, and analyze
metagenomics data. In this work, we provide a comparative analysis of metagenomics
data using fully automated bioinformatics pipelines that integrate the management,
analysis, storage, and sharing of metagenomics projects [15], instead of analyzing
16S rRNA data only.
The three bioinformatics pipelines, namely MG-RAST, IMG/M, and EBI
Metagenomics, were chosen because, among the existing web resources for metagenomic
studies, they are highly rated in terms of ease of data uploading, availability of
online user support, analysis spectrum, citations, and storage capacity [16].
Hence, this research focuses only on metagenomics analysis by these three
bioinformatics pipelines: MG-RAST, IMG/MER, and EBI Metagenomics. These three
tools, especially MG-RAST, have been used repeatedly to analyze metagenome
sequencing datasets from a variety of sources. In the present work, we compare the
analysis results for the same input data from these three common web-based automated
bioinformatics pipelines to evaluate their accuracy in taxonomic and functional
annotation of the microbial diversity and of several functional genes contained in
POME.
2.1 Collection of Samples and Creation of Metagenomics Libraries
POME samples were collected from FELDA Palm Industries Sdn. Bhd. at KKS
Mempaga, Pahang, Malaysia. After sample collection, the metagenomic DNA was
extracted using the Meta-G-Nome™ DNA Isolation Kit (Epicentre, U.S.), sheared,
end-repaired, and ligated to the pCC1FOS fosmid vector. The ligated DNA was introduced
by phage-mediated transfection into the surrogate host, EPI300 T1-resistant E. coli,
and the resulting clones were grown on LB agar with the recommended antibiotic. The
libraries (>100,000 clones) were constructed by preparing 384-well transparent
microplates containing LB broth and 20% glycerol; each colony was inoculated from the
LB agar into the microplates, which were stored at -80 °C.
2.2 High Throughput Screening and Next-Generation Sequencing (NGS)
Colonies from each library plate were inoculated into 384-well microplates filled
with LB broth containing antibiotic and inducer, keeping their positions in the library
plates. Screening buffer (potassium acetate) and lysis mix (10% Triton X-100, 100 mM
Tris, and 10 mM EDTA) were added to break open the cells, and the fluorogenic
substrates methylumbelliferyl-β-D-glucopyranoside (MUGlc),
methylumbelliferyl-β-D-cellobioside (MUC), and chlorocoumarin-xylobioside (CCX) were
added to each well. After overnight incubation, the plates were screened using a
microplate reader for the presence of cellulose- and xylan-degrading enzymes. The
relative fluorescence units given by the microplate reader were converted to robust
z-scores to select the highest-rated hits. The robust z-score was calculated for each
microplate independently, and the results of all plates (109,834 clones) were combined
to select the 100 highest-rated clones [17,18], which were sent for Illumina HiSeq2000
next-generation sequencing (NGS) at the Malaysia Genome Institute (MGI), Bangi,
Selangor.
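The hit-selection step can be illustrated with a minimal Python sketch. It assumes
the common definition of the robust z-score, (x - median) / (1.4826 x MAD), where MAD
is the median absolute deviation of the plate; the function names, plate layout, and
fluorescence values are hypothetical and are not taken from the screening data of
this study.

```python
from statistics import median


def robust_z_scores(values):
    """Return robust z-scores: (x - median) / (1.4826 * MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust z-score is undefined for this plate")
    scale = 1.4826 * mad  # 1.4826 makes MAD comparable to a standard deviation
    return [(v - med) / scale for v in values]


def top_hits(plates, n_hits):
    """Score each plate independently, pool all scores, keep the n_hits highest."""
    pooled = []
    for plate_id, wells in plates.items():
        well_ids = list(wells)
        scores = robust_z_scores([wells[w] for w in well_ids])
        pooled.extend((score, plate_id, w) for w, score in zip(well_ids, scores))
    pooled.sort(reverse=True)
    return pooled[:n_hits]


# Hypothetical relative fluorescence units for two small plates.
example_plates = {
    "plate_1": {"A1": 100.0, "A2": 110.0, "A3": 95.0, "A4": 900.0, "A5": 105.0},
    "plate_2": {"A1": 300.0, "A2": 310.0, "A3": 290.0, "A4": 305.0, "A5": 650.0},
}

hits = top_hits(example_plates, n_hits=2)
for score, plate_id, well in hits:
    print(plate_id, well, round(score, 2))
```

Scoring each plate against its own median and MAD keeps plate-to-plate differences in
background fluorescence from dominating the pooled ranking.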
2.3 Post-Processing of NGS Data
The raw NGS data in fastq format underwent sequence quality trimming
and short-read removal using the SolexaQA++ program [19]. The fosmid vector and internal
control phiX sequences were removed using bowtie2 [20]. The high-quality sequences
were then assembled into contigs with Velvet, which is based on the de Bruijn graph
algorithm. Velvet was developed to merge very short reads, in combination with read
pairs, into useful assemblies [21].
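The idea behind de Bruijn graph assembly can be shown with a toy Python sketch. This
is not Velvet's implementation, which also handles sequencing errors, reverse
complements, and read pairs; the reads, the value of k, and the function names are
illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from start while each node has one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in visited:
            break  # cycle: stop to avoid looping forever
        visited.add(node)
        contig += node[-1]
    return contig


# Three hypothetical overlapping short reads.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

In this toy case the overlapping k-mers of the three reads chain into a single path,
so the walk reconstructs one contig that spans all of them.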
2.4 Metagenomics Analysis using Three Bioinformatics Pipelines
The same dataset of assembled POME NGS data in fasta format was
uploaded to the three automated bioinformatics pipelines: MG-RAST, IMG/M, and EBI
Metagenomics. These free web-based bioinformatics pipelines ran automatically after
submission of the file with its related metadata. Details of the uploading method,
databases, and systems involved are summarized in Table 1.
2.4.1 MG-RAST
The fasta file and metadata were uploaded in the upload segment of MG-RAST; the
metadata was prepared using a Microsoft Excel template prior to submission. Throughout
the process, the percentage of analysis completed could be seen in the progress
segment. In MG-RAST, the file first underwent quality control for data hygiene before
proceeding to feature identification of coding DNA sequences (CDS) using
FragGeneScan. This tool can predict DNA coding regions longer than 75 bp. The
similarity searches for taxonomic classification in MG-RAST were conducted using the
BLAST-Like Alignment Tool (BLAT), which is used to find sequence hits in seven
taxonomic categories. For functional annotation, the m5nr database was used, which
provides a non-redundant integration of several databases such as SEED, KEGG, GenBank,
IMG, UniProt, and eggNOG [6].
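Table 1: Technical comparison of MG-RAST, IMG/M, and EBI Metagenomics
[Table body only partly recoverable. Surviving entries: column header "EBI Metagenomics"; processing times 5 h 21 mins, 5 days 22 h 46 mins, and 3 days; "Pipeline Version 4.0"; row "CDS Prediction"; database entries "UniProt, IMG/M, COGs, eggNOGs" and "COGs, Pfam, KO, EC, MetaCyc, KEGG"; rows "Comparing CDS using" and "Comparing rRNA using".]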
2.4.2 IMG/M
To submit our own data for metagenomics analysis in IMG/M, we used the Integrated
Microbial Genomes with Microbiomes Expert Review system (IMG/MER). IMG/M is
associated with GOLD v.5 and the DOE’s Joint Genome Institute (JGI). The initial step
is to log in to IMG/MER and register the project metadata at GOLD prior
to uploading the data at JGI. After data submission is complete, quality control (QC)
pre-processing, such as trimming and removing sequences shorter than 150 bp, takes
place. Next, CRISPR elements, tRNAs, rRNAs, and CDS are predicted using several tools
such as GeneMark, Prodigal, MetaGeneAnnotator, and FragGeneScan. Finally, functional
annotation is completed by associating the protein-coding genes with COG, Pfam, KO,
EC, MetaCyc, and KEGG [8].
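The length-based part of the QC step can be sketched in a few lines of Python. This
is only an illustration of removing sequences shorter than 150 bp from fasta input,
not the JGI pipeline's own code; the parser, function names, and example contigs are
hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=150):
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Hypothetical contigs of 200 bp, 80 bp, and 150 bp (split over two lines).
example = [
    ">contig_1", "ACGT" * 50,
    ">contig_2", "ACGT" * 20,
    ">contig_3", "A" * 100, "C" * 50,
]

kept = filter_by_length(parse_fasta(example))
print([header for header, _ in kept])
```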
2.4.3 EBI Metagenomics (EMG)
To analyze the POME NGS data using EBI Metagenomics, the data file
must be uploaded to the European Nucleotide Archive (ENA) using either Webin
Uploader, FileZilla Client-Server, or Aspera. In this research, FileZilla Client-Server
was used because it is easier to use than the other methods and is widely used by
researchers. After QC and filtering out of ncRNA reads, protein CDS were
predicted using FragGeneScan for short reads and Prodigal for longer reads or assembled
sequences. In EBI Metagenomics, MAPseq was used for taxonomic classification by
assigning taxonomy and OTU classification to rRNA sequences. InterProScan was
used for functional annotation; it predicts domains and classifies them into
families using a compilation of several databases such as Pfam, TIGRFAM, PRINTS,
PROSITE patterns, and Gene3D. Finally, the functional results were reported as Gene
Ontology (GO) terms.
To ensure the same type of dataset for comparison, the assembled file in fasta format
was selected for this research. Assembled data were used instead of the raw sequences
because one of the bioinformatics pipelines, IMG/M, cannot process unassembled reads
generated outside the Joint Genome Institute (JGI) [9]. The data were submitted to
obtain the taxonomic and functional annotation of the POME DNA metagenome sample.
Specifically, the distributions of the major phyla and genera in POME represent the
taxonomic classification, while the major functional annotations and the glycoside
hydrolase enzymes found in the POME metagenomic DNA represent the functional
annotation part of this research. Overall, MG-RAST was the quickest, finishing in
about 5 hours 20 minutes (Table 1), compared with IMG/M and EBI Metagenomics. In a
previous study, QIIME was quicker than MG-RAST and MOTHUR [14]. In this research,
however, MG-RAST was quicker even though QIIME was included in the EBI Metagenomics
Version 3.0 system. In that system, QIIME is not a stand-alone tool; EBI
Metagenomics also runs other bioinformatics tools such as FragGeneScan, Prodigal, and
InterProScan, which explains why EBI Metagenomics, even with QIIME, is slower than
MG-RAST.
Figure 1(a) shows the distribution of the major phyla present in the effluent of the
FELDA palm oil mill, Mempaga, Pahang, as reported by each bioinformatics pipeline. All
three pipelines report the same dominant phyla, Proteobacteria and Firmicutes: 73.7%
and 25.3% in MG-RAST, 48% and 28% in IMG/M, and 67% and 33% in EBI Metagenomics,
respectively. The difference is that IMG/M ER reports additional dominant groups:
Actinobacteria (12%), dsDNA viruses (5%), and Bacteroidetes (4%). In terms of phylum
composition, IMG/M ER therefore appears more diverse and detailed than the other two
pipelines. MG-RAST uses the BLAT algorithm, while IMG/M ER uses BLAST, the most
commonly used bioinformatics tool for sequence similarity analysis. In addition, IMG/M
reports hits with percent identities greater than 30%, which yields a larger number of
results, whereas MG-RAST is set to report only percent identities of 60% and above.
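One way to put a number on "more diverse" is the Shannon index, H = -sum(p_i ln p_i),
computed here in Python from the dominant-group percentages quoted above. This is an
illustrative calculation only: it was not part of the original analysis, it ignores
minor groups not listed in the text, and it renormalizes each pipeline's reported
percentages to sum to one.

```python
import math

# Dominant-group percentages as quoted in the text for each pipeline.
reported_phyla = {
    "MG-RAST": {"Proteobacteria": 73.7, "Firmicutes": 25.3},
    "IMG/M": {
        "Proteobacteria": 48.0,
        "Firmicutes": 28.0,
        "Actinobacteria": 12.0,
        "dsDNA viruses": 5.0,
        "Bacteroidetes": 4.0,
    },
    "EBI Metagenomics": {"Proteobacteria": 67.0, "Firmicutes": 33.0},
}


def shannon_index(abundances):
    """Shannon index (natural log) over abundances renormalized to sum to 1."""
    total = sum(abundances.values())
    proportions = [v / total for v in abundances.values() if v > 0]
    return -sum(p * math.log(p) for p in proportions)


shannon = {name: shannon_index(phyla) for name, phyla in reported_phyla.items()}
for name, h in sorted(shannon.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: H = {h:.2f}")
```

Under these assumptions IMG/M has the highest index, which agrees with the qualitative
observation that its phylum profile is the most diverse of the three.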
For the genus distribution, the comparison involves only MG-RAST and
IMG/MER, as shown in Fig. 1(b). EBI Metagenomics was excluded because no detailed
genus distribution could be detected except for Staphylococcus. The latest version,
EBI Metagenomics Pipeline 4.0, uses the MAPseq framework for taxonomic classification,
which allows reference-based analysis of rRNA only. Because only a small amount of
16S rRNA could be found in these sequences, the output was limited to Staphylococcus.
In contrast, MG-RAST and IMG/M use not only rRNA but also protein coding DNA
sequences (CDS) as references, so thousands of CDS can be found in the data.
Fig. 1: (a) Phyla distribution of palm oil mill effluent (POME) computed by each tool and
(b) Major genus distribution in POME analyzed by MG-RAST and IMG/M only.
[Figure 1 axes: (a) Phyla Relative Abundance (%); (b) Taxonomic Hits (Genus) against Genes Annotated (x10^3).]
The major genera found by MG-RAST in Fig. 1(b) are the same as those found by
IMG/MER: Pseudomonas, Staphylococcus, Serratia, Escherichia, Burkholderia, Yersinia,
and Salmonella. The same figure shows that the number of genes annotated for every
major genus is greater in MG-RAST than in IMG/M, and the total number of genera found
by MG-RAST is also greater than that found by IMG/M. Soleimaninanadegani and Manshad’s
work on POME reported the genera Bacillus, Micrococcus, Pseudomonas, and
Staphylococcus in a POME sample [22]. That finding is consistent with Pseudomonas and
Staphylococcus being among the major genera found in POME here, while the other genera,
such as Serratia, Burkholderia, Yersinia, and Salmonella, are new findings of this
research. In addition, all of these genera belong to the phyla Firmicutes and
Proteobacteria, which is consistent with these two phyla being the main ones present
in POME.
For functional annotation, several major categories can be extracted from the
results of each pipeline: transport and metabolism of nucleotides, lipids, inorganic
ions, coenzymes, carbohydrates, and amino acids, as shown in Fig. 2. Raw POME contains
high concentrations of carbohydrates, proteins, nitrogenous compounds, lipids, and
minerals [23], which explains why lipid, carbohydrate, amino acid, inorganic ion, and
coenzyme metabolism are among the categories with the most annotated reads in all
three pipelines. Once again, IMG/M has the highest number of annotated reads in nearly
all the functional categories.
Fig. 2: Major functional annotation of palm oil mill effluent (POME)
analyzed by MG-RAST, IMG/M, and EBI.
All the glycoside hydrolase enzymes identified by each pipeline are listed in Table 2
to support future research on useful glycoside hydrolases from POME. The POME
glycoside hydrolase enzymes identified by the bioinformatics tools suggest that POME
can be a potential local source of novel enzymes for industrial applications. The
blank entries for EBI Metagenomics in Table 2 probably occurred because EBI
Metagenomics uses InterProScan for functional annotation. InterProScan is a tool that
classifies CDS into their respective families and predicts important domains and sites
in the protein sequence [24]. Therefore, the output of EBI Metagenomics relates more to
predicted domains and families than to protein or enzyme names.
[Figure 2 axes: Functional Annotation categories (amino acid, carbohydrate, coenzyme, inorganic ion, and lipid transport and metabolism) against a count scale from 0 to 1600.]
Table 2: Total number of glycoside hydrolase enzymes (EC 3.2.1.-) in palm oil mill
effluent (POME) identified by MG-RAST, IMG/M ER, and EBI Metagenomics
Enzyme Name
Across the taxonomic and functional annotation results for the palm oil mill
effluent (POME) DNA metagenomics sample, IMG/M provided better analysis results than
the other pipelines. IMG/M gave a more diverse phylum result because it covers a much
wider range of percent identity. This pipeline also had the highest counts in nearly
all major functional annotations (carbohydrate, amino acid, lipid, and coenzyme
transport and metabolism) as well as the highest total number of glycoside hydrolase
enzymes found in POME. On the other hand, MG-RAST and EBI Metagenomics have their own
specialties. Both allow the submission of raw data, whereas IMG/M is limited to
assembled data only. MG-RAST is also good for obtaining analysis results in a short
period because it uses BLAT, which is reported to be 500 times faster for nucleic acid
sequence alignment and 50 times faster for protein sequence alignment than other
popular existing tools, including the one used in the IMG/M and EBI Metagenomics
servers [25]. EBI Metagenomics is more suitable for a deeper understanding of the
protein families and domains involved in a specific sequence than for taxonomic
annotation. In conclusion, all three bioinformatics pipelines have their own
specialties, which researchers need to know in detail so that the appropriate pipeline
can be chosen based on individual functional interest.
The authors would like to thank the Ministry of Education, Malaysia, for financial
support through the Fundamental Research Grant Scheme (FRGS), project ID FRGS
13-086-0327. The authors are also deeply grateful to the Malaysia Genome Institute
(MGI) for technical assistance in analyzing the metagenomic data of POME. The
authors would also like to express their appreciation to Professor S.G. Withers and Dr
Hong Ming (Chemistry Department, University of British Columbia, Vancouver, Canada)
for providing the substrates used for the high-throughput screening in this work.
[1] Kumar S, Krishnani KK, Bhushan B, and Brahmane MP. (2015) Metagenomics: Retrospect
and Prospects in High Throughput Age, vol. 2015.
[2] Mincheol K, Lee KH, Yoon SW, Kim BS, Chun J, and Hana Y. (2013) Analytical Tools and
Databases for Metagenomics in the Next-Generation Sequencing Era, 11(3): 102-113.
[3] Abdullah N and Sulaiman F. (2013) The oil palm wastes in Malaysia, Biomass Now –
Sustain. Growth Use, pp. 75-100.
[4] Meyer F, Paarmann D, Souza MD, Olson R, Glass EM, Kubal M, Edwards RA. (2008) The
metagenomics RAST server – a public resource for the automatic phylogenetic and
functional analysis of metagenomes, 8: 1-8.
[5] Tang W, Bischof J, Desai N, Mahadik K, Gerlach W, Harrison T, Meyer F. (2014)
Workload characterization for MG-RAST metagenomic data analytics service in the cloud.
2014 IEEE International Conference on Big Data (Big Data), 56-63.
[6] Wilke A, Gerlach W, Harrison T, Paczian T, Trimble WL, & Meyer F. (2016) MG-RAST
Manual for version 4, revision 1.
[7] Markowitz VM, Chen IMA, Chu K, Szeto E, Palaniappan K, Pillay M, Kyrpides NC. (2014)
IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids
Research, 42(D1): 568-573.
[8] Huntemann M, Ivanova NN, Mavromatis K, Tripp HJ, Paez-Espino D, Tennessen K,
Kyrpides NC. (2016) The standard operating procedure of the DOE-JGI Metagenome
Annotation Pipeline (MAP v.4). Standards in Genomic Sciences, 1: 1-5.
[9] Chen IMA, Markowitz VM, Chu K, Palaniappan K, Szeto E, Pillay M, Kyrpides NC. (2017)
IMG/M: Integrated genome and metagenome comparative data analysis system. Nucleic
Acids Research, 45(D1): D507-D516.
[10] Hunter S, Corbett M, Denise H, Fraser M, Gonzalez-beltran A, Hunter C, Sansone S. (2014)
EBI metagenomics — a new resource for the analysis and archiving of metagenomic data,
42: 600-606.
[11] Denise H, Mitchell A, Bucchini F, Cochrane G, Denise H, Hoopen P, Finn RD. (2016) EBI
metagenomics in 2016 - An expanding and evolving resource for the analysis and archiving
of metagenomic data, (November 2015).
[12] Mitchell AL, Scheremetjew M, Denise H, Potter S, Tarkowska A, Qureshi M, Finn RD.
(2017) EBI Metagenomics in 2017: Enriching the analysis of microbial communities, from
sequence reads to assemblies. Nucleic Acids Research, 46(D1): D726-D735.
[13] D’Argenio V, Casaburi G, Precone V, & Salvatore F. (2014) Comparative metagenomic
analysis of human gut microbiome composition using two different bioinformatic pipelines.
Biomed Res Int, 325340.
[14] Plummer E, Twin J. (2015) A Comparison of Three Bioinformatics Pipelines for the
Analysis of Preterm Gut Microbiota using 16S rRNA Gene Sequencing Data. J. Proteomics
& Bioinformatics, 8(12): 283–291.
[15] Oulas A, Pavloudi C, Polymenakou P, Pavlopoulos GA, Papanikolaou N, Kotoulas G,
Iliopoulos I. (2015) Metagenomics: Tools and Insights for Analyzing Next-Generation
Sequencing Data Derived from Biodiversity Studies. Bioinformatics and Biology Insights, 9:
[16] Dudhagara P, Bhavsar S, Bhagat C, Ghelani A, Bhatt S, & Patel R. (2015) Web Resources
for Metagenomics Studies, 13: 296–30.
[17] Malo N, Hanley JA, Cerquozzi S, Pelletier J, & Nadon R. (2006) Statistical practice in high-
throughput screening data analysis. Nature Biotechnology, 24(2): 167-175.
[18] Birmingham A, Selfors LM, Forster T, Wrobel D, Kennedy CJ, Shanks E, Shamu CE.
(2009) Statistical methods for analysis of high-throughput RNA interference screens. Nat
Methods, 6(8): 569-575.
[19] Cox MP, Peterson DA, & Biggs PJ. (2010) SolexaQA: At-a-glance quality assessment of
Illumina second-generation sequencing data. BMC Bioinformatics, 11: 485.
[20] Langmead B, & Salzberg SL. (2013) Fast gapped-read alignment with Bowtie 2. Nature
Methods, 9(4): 357-359.
[21] Zerbino DR, Birney E. (2008) Velvet Manual Version 1.1. Genome Research, 18(5): 821-
[22] Soleimaninanadegani M, Manshad S. (2014) Enhancement of Biodegradation of Palm Oil
Mill Effluents by Local Isolated Microorganisms. International Scholarly Research Notices,
[23] Habib MAB, Yusoff FM, Phang SM., Ang KJ, & Mohamed S. (1997) Nutritional values of
chironomid larvae grown in palm oil mill effluent and algal culture. Aquaculture, 158(1-2):
[24] Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, Mitchell AL. (2017)
InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Research,
45(D1): D190-D199.
[25] Kent WJ. (2002) BLAT — The BLAST -Like Alignment Tool. Genome Research, 12: 656-
Full-text available
InterPro ( is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.
Full-text available
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases.
Full-text available
EBI metagenomics ( is a freely available hub for the analysis and archiving of metagenomic and metatranscriptomic data. Over the last 2 years, the resource has undergone rapid growth, with an increase of over five-fold in the number of processed samples and consequently represents one of the largest resources of analysed shotgun metagenomes. Here, we report the status of the resource in 2016 and give an overview of new developments. In particular, we describe updates to data content, a complete overhaul of the analysis pipeline, streamlining of data presentation via the website and the development of a new web based tool to compare functional analyses of sequence runs within a study. We also highlight two of the higher profile projects that have been analysed using the resource in the last year: the oceanographic projects Ocean Sampling Day and Tara Oceans.
Full-text available
Advances in next-generation sequencing (NGS) have allowed significant breakthroughs in microbial ecology studies. This has led to the rapid expansion of research in the field and the establishment of “metagenomics”, often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Many metagenomics statistical/computational tools and databases have been developed in order to allow the exploitation of the huge influx of data. In this review article, we provide an overview of the sequencing technologies and how they are uniquely suited to various types of metagenomic studies. We focus on the currently available bioinformatics techniques, tools, and methodologies for performing each individual step of a typical metagenomic dataset analysis. We also provide future trends in the field with respect to tools and technologies currently under development. Moreover, we discuss data management, distribution, and integration tools that are capable of performing comparative metagenomic analyses of multiple datasets using well-established databases, as well as commonly used annotation standards.
Full-text available
This study was designed to investigate the microorganisms associated with palm oil mill effluent (POME) in Johor Bahru state, Malaysia. Biodegradation of palm oil mill effluents (POME) was conducted to measure the discarded POME based on physicochemical quality. The bacteria that were isolated are Micrococcus species, Bacillus species, Pseudomonas species, and Staphylococcus aureus, while the fungi that were isolated are Aspergillus niger, Aspergillus fumigatus, Candida species, Fusarium species, Mucor species, and Penicillium species. The autoclaved and unautoclaved raw POME samples were incubated for 7 days and the activities of the microorganisms were observed each 12 hours. The supernatants of the digested POME were investigated for the removal of chemical oxygen demand (COD), color (ADMI), and biochemical oxygen demand (BOD) at the end of each digestion cycle. The results showed that the unautoclaved raw POME sample degraded better than the inoculated POME sample and this suggests that the microorganisms that are indigenous in the POME are more effective than the introduced microorganisms. This result, however, indicates the prospect of isolating indigenous microorganisms in the POME for effective biodegradation of POME. Moreover, the effective treatment of POME yields useful products such as reduction of BOD, COD, and color.
The Integrated Microbial Genomes with Microbiome Samples (IMG/M: system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.
Objective and Methods: Analysis of massive parallel sequencing 16S rRNA data requires the use of sophisticated bioinformatics pipelines. Several pipelines are available, however there is limited literature available comparing the features, advantages and disadvantages of each pipeline. This makes the choice of which method to use often unclear. Using gut microbial read data collected from a cohort of very preterm babies, we compared three pipelines commonly used for 16S rRNA gene analysis: MetaGenome Rapid Annotation using Subsystem Technology (MG-RAST), Quantitative Insights into Microbial Ecology (QIIME) and mothur. Using primarily default parameters, the three pipelines were compared in terms of taxonomic classification, diversity analysis and usability. Results: Overall, the three pipelines detected the same phylum in similar abundances (P>0.05). A difference was observed between the pipelines in terms of taxonomic classification of genera from the Enterobacteriaceae family, specifically Enterobacter and Klebsiella (P<0.0001 and P=0.0026 respectively). We found the analysis time to be quickest with QIIME compared to mothur and MG-RAST (approximately 1 hour as compared to 10 hours and 2 days respectively). Conclusion: This study showed that QIIME, mothur and MG-RAST produce comparable results and that regardless of which pipeline or algorithm is selected for the analysis of 16S rRNA gene sequencing data you are likely to generate a reliable high-level overview of sample composition when analysing faecal samples. The differences we observed at the genus level highlight that a key limitation of using 16S rRNA gene analysis for genus and species level classification is that related bacterial species may be indistinguishable due to near identical 16S rRNA gene sequences. This is important to keep in mind when analysing 16S rRNA gene sequencing data.
Conference Paper
The cost of DNA sequencing has plummeted in recent years. The consequent data deluge has imposed big burdens for data analysis applications. For example, MG-RAST, a production open-public metagenome annotation service, has experienced increasingly large amount of data submission and has demanded scalable resources for the computational needs. To address this problem, we have developed a scalable platform to port MG-RAST workloads into the cloud, where elastic computing resources can be used on demand. To efficiently utilize such resources, however, one must understand the characteristics of the application workloads. In this paper, we characterize the MG-RAST workloads running in the cloud, from the perspectives of computation, I/O, and data transfer. Insights from this work will help guide application enhancement, service operation, and resource management for MG-RAST and similar big data applications demanding elastic computing resources.