Published online 6 May 2009 Nucleic Acids Research, 2009, Vol. 37, Web Server issue W23–W27
doi:10.1093/nar/gkp265
BioMart Central Portal—unified access to biological data
Syed Haider (1,2), Benoit Ballester (1), Damian Smedley (1), Junjun Zhang (3), Peter Rice (1) and Arek Kasprzyk (3,*)
(1) EMBL-European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, (2) Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK and (3) Ontario Institute for Cancer Research, MaRS Centre, 101 College Street, Toronto M5G 0A3, Canada
Received March 4, 2009; Revised and Accepted April 8, 2009
ABSTRACT
BioMart Central Portal (www.biomart.org) offers a one-stop shop solution to access a wide array of biological databases. These include major biomolecular sequence, pathway and annotation databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE; for a complete list, visit http://www.biomart.org/biomart/martview. Moreover, the web server features seamless data federation, making it possible to cross-query these data sources in a user-friendly and unified way. The web server not only provides access through a web interface (MartView), it also supports programmatic access through a Perl API as well as RESTful and SOAP-oriented web services. The website is free and open to all users and there is no login requirement.
INTRODUCTION
The advancements in sequencing technologies and the subsequent growth in the repertoire of biological information are posing serious data-management challenges. The volume of these data is expected to continue to grow exponentially. Projects such as GenBank (1), HapMap (2) and the SNP Consortium are prime examples of the high-throughput data-management challenges that we are experiencing. Querying different biological data sources in an integrated manner generally involves moving all the data into a centralized data warehouse, necessitating substantial resources for keeping it up to date with the component data sources. New generation sequencing projects such as the 1000 Genomes Project and the International Cancer Genome Consortium (ICGC) are expected to produce data on an unprecedented scale. Moving this type of data into a central location for integrated querying with other resources presents considerable organizational and physical transfer challenges.
One solution to this challenge lies in federated databases, whereby individual data providers are responsible for updates and release cycles. The federated model eliminates the need to aggregate and manage all the data in any one central location. Another dimension of this problem is the provision of fast and robust access to such large quantities of data: how do we bring these data to end-users without exposing back-end issues such as discovering the repository location, retrieving the information and merging it with other datasets to support cross querying, which is often required in biological queries? Lastly, the results returned from these databases must be in standard formats and, where possible, semantically annotated to ensure interoperability with other databases and tools. The Distributed Annotation System (DAS) (3) and BioMart (4) are functional examples of such frameworks.
The BioMart software system offers a generic framework for biological data storage and retrieval, particularly suited for large-scale ‘omics data, through a single point of access. The web server, BioMart Central Portal, provides access to a variety of datasets that can be queried independently or in a federated way, enabling users to ask complex questions over data sources that may be located at different geographical locations. These include Ensembl genomic, Uniprot protein, Reactome pathway, HGNC gene name, Wormbase genomic and PRIDE proteomic data (5–10). As of March 2009, BioMart Central Portal brings together an extensive range of databases (see Figure 1), serving more than 100 datasets with an average monthly usage of over 1 million server hits (see Supplementary Table S1). Furthermore, the web server provides complete access to metadata that can be used by third-party client writers to emulate functionality offered by the BioMart Central Portal as per their domain requirements.
We believe that this service will be of enormous benefit to many users and deployers, ranging from wet-lab biologists to computer scientists working in bioinformatics setups.
*To whom correspondence should be addressed. Email: arek.kasprzyk@oicr.on.ca
© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
BIOMART CENTRAL PORTAL
The BioMart Central Portal is a web server interface to the BioMart software. It provides a unified view over disparate data sources, enabling bioscientists to retrieve data from one or multiple sources in a simple and efficient way. The library behind the web server handles user requests and takes responsibility for fetching data from the respective locations, aggregating the results and formatting them in the specified format. Figure 2 describes the high-level system architecture and the data flow. A query to the BioMart Central Portal consists primarily of three simple abstractions: Dataset, Filters and Attributes. The Dataset is the logical boundary of the query, Filters (optional) are the inputs and Attributes are the user-specified outputs. The BioMart Central Portal handles queries from several interfaces, all of which use these three abstractions in a coherent way. These interfaces are:
Perl API
Web interface (MartView)
URL based access
RESTful web service (MartService)
SOAP web service (MartServiceSoap)
DAS server
All the query interfaces are written in Perl. A detailed description of usage and query formulation is given in (11) and in the project documentation available at www.biomart.org/install.html.
In the sections to follow, we describe access to BioMart Central Portal through its web service end-point, MartServiceSoap. BioMart queries fall into two fundamental types: metadata access and data access. A machine-readable, XML-based description of the inputs and outputs of these queries is published in Web Services Description Language (WSDL) and XML Schema Definition (XSD) files available at http://www.biomart.org/biomart/martwsdl and http://www.biomart.org/biomart/martxsd.
Metadata Access
Figure 1. List of databases available through BioMart Central Portal (March 2009).
Figure 2. The schematic representation of BioMart Central Portal.
These requests are used to retrieve information about which databases, datasets, filters, attributes and associated formatters are made available by BioMart Central Portal. These queries not only support programmatic access, they also return additional information that may be used to write domain-specific specialized clients to access BioMart Central Portal remotely. These requests are described as follows:
getRegistry. This request retrieves information such as name, location, host and port for all the databases/marts available at BioMart Central Portal. The output is equivalent to the list displayed by MartView (see Figure 1).
getDatasets. This request retrieves the list of datasets available under a mart; the mart name is the input of the request.
getFilters and getAttributes. These two requests retrieve the list of all filters and attributes available for a given dataset. Additional information about hierarchy, limitations and output formatters is also returned. Most importantly, the W3C-suggested property ‘modelReference’ in the output, if configured by the data publisher, provides the Uniform Resource Identifier (URI) of the concept in an ontology that describes the output attribute(s). This feature offers a framework for semantic annotation of terms in BioMart databases and will improve interoperability of BioMart results with non-BioMart data sources and analysis tools.
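The four metadata requests can be illustrated with a short Python sketch that builds request URLs for a RESTful end-point. This is only an illustration under stated assumptions: the base URL path and the `type`, `mart` and `dataset` query parameters are assumed conventions and are not specified in this article, and the mart and dataset names used with it are placeholders.

```python
from urllib.parse import urlencode

# Assumed base URL of the RESTful MartService end-point (not given in the text).
MARTSERVICE_URL = "http://www.biomart.org/biomart/martservice"

# Assumed REST counterparts of getRegistry, getDatasets, getFilters and
# getAttributes, with the input each one needs.
REQUIRED_PARAMS = {
    "registry": [],
    "datasets": ["mart"],
    "filters": ["dataset"],
    "attributes": ["dataset"],
}


def metadata_url(request_type, **params):
    """Build the URL for one metadata request, validating its inputs."""
    if request_type not in REQUIRED_PARAMS:
        raise ValueError(f"unknown metadata request: {request_type}")
    missing = [p for p in REQUIRED_PARAMS[request_type] if p not in params]
    if missing:
        raise ValueError(f"{request_type} request needs: {', '.join(missing)}")
    query = {"type": request_type}
    query.update(params)
    return MARTSERVICE_URL + "?" + urlencode(query)
```

A client would typically call `metadata_url("registry")` first, then walk down from marts to datasets to filters and attributes, mirroring the order in which the requests are described above.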
Data Access
To access the biological content of the marts available through the BioMart web server, a query request is used. Figure 3a illustrates an example query in MartServiceSoap format that spans two datasets (Ensembl Homo Sapiens and Reactome Pathways) residing at different locations (Sanger and CSHL). The query finds the alleles in genes involved in the regulation of DNA replication. A user specifies the attributes of interest, along with any limitations (filters), from one or more datasets and in return gets results as shown in Figure 3b. Users are expected to know neither the database-specific access protocol nor the physical location of the data. From a user’s point of view, all datasets appear to reside at BioMart Central Portal, which takes care of all the underlying federation logic.
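To make the Dataset, Filters and Attributes abstractions concrete, the following Python sketch serializes a two-dataset query as XML. It is not the SOAP envelope of Figure 3a: the element layout is an assumed BioMart-style query document, and every dataset, filter and attribute name in the example is a hypothetical placeholder.

```python
import xml.etree.ElementTree as ET


def build_query(datasets, formatter="TSV"):
    """Serialize a query as an XML string.

    datasets: list of (dataset_name, filters, attributes) tuples, where
    filters is a dict {filter_name: value} and attributes is a list of
    attribute names. Element names follow an assumed query layout.
    """
    query = ET.Element("Query", {"formatter": formatter})
    for name, filters, attributes in datasets:
        dataset = ET.SubElement(query, "Dataset", {"name": name})
        # Filters are optional inputs that restrict the result set.
        for filter_name, value in filters.items():
            ET.SubElement(dataset, "Filter", {"name": filter_name, "value": value})
        # Attributes are the user-specified output columns.
        for attribute_name in attributes:
            ET.SubElement(dataset, "Attribute", {"name": attribute_name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical names, loosely modelled on the query described for Figure 3a.
xml_text = build_query([
    ("hsapiens_gene_ensembl", {}, ["ensembl_gene_id", "allele"]),
    ("pathway", {"pathway_name": "Regulation of DNA replication"}, ["pathway_id"]),
])
```

Note that the query names only datasets; nothing in it refers to a host, port or access protocol, which is the point of the federation described above.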
Query processing
The BioMart server-side software consists of a QueryPlanner and an Aggregator. The QueryPlanner consumes data access queries and formulates an execution plan. If BioMart Central Portal has direct access credentials to the database server, SQL statements are compiled; otherwise, XML-based web service requests are sent to the remote BioMart web server over HTTP and the results are retrieved over the same connection. The execution plan consists of ANSI SQL statements (to ensure compatibility across MySQL, Oracle and PostgreSQL), web service requests, or a combination of both if a query involves some datasets providing direct database access and others providing only web service access. To minimize database or HTTP time-outs and slow response times, the query engine uses a sophisticated batching system that performs the job over several iterations. The results are piped back to the user as soon as the first batch is finished. The Aggregator component merges data coming from different sources on a common concept. This is achieved by extending the aforementioned abstractions, Attributes and Filters, to Exportables and Importables. A dataset that exposes an attribute as an exportable can integrate data from all sources in which a filter with a similar name is tagged as an importable. The exportables and importables are columns with similar contents in a database table. The aggregation of results is an in-memory operation that does not prove very costly given the batching model described above.
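The interplay of batching and in-memory aggregation can be sketched in Python as follows. This is a simplified illustration, not the BioMart implementation: the function names are hypothetical, the second source is represented by a plain callable standing in for an SQL or web service request, and the merge is assumed to behave like an inner join on the exportable/importable pair.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from an iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def federated_join(primary_rows, fetch_secondary, exportable, importable, batch_size=2):
    """Merge two sources batch by batch on a shared key.

    primary_rows: iterable of dicts from the first dataset.
    fetch_secondary: callable taking a list of key values and returning the
        matching dicts from the second dataset.
    exportable / importable: key names in the first / second dataset.
    """
    for batch in batches(primary_rows, batch_size):
        keys = [row[exportable] for row in batch]
        # Index the second source's rows by the importable key, in memory.
        lookup = {}
        for row in fetch_secondary(keys):
            lookup.setdefault(row[importable], []).append(row)
        for row in batch:
            for match in lookup.get(row[exportable], []):
                # Merged rows are yielded as soon as their batch is done.
                yield {**row, **match}


# Toy data with made-up identifiers.
genes = [
    {"gene_id": "G1", "allele": "A/T"},
    {"gene_id": "G2", "allele": "C/G"},
    {"gene_id": "G3", "allele": "A/C"},
]
pathways = [{"gene": "G1", "pathway": "P1"}, {"gene": "G3", "pathway": "P1"}]


def fetch_pathways(keys):
    return [p for p in pathways if p["gene"] in keys]


merged = list(federated_join(genes, fetch_pathways, "gene_id", "gene"))
```

Because only one batch and its lookup table are held in memory at a time, the cost of the merge stays bounded by the batch size rather than by the size of either dataset.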
Registry
The BioMart Central Portal does not store any data locally except the meta information for all the datasets. The server maintains a registry containing references to remote BioMart web servers. To add a new mart to this registry, we require only the URL of the BioMart server hosting the databases or read access to the database server. This information is added to the registry file of the web server and, following a configuration rerun, the whole bioinformatics community can benefit from the data through BioMart Central Portal as well as several third-party software packages (see www.biomart.org for a complete list). The web server stays in sync with any data updates carried out on the various databases. However, updates to metadata become available only after the web server is reconfigured, shortly after the stable release of such updates.
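A registry of this kind can be pictured with a small Python sketch. The record fields, the in-memory dictionary and the example hosts are all hypothetical; the real registry is a configuration file whose format is not described here. The sketch only captures the rule that a mart needs either a server URL or database access, and that direct database access leads to SQL while a URL leads to web service requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MartEntry:
    """One registry record; field names are illustrative only."""
    name: str
    server_url: Optional[str] = None     # remote BioMart web server
    database_host: Optional[str] = None  # direct read access, if granted

    def access_mode(self):
        """Use SQL when database access is configured, else web services."""
        if self.database_host:
            return "sql"
        if self.server_url:
            return "web_service"
        raise ValueError(f"mart {self.name!r} has no access information")


registry = {}


def add_mart(entry):
    """Register a mart after checking that it can be reached somehow."""
    entry.access_mode()  # raises if neither URL nor database host is set
    registry[entry.name] = entry


add_mart(MartEntry("remote_mart", server_url="http://example.org/biomart/martservice"))
add_mart(MartEntry("local_mart", database_host="db.example.org"))
```

Since the registry holds references rather than data, adding a mart is cheap, which matches the statement above that only a URL or read access is needed.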
FUTURE DIRECTIONS
We are working on extending the system to support multiple and more specialized web GUIs. This includes integration of analysis and visualization plugins with special focus on cancer research. We also envisage substantial development towards semantic annotation of attributes and filters by data publishers that would enhance the interoperability of mart datasets with analysis tools and non-BioMart databases. MartServiceSoap provides a complete framework to define ontology references for the annotation of these terms and we would like to collaborate with data providers to achieve this goal.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
We are very thankful to Dr Paul Flicek (EMBL-EBI) for
his feedback on this manuscript.
FUNDING
Ontario Institute for Cancer Research; the Wellcome Trust, EMBL; the European Commission within its FP6 Programme under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LHSG-CT-2004-512092. Funding for open access charge: Ontario Government and Ministry of Research and Innovation.
Figure 3. (a) SOAP request envelope representing data federation between Ensembl Homo Sapiens (Sanger-UK) and Reactome pathway (CSHL-US) datasets. The query finds the alleles in genes involved in the regulation of DNA replication. (b) SOAP response envelope for the query shown in Figure 3a.
Conflict of interest statement. None declared.
REFERENCES
1. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Sayers,E.W. (2009) GenBank. Nucleic Acids Res., 37, D26–D31.
2. The International HapMap Consortium. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature, 449, 851–861.
3. Dowell,R.D., Jokerst,R.M., Day,A., Eddy,S.R. and Stein,L. (2001) The distributed annotation system. BMC Bioinformatics, 2, 7.
4. Kasprzyk,A., Keefe,D., Smedley,D., London,D., Spooner,W., Melsopp,C., Hammond,M., Rocca-Serra,P., Cox,T. and Birney,E. (2004) EnsMart: a generic system for fast and flexible access to biological data. Genome Res., 14, 160–169.
5. Hubbard,T.J.P., Aken,B.L., Ayling,S., Ballester,B., Beal,K., Bragin,K., Brent,S., Chen,Y., Clapham,P., Clarke,L. et al. (2009) Ensembl 2009. Nucleic Acids Res., 37, D690–D697.
6. The UniProt Consortium. (2008) The Universal Protein Resource (UniProt). Nucleic Acids Res., 36, D190–D195.
7. Vastrik,I., D’Eustachio,P., Schmidt,E., Joshi-Tope,G., Gopinath,G., Croft,D., de Bono,B., Gillespie,M., Jassal,B., Lewis,S. et al. (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol., 8, R39.
8. Bruford,E.A., Lush,M.J., Wright,M.W., Sneddon,T.P., Povey,S. and Birney,E. (2008) The HGNC Database in 2008: a resource for the human genome. Nucleic Acids Res., 36, D445–D448.
9. Bieri,T., Blasiar,D., Ozersky,P., Antoshechkin,I., Bastiani,C., Canaran,P., Chan,J., Chen,N., Chen,W.J., Davis,P. et al. (2007) WormBase: new content and better access. Nucleic Acids Res., 35, D506–D510.
10. Jones,P., Côté,R.G., Cho,S.Y., Klie,S., Martens,L., Quinn,A.F., Thorneycroft,D. and Hermjakob,H. (2008) PRIDE: new developments and new datasets. Nucleic Acids Res., 36, D878–D883.
11. Smedley,D., Haider,S., Ballester,B., Holland,R., London,D., Thorisson,G. and Kasprzyk,A. (2009) BioMart—biological queries made easy. BMC Genomics, 10, 22.
Nucleic Acids Research, 2009, Vol. 37, Web Server issue W27

Supplementary resource (1)

... The length of the CDS of the canonical transcript and number of paralogs were extracted from Ensembl BioMart release 75. 40,41 Inheritance. Mode of inheritance (MOI) of each gene was extracted from the HGMD and sub-categorized as autosomal recessive (AR), autosomal dominant (AD), autosomal recessive/dominant (ADAR), X-linked dominant (XLD), X-linked recessive (XLR), and unknown. ...
... 43 Pfam protein domain information for all genes in our database was downloaded from Ensembl BioMart (release 99) with Ensembl protein IDs with versions corresponding to the canonical transcripts as annotated by Ensembl VEP. 40,44 We also searched the InterPro database for the proteins lacking Pfam domain information in Ensembl BioMart. When a protein existed in a protein database but the variant position did not fall into a known protein domain, it was considered as ''outside'' of the domain, and when a protein was absent from both Pfam and Inter-Pro databases, it was considered as an ''unknown'' domain. ...
Article
Identifying whether a given genetic mutation results in a gene product with increased (gain-of-function; GOF) or diminished (loss-of-function; LOF) activity is an important step toward understanding disease mechanisms because they may result in markedly different clinical phenotypes. Here, we generated an extensive database of documented germline GOF and LOF pathogenic variants by employing natural language processing (NLP) on the available abstracts in the Human Gene Mutation Database. We then investigated various gene- and protein-level features of GOF and LOF variants and applied machine learning and statistical analyses to identify discriminative features. We found that GOF variants were enriched in essential genes, for autosomal-dominant inheritance, and in protein binding and interaction domains, whereas LOF variants were enriched in singleton genes, for protein-truncating variants, and in protein core regions. We developed a user-friendly web-based interface that enables the extraction of selected subsets from the GOF/LOF database by a broad set of annotated features and downloading of up-to-date versions. These results improve our understanding of how variants affect gene/protein function and may ultimately guide future treatment options.
... The predicted PPIs between banana and Foc4 will provide data support for molecular biology experiments. [31] to convert the different protein IDs into uniform IDs. ...
Article
Full-text available
Background The pathogen of banana Fusarium oxysporum f. sp. cubense race 4(Foc4) infects almost all banana species, and it is the most destructive. The molecular mechanism of the interactions between Fusarium oxysporum and banana still needs to be further investigated. Methods We use both the interolog and domain-domain method to predict the protein–protein interactions (PPIs) between banana and Foc4. The predicted protein interaction sequences are encoded by the conjoint triad and autocovariance method respectively to obtain continuous and discontinuous information of protein sequences. This information is used as the input data of the neural network model. The Long Short-Term Memory (LSTM) neural network five-fold cross-validation and independent test methods are used to verify the predicted protein interaction sequences. To further confirm the PPIs between banana and Foc4, the GO (Gene Ontology) and KEGG (Kyoto Encylopedia of Genes and Genomics) functional annotation and interaction network analysis are carried out. Results The experimental results show that the PPIs for banana and foc4 predicted by our proposed method may interact with each other in terms of sequence structure, GO and KEGG functional annotation, and Foc4 protein plays a more active role in the process of Foc4 infecting banana. Conclusions This study obtained the PPIs between banana and Foc4 by using computing means for the first time, which will provide data support for molecular biology experiments.
... To collect data on TC related to VDR mutation, the web-software BioMart Central Portal and the COSMIC (Catalogue of Somatic Mutations in Cancer) database [158] were used. BioMart offers a one-stop shop solution to access a wide array of biological databases, such as the major biomolecular sequence, pathway and annotation databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE [159]. The Cancer BioMart web-interface with the following criteria was used: 1. ...
... The gene content within the autozygosity islands and CNVRs were identified using the UMD3.1 genome assembly from the Ensembl Biomart tool (Haider et al., 2009). Database for Annotation, Visualization, and Integrated Discovery (DAVID) v.6.8 tool (Huang et al., 2009a(Huang et al., , 2009b) was used to identify significant (p<0.05) ...
Article
This study aimed to identify structural variations in the form of runs of homozygosity (ROH) and copy number variants (CNVs) in the genome of the Brazilian Senepol cattle and to better explore the potential biological functions of the genes located within such regions. A total of 140 animals were genotyped with the GeneSeek® Genomic Profiler™ Bovine 50K. Autosomal ROH and CNVs were detected after appropriated quality control. A total of 5,531 ROH were identified, with an average number of 40.37 per animal and an average length of 5.09 Mb. BTA1 had the greatest number of ROH per chromosome (n=458), while BTA20 displayed the greatest fraction of the chromosome covered with ROH (22.06%). A total of 181 CNVs were identified, with an average length size of 184.67 kb, with values ranging from 26.61 to 334.48 kb. A total of 16 CNV regions (CNVRs) were seen distributed among eleven autosomes, and the largest number of CNVRs (n=4) was described on BTA12. Autozygosity islands were identified using an outlier approach, and none of them overlapped with CNVRs. Our study revealed that the limited genetic basis together with the narrow number of imported animals to disseminate the breed might have strongly contributed to the low effective population size and the high genomic autozygosity proportion described in this population. The functional enrichment analysis revealed several significant terms (p<0.05) within the autozygosity islands and CNVRs closely linked to molecular and immune response mechanisms. The average FROH of different lengths were low to in the studied population, however, the autozygotic proportion in the genome indicates moderate to high inbreeding levels. The results exposed the need of implementing mating programs to control the increase of inbreeding and coancestry in Brazilian Senepol cattle. 
In these sense, Senepol breeders should apply selection and mating strategies to minimize the occurrence of long ROH in the offspring, as well as import of new genetic material to avoid the loss of genetic diversity. This study revealed the existence of overlapped regions between CNVRs and ROH throughout the genome of the Senepol cattle, that contributes to a better understanding of the functional role of genomic structural variations in taurine cattle adapted to tropical regions.
... The R software package BioMart (http:// www.biomarbiomart.org/) [30] was used to annotate genes in the module, using the reference genome Sscrofa11.1. We selected a subset of modules based on their functional annotation and selected genes related to fat development. ...
Article
Full-text available
Background Fat deposition is an important economic consideration in pig production. The amount of fat deposition in pigs seriously affects production efficiency, quality, and reproductive performance, while also affecting consumers’ choice of pork. Weighted gene co-expression network analysis (WGCNA) is effective in pig genetic studies. Therefore, this study aimed to identify modules that co-express genes associated with fat deposition in pigs (Songliao black and Landrace breeds) with extreme levels of backfat (high and low) and to identify the core genes in each of these modules. Results We used RNA sequences generated in different pig tissues to construct a gene expression matrix consisting of 12,862 genes from 36 samples. Eleven co-expression modules were identified using WGCNA and the number of genes in these modules ranged from 39 to 3,363. Four co-expression modules were significantly correlated with backfat thickness. A total of 16 genes ( RAD9A , IGF2R , SCAP , TCAP , SMYD1 , PFKM , DGAT1 , GPS2 , IGF1 , MAPK8 , FABP , FABP5 , LEPR , UCP3 , APOF , and FASN ) were associated with fat deposition. Conclusions RAD9A , TCAP , SMYD1 , PFKM , GPS2 , and APOF were the key genes in the four modules based on the degree of gene connectivity. Combining these results with those from differential gene analysis, SMYD1 and PFKM were proposed as strong candidate genes for body size traits. This study explored the key genes that regulate porcine fat deposition and lays the foundation for further research into the molecular regulatory mechanisms underlying porcine fat deposition.
... The epistatic SNP pairs from significant interactions were placed in the cattle UMD3.1 genome assembly using the Ensembl BIOMART tool with the Genes 94 database (Haider et al. 2009). The classification of genes regarding their biological function and appropriate analysis of metabolic pathways was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) v. 6.8, (Huang et al. 2007(Huang et al. , 2009) using all annotated genes in the cattle genome as background. ...
Article
Gene–gene interactions cause hidden genetic variation in natural populations and could be responsible for the lack of replication that is typically observed in complex traits studies. This study aimed to identify gene–gene interactions using the empirical Hilbert–Schmidt Independence Criterion method to test for epistasis in beef fatty acid profile traits of Nellore cattle. The dataset contained records from 963 bulls, genotyped using a 777 962k SNP chip. Meat samples of Longissimus muscle, were taken to measure fatty acid composition, which was quantified by gas chromatography. We chose to work with the sums of saturated (SFA), monounsaturated (MUFA), polyunsaturated (PUFA), omega‐3 (OM3), omega‐6 (OM6), SFA:PUFA and OM3:OM6 fatty acid ratios. The SNPs in the interactions where were mapped individually and used to search for candidate genes. Totals of 602, 3, 13, 23, 13, 215 and 169 candidate genes for SFAs, MUFAs, PUFAs, OM3s, OM6s and SFA:PUFA and OM3:OM6 ratios were identified respectively. The candidate genes found were associated with cholesterol, lipid regulation, low‐density lipoprotein receptors, feed efficiency and inflammatory response. Enrichment analysis revealed 57 significant GO and 18 KEGG terms ( < 0.05), most of them related to meat quality and complementary terms. Our results showed substantial genetic interactions associated with lipid profile, meat quality, carcass and feed efficiency traits for the first time in Nellore cattle. The knowledge of these SNP–SNP interactions could improve understanding of the genetic and physiological mechanisms that contribute to lipid‐related traits and improve human health by the selection of healthier meat products.
... In that respect, transcripts are classified as significantly differentially expressed if the p-value, after correction for multiple tests with the False Discovery Rate (FDR), is below 0.01. To determine the number of genes differentially expressed between the two conditions, we retrieved the Ensembl ID of genes corresponding to the transcript ID using biomart (38). Among the 16,548 differentially expressed transcripts identified, we were able to retrieve 8,386 genes identifiers. ...
Preprint
Full-text available
Transposable elements (TEs) are middle-repeated DNA sequences that can move along chromosomes using internal coding and regulatory regions. By their ability to move and because they are repeated, TEs can promote mutations. Especially they can alter the expression pattern of neighboring genes and have been shown to be involved in the mammalian regulatory network evolution. Human and mouse share more than 95% of their genomes and are affected by comparable diseases, which makes the mouse a perfect model in cancer research. However not much investigation concerning the mouse TE content has been made on this topics. In human cancer condition, a global activation of TEs can been observed which may ask the question of their impact on neighboring gene functioning. In this work, we used RNA sequences of highly aggressive pancreatic tumors from mouse to analyze the gene and TE deregulation happening in this condition compared to pancreas from healthy animals. Our results show that several TE families are deregulated and that the presence of TEs is associated with the expression divergence of genes in the tumor condition. These results illustrate the potential role of TEs in the global deregulation at work in the cancer cells.
Article
Full-text available
Preadipocyte differentiation plays an important role in lipid deposition and affects fattening efficiency in pigs. In the present study, preadipocytes isolated from the subcutaneous adipose tissue of three Landrace piglets were induced into mature adipocytes in vitro . Gene clusters associated with fat deposition were investigated using RNA sequencing data at four time points during preadipocyte differentiation. Twenty-seven co-expression modules were subsequently constructed using weighted gene co-expression network analysis. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses revealed three modules (blue, magenta, and brown) as being the most critical during preadipocyte differentiation. Based on these data and our previous differentially expressed gene analysis, angiopoietin-like 4 ( ANGPTL4 ) was identified as a key regulator of preadipocyte differentiation and lipid metabolism. After inhibition of ANGPTL4 , the expression of adipogenesis-related genes was reduced, except for that of lipoprotein lipase ( LPL ), which was negatively regulated by ANGPTL4 during preadipocyte differentiation. Our findings provide a new perspective to understand the mechanism of fat deposition.
Preprint
Full-text available
Background: The pathogen of banana Fusarium oxysporum race 4 (Foc4) infects almost all banana species, and it is the most destructive. The molecular mechanism of the interactions between Fusarium oxysporum and banana still needs to be further investigated. Methods: We use both the homology-interolog and domain-domain method to predict the protein-protein interactions (PPIs) between banana and Foc4. The predicted protein interaction sequences are encoded by the conjoint triad and autocovariance method respectively to obtain continuous and discontinuous information of protein sequences. This information is used as the input data of the neural network model. The Long Short-Term Memory (LSTM) neural network five-fold cross-validation and independent test methods are used to verify the predicted protein interaction sequences. To further confirm the PPIs between banana and Foc4, the Go functional annotation and interaction network analysis are carried out. Results: The experimental results show that the PPIs for banana and foc4 predicted by our proposed method may interact with each other in terms of sequence structure and GO functional annotation, and Foc4 protein plays a more active role in the process of Foc4 infecting banana. Conclusions: This study obtained the PPIs between banana and Foc4 by using computing means for the first time, which will provide data support for molecular biology experiments.
Article
Full-text available
Despite the uniform selection criteria for the isolation of human mesenchymal stem cells (MSCs), considerable heterogeneity exists which reflects the distinct tissue origins and differences between individuals with respect to their genetic background and age. This heterogeneity is manifested by the variabilities seen in the transcriptomes, proteomes, secretomes, and epigenomes of tissue-specific MSCs. Here, we review literature on different aspects of MSC heterogeneity including the role of epigenetics and the impact of MSC heterogeneity on therapies. We then combine this with a meta-analysis of transcriptome data from distinct MSC subpopulations derived from bone marrow, adipose tissue, cruciate, tonsil, kidney, umbilical cord, fetus, and induced pluripotent stem cell-derived MSCs (iMSCs). Beyond that, we investigate transcriptome differences between tissue-specific MSCs and pluripotent stem cells. Our meta-analysis of numerous MSC-related data sets revealed markers and associated biological processes characterizing the heterogeneity and the common features of MSCs from various tissues. We found that this heterogeneity is mainly related to the origin of the MSCs and infer that microenvironment and epigenetics are key drivers. The epigenomes of MSCs alter with age and this has a profound impact on their differentiation capabilities. Epigenetic modifications of MSCs are propagated during cell divisions and manifest in differentiated cells, thus contributing to diseased or healthy phenotypes of the respective tissue. An approach used to reduce heterogeneity caused by age- and tissue-related epigenetic and microenvironmental patterns is the iMSC concept: iMSCs are MSCs generated from induced pluripotent stem cells (iPSCs). During iMSC generation, epigenetic and chromatin remodeling result in a gene expression pattern associated with rejuvenation, making it possible to overcome age-related shortcomings (e.g., limited differentiation and proliferation capacity). The importance of the iMSC concept is underlined by multiple clinical trials. In conclusion, we propose the use of rejuvenated iMSCs to bypass the tissue- and age-related heterogeneity associated with native MSCs.
Article
The primary mission of UniProt is to support biological research by maintaining a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is updated and distributed every 3 weeks and can be accessed online for searches or download at http://www.uniprot.org.
Article
The Universal Protein Resource (UniProt) provides a stable, comprehensive, freely accessible, central resource on protein sequences and functional annotation. The UniProt Consortium is a collaboration between the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, development of a user-friendly UniProt website, and the provision of additional value-added information through cross-references to other databases. UniProt is comprised of four major components, each optimized for different uses: the UniProt Knowledgebase, the UniProt Reference Clusters, the UniProt Archive and the UniProt Metagenomic and Environmental Sequences database. UniProt is updated and distributed every three weeks, and can be accessed online for searches or download at http://www.uniprot.org.
Article
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information that is essential for modern biological research. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute, the Protein Information Resource and the Swiss Institute of Bioinformatics. The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, a user-friendly UniProt website and the provision of additional value-added information through cross-references to other databases. UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. One of the key achievements of the UniProt consortium in 2008 is the completion of the first draft of the complete human proteome in UniProtKB/Swiss-Prot. This manually annotated representation of all currently known human protein-coding genes was made available in UniProt release 14.0 with 20 325 entries. UniProt is updated and distributed every three weeks and can be accessed online for searches or downloaded at www.uniprot.org.
Article
Biologists need to perform complex queries, often across a variety of databases. Typically, each data resource provides its own advanced query interface, which the biologist must learn before they can begin to query it. Frequently, more than one data source is required, and for high-throughput analysis, cutting and pasting results between websites is very time consuming. Therefore, many groups rely on local bioinformatics support to process queries by accessing the resource's programmatic interfaces if they exist. This is not an efficient solution in terms of cost and time. Instead, it would be better if the biologist only had to learn one generic interface. BioMart provides such a solution. BioMart enables scientists to perform advanced querying of biological data sources through a single web interface. The power of the system comes from integrated querying of data sources regardless of their geographical locations. Once these queries have been defined, they may be automated with its "scripting at the click of a button" functionality. BioMart's capabilities are extended by integration with several widely used software packages such as BioConductor, DAS, Galaxy, Cytoscape and Taverna. In this paper, we describe all aspects of BioMart from a user's perspective and demonstrate how it can be used to solve real biological use cases such as SNP selection for candidate gene screening or annotation of microarray results. BioMart is an easy to use, generic and scalable system and has therefore become an integral part of large data resources including Ensembl, UniProt, HapMap, Wormbase, Gramene, Dictybase, PRIDE, MSD and Reactome. BioMart is freely accessible to use at http://www.biomart.org.
Article
The Ensembl project (http://www.ensembl.org) is a comprehensive genome information system featuring an integrated set of genome annotation, databases, and other information for chordate, selected model organism and disease vector genomes. As of release 51 (November 2008), Ensembl fully supports 45 species, and three additional species have preliminary support. New species in the past year include orangutan and six additional low coverage mammalian genomes. Major additions and improvements to Ensembl since our previous report include a major redesign of our website; generation of multiple genome alignments and ancestral sequences using the new Enredo-Pecan-Ortheus pipeline and development of our software infrastructure, particularly to support the Ensembl Genomes project (http://www.ensemblgenomes.org/).
Article
Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory. Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is defined by the DAS XML specification. Annotations are displayed in layers, one per server. Any client or server adhering to the DAS XML specification can participate in the system; we describe a simple prototype client and server example. The DAS specification is being used experimentally by Ensembl, WormBase, and the Berkeley Drosophila Genome Project. Continued success will depend on the readiness of the research community to adopt DAS and provide annotations. All components are freely available from the project website http://www.biodas.org/.
Article
The GenBank® sequence database (http://www.ncbi.nlm.nih.gov/) incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (WWW) or Sequin programs to send their sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are also included as an additional source of biological annotation. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, e-mail and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services of interest to biologists.
Data
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information that is essential for modern biological research. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute, the Protein Information Resource and the Swiss Institute of Bioinformatics. The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, a user-friendly UniProt website and the provision of additional value-added information through cross-references to other databases. UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. One of the key achievements of the UniProt consortium in 2008 is the completion of the first draft of the complete human proteome in UniProtKB/Swiss-Prot. This manually annotated representation of all currently known human protein-coding genes was made available in UniProt release 14.0 with 20 325 entries. UniProt is updated and distributed every three weeks and can be accessed online for searches or downloaded at www.uniprot.org.