Article

MBEToolbox: a MATLAB toolbox for sequence data analysis in molecular biology and evolution.

Department of Microbiology, University of Hong Kong, Pokfulam, Hong Kong, China.
BMC Bioinformatics (Impact Factor: 2.67). 02/2005; 6(1):64. DOI: 10.1186/1471-2105-6-64
Source: PubMed

ABSTRACT MATLAB is a high-performance language for technical computing, integrating computation, visualization, and programming in an easy-to-use environment. It has been widely used in many areas, such as mathematics and computation, algorithm development, data acquisition, modeling, simulation, and scientific and engineering graphics. However, few functions are freely available in MATLAB to perform the sequence data analyses specifically required for molecular biology and evolution.
We have developed a MATLAB toolbox, called MBEToolbox, that aims to fill this gap by offering efficient implementations of the most needed functions in molecular biology and evolution. It can be used to manipulate aligned sequences, calculate evolutionary distances, estimate synonymous and nonsynonymous substitution rates, and infer phylogenetic trees. Moreover, it provides an extensible, functional framework for users with more specialized requirements to explore and analyze aligned nucleotide or protein sequences from an evolutionary perspective. The full set of functions in the toolbox is accessible from the command line for seasoned MATLAB users. A graphical user interface, which may be especially useful for non-specialist end users, is also provided.
MBEToolbox is a useful tool that can aid in the exploration, interpretation and visualization of data in molecular biology and evolution. The software is publicly available at http://web.hku.hk/~jamescai/mbetoolbox/ and http://bioinformatics.org/project/?group_id=454
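
As an illustration of what "calculating evolutionary distances" from aligned sequences involves, the following is a minimal sketch of the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. It is written in Python rather than MATLAB, it is not MBEToolbox code, and the function names p_distance and jukes_cantor_distance are hypothetical names chosen for this example. Skipping gaps and ambiguous bases is likewise an assumption made here, not a description of how the toolbox treats such sites.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned nucleotide sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped
    (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = "ACGT"
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: expected substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined at saturation; report infinity.
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For two aligned sequences that differ at one site in eight, p is 0.125 and the corrected distance is slightly larger, because the correction accounts for multiple substitutions at the same site.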

Full-text

Available from: Xuhua Xia, Jun 15, 2015
  • ABSTRACT: The evolutionary origin of "orphan" genes, genes that lack sequence similarity to any known gene, remains a mystery. One suggestion has been that most orphan genes evolve rapidly so that similarity to other genes cannot be traced after a certain evolutionary distance. This can be tested by examining the divergence rates of genes with different degrees of lineage specificity. Here the lineage specificity (LS) of a gene describes the phylogenetic distribution of that gene's orthologues in related species. Highly lineage-specific genes will be distributed in fewer species in a phylogeny. In this study, we have used the complete genomes of seven ascomycotan fungi and two animals to define several levels of LS, such as Eukaryotes-core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific, Aspergillus-specific, and Saccharomyces-specific. We compare the rates of gene evolution in groups of higher LS to those in groups with lower LS. Molecular evolutionary analyses indicate an increase in nonsynonymous nucleotide substitution rates in genes with higher LS. Several analyses suggest that LS is correlated with the evolutionary rate of the gene. This correlation is stronger than those of a number of other factors that have been proposed as predictors of a gene's evolutionary rate, including the expression level of genes, gene essentiality or dispensability, and the number of protein-protein interactions. The accelerated evolutionary rates of genes with higher LS may reflect the influence of selection and adaptive divergence during the emergence of orphan genes. These analyses suggest that accelerated rates of gene evolution may be responsible for the emergence of apparently orphan genes.
    Journal of Molecular Evolution 08/2006; 63(1):1-11. DOI:10.1007/s00239-004-0372-5 · 1.86 Impact Factor
  • ABSTRACT: Assessing genetic diversity within populations is vital for understanding the nature of evolutionary processes at the molecular level. PGEToolbox is a Matlab-based open-source software package for data analysis in population genetics. The main features of this software are as follows: 1) capability for handling both DNA sequence polymorphisms and single nucleotide polymorphisms (SNPs), which include genotype and haplotype data; 2) exhaustive population genetic analyses and neutrality tests based on coalescent theory; 3) extensibility and scalability for complex and large genome-wide datasets; 4) simple yet effective graphical user interfaces and sophisticated visualization of data and results. For academic use, PGEToolbox is available free of charge at http://bioinformatics.org/pgetoolbox.
    The Journal of heredity 02/2008; 99(4):438-40. DOI:10.1093/jhered/esm127 · 1.97 Impact Factor
  • ABSTRACT: Feature selection on mass spectrometry (MS) data is essential for improving classification performance and biomarker discovery. The number of MS samples is typically very small compared with the high dimensionality of the samples, which makes the problem of biomarker discovery very hard. In this paper, we propose the use of genetic programming (GP) for biomarker detection and classification of MS data. The proposed approach is composed of two phases: in the first phase, feature selection and ranking are performed; in the second phase, classification is performed. The results show that the proposed method can achieve better classification performance and biomarker detection rate than the information gain (IG) based and the RELIEF feature selection methods. In addition, four classifiers (Naive Bayes, J48 decision tree, random forest and support vector machines) are used to further test the performance of the top-ranked features. The results show that the four classifiers achieve better performance with the top-ranked features from the proposed method than with those from the IG and RELIEF methods. Furthermore, GP also outperforms a genetic algorithm approach on most of the data sets used.
    Connection Science 04/2014; 26(3):215-243. DOI:10.1080/09540091.2014.906388 · 0.77 Impact Factor