Article

Improved gapped alignment in BLAST

Abstract

Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the BLAST algorithm to reduce the computational cost of searching with negligible effect on accuracy. This new step, semigapped alignment, compromises between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing BLAST to accurately filter sequences with lower computational cost. In addition, we propose a heuristic, restricted insertion alignment, that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible effect on accuracy. Together, after including an optimization of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in BLAST. We conclude that our techniques are an important improvement to the BLAST algorithm. Source code for the alignment algorithms is available for download at http://www.bsg.rmit.edu.au/iga/.
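
For orientation, the gapped alignment stage that the abstract aims to speed up is commonly described by an affine-gap local alignment recursion. The sketch below is a generic textbook formulation of that baseline, not the authors' code and not an implementation of semigapped or restricted insertion alignment. The function name and the match/mismatch and gap scores are assumptions chosen for illustration; BLAST itself uses a substitution matrix.

```python
def local_align_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gaps (generic baseline sketch).

    A gap of length k costs gap_open + k * gap_extend. All scoring values
    here are illustrative assumptions, not BLAST defaults.
    """
    neg_inf = float("-inf")
    n = len(b)
    prev_best = [0] * (n + 1)        # best score ending at (i-1, j)
    prev_gap_a = [neg_inf] * (n + 1)  # best score ending at (i-1, j) with a[i-1] against a gap
    best_score = 0
    for i in range(1, len(a) + 1):
        cur_best = [0] * (n + 1)
        cur_gap_a = [neg_inf] * (n + 1)
        gap_b = neg_inf  # best score ending at (i, j) with b[j-1] against a gap
        for j in range(1, n + 1):
            # Open a new gap from the best cell, or extend an existing gap.
            cur_gap_a[j] = max(prev_best[j] + gap_open + gap_extend,
                               prev_gap_a[j] + gap_extend)
            gap_b = max(cur_best[j - 1] + gap_open + gap_extend,
                        gap_b + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: never let the score fall below zero.
            cur_best[j] = max(0, prev_best[j - 1] + sub, cur_gap_a[j], gap_b)
            best_score = max(best_score, cur_best[j])
        prev_best, prev_gap_a = cur_best, cur_gap_a
    return best_score
```

Every cell evaluates two gap states in addition to the diagonal move, which is the per-cell cost that techniques restricting where gaps may occur would reduce.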

... BLAST performs comparisons between a pair of sequences in order to find regions of local similarity [1]. The popular BLAST derivatives are NCBI-BLAST (web-based and standalone versions are available) [2], [3], WU-BLAST [4], Paracel BLAST [5] and fast search algorithm (FSA)-BLAST [6], [7]. Among them, NCBI-BLAST (standalone) and FSA-BLAST are open-source programs that anyone can download (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/Latest, ...
... Because of its widespread usage (used over 120,000 times each day [8]), any improvement to the BLAST algorithm that reduces runtime space and time without affecting sensitivity and selectivity [9] would be highly desirable. Over the years, several modifications to the fundamental algorithms and new heuristics in BLAST have been proposed to improve speed and minimize runtime space [3][4][5][6][7], [10][11][12][13][14][15][16]. This paper proposes a modified data structure that reduces the runtime space during the hit detection stage of the FSA protein BLAST algorithm. ...
... The BLAST program was designed to analyze both protein and DNA sequences. It has four main algorithmic steps: finding hits, performing ungapped alignments, performing gapped alignments, and computing the traceback and outputting the results [2], [3], [6], [7]. NCBI BLAST and FSA BLAST differ functionally in two main ways: the structure used to find hits between a query sequence and a database sequence during the hit detection stage, and the use of semi-gapped and restricted insertion alignments during the alignment stage of the algorithm [6,7]. ...
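
As a rough illustration of the second of these steps, the sketch below extends a hit to the right without gaps and stops once the running score falls too far below the best score seen, an idea usually called an X-drop rule. It is a simplified sketch under stated assumptions: the function name, the match/mismatch scores, and the `x_drop` value are hypothetical, and real implementations score with a substitution matrix and extend in both directions.

```python
def ungapped_extend_right(query, subject, q_start, s_start,
                          x_drop=5, match=2, mismatch=-1):
    """Extend a hit rightwards without gaps; return (best_score, best_length).

    Scores and the x_drop cutoff are illustrative assumptions.
    """
    best = score = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score dropped too far below the best; stop extending
    return best, best_len
```

Only extensions whose best score clears a threshold would be passed on to the more expensive gapped alignment step.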
... When the function of a protein is not known, a putative function is sometimes assigned. These assignments are often the result of simple bioinformatics analyses including sequence and three-dimensional structure comparisons using programs such as BLAST [20,21] and Dali [22,23]. SG proteins can be assigned a putative function based on simple transfer of function from the closest sequence or structure match. ...
... ProFunc [82] is a metaserver that combines sequence, global structure, and local structure-based methods to obtain a set of function predictions from which one might seek consensus. First, the protein of unknown function is analyzed by numerous sequence searches, shown on the left-hand side in Fig. 3. BLAST [20,21] analysis scans both the PDB and UniProt and uses multiple sequence alignment to determine sequence similarities and detect sequence motifs [83]. Gene neighbors are also examined based on the query protein's predicted location within the genome. ...
... The published results revolve around a maximum F-measure, also known as Fmax, which corresponds to a "harmonic mean between precision and recall" [92]. Two methods, BLAST [20,21] and a Naïve baseline method [92], were used to compare the test methods. In the BLAST method, the GO terms that define any protein sequences for which a function has been experimentally determined are assigned to the sequence being analyzed. ...
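
The maximum F-measure can be sketched as follows, assuming precision and recall have already been computed at each decision threshold. The function names and the input format (a list of precision/recall pairs) are assumptions for illustration; how the assessment averages precision and recall over proteins is not described in this excerpt and is not modeled here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_max(pr_curve):
    """Largest F-measure over (precision, recall) pairs, one per threshold."""
    return max(f_measure(p, r) for p, r in pr_curve)
```

For example, `f_max([(1.0, 0.1), (0.5, 0.5), (0.2, 0.9)])` picks the balanced middle point, since the harmonic mean penalizes a large gap between precision and recall.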
Article
With the exponential growth in the determination of protein sequences and structures via genome sequencing and structural genomics efforts, there is a growing need for reliable computational methods to determine the biochemical function of these proteins. This paper reviews the efforts to address the challenge of annotating the function at the molecular level of uncharacterized proteins. While sequence- and three-dimensional-structure-based methods for protein function prediction have been reviewed previously, the recent trends in local structure-based methods have received less attention. These local structure-based methods are the primary focus of this review. Computational methods have been developed to predict the residues important for catalysis and the local spatial arrangements of these residues can be used to identify protein function. In addition, the combination of different types of methods can help obtain more information and better predictions of function for proteins of unknown function. Global initiatives, including the Enzyme Function Initiative (EFI), COMputational BRidges to EXperiments (COMBREX), and the Critical Assessment of Function Annotation (CAFA), are evaluating and testing the different approaches to predicting the function of proteins of unknown function. These initiatives and global collaborations will increase the capability and reliability of methods to predict biochemical function computationally and will add substantial value to the current volume of structural genomics data by reducing the number of absent or inaccurate functional annotations.
... The unigenes were aligned with BLASTX against four protein databases (NCBI non-redundant or Nr proteins, Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes or KEGG and euKaryotic Ortholog Groups or KOG) and one nucleotide database (NCBI nucleotide or Nt sequences) with an E-value threshold of 1.0E-5 for all except KOG with a threshold of 1.0E-3 [15,16]. Using nucleotide-based annotation, Blast2GO [17] software was used to obtain GO annotation categories defined by molecular function, cellular component and biological process ontologies. ...
... Within the biological process category, the highest sub-category was metabolic process (20,281, 21.0%), followed by cellular process (18,363, 19.0%), and single-organism process (15,757, 16.3%). Under the molecular function category, binding activity (15,837, 42.0%) and catalytic activity (15,475, 41.0%) were prominently represented. ...
Article
Rice bean (Vigna umbellata (Thunb.) Ohwi & Ohashi) is a warm season annual legume mainly grown in East Asia. Only scarce genomic resources are currently available for this legume crop species and no simple sequence repeat (SSR) markers have been specifically developed for rice bean yet. In this study, approximately 26 million high quality cDNA sequence reads were obtained from rice bean using Illumina paired-end sequencing technology and assembled into 71,929 unigenes with an average length of 986 bp. Of these unigenes, 38,840 (33.2%) showed significant similarity to proteins in the NCBI non-redundant protein and nucleotide sequence databases. Furthermore, 30,170 (76.3%) could be classified into gene ontology categories, 25,451 (64.4%) into Swiss-Prot categories and 21,982 (55.6%) into KOG database categories (E-value < 1.0E-5). A total of 9,301 (23.5%) were mapped onto 118 pathways using the Kyoto Encyclopedia of Genes and Genome (KEGG) pathway database. A total of 3,011 genic SSRs were identified as potential molecular markers. AG/CT (30.3%), AAG/CTT (8.1%) and AGAA/TTCT (20.0%) are the three main repeat motifs. A total of 300 SSR loci were randomly selected for validation by using PCR amplification. Of these loci, 23 primer pairs were polymorphic among 32 rice bean accessions. A UPGMA dendrogram revealed three major clusters among 32 rice bean accessions. The large number of SSR-containing sequences and genic SSRs in this study will be valuable for the construction of high-resolution genetic linkage maps, association or comparative mapping and genetic analyses of various Vigna species.
... BLAST performs comparisons between a pair of sequences in order to find regions of local similarity [1]. The popular BLAST derivatives are NCBI-BLAST (web-based and standalone versions are available), WU-BLAST, Paracel BLAST and fast search algorithm (FSA)-BLAST [2, 3, 4, 5, 6, 7]. Among them, NCBI-BLAST (standalone) and FSA-BLAST are open-source programs that anyone can download (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/Latest, ...
... For the above reasons, any improvement to the BLAST algorithm that reduces runtime space and time without affecting sensitivity and selectivity would be highly desirable [10]. Over the years, several modifications to the fundamental algorithms and new heuristics in BLAST have been proposed to improve speed and minimize runtime space [3][4][5][6][7], [11][12][13][14][15][16][17][18]. The BLAST program was designed to analyze both protein and DNA sequences. ...
... The BLAST program was designed to analyze both protein and DNA sequences. It has four main algorithmic steps: finding hits, performing ungapped alignments, performing gapped alignments, and computing the traceback and outputting the results [2, 3, 6, 7]. NCBI BLAST and FSA BLAST differ functionally in two main ways: the structure used to find hits between a query sequence and a database sequence during the hit detection process, and the use of semi-gapped and restricted insertion alignments during the alignment stage of the algorithm [6,7]. ...
... Non-redundant unigenes were then clustered with large EST datasets using TIGR Gene Indices clustering tools (TGICL) (Pertea et al., 2003). Finally, the sequence direction of non-redundant unigenes was determined by BlastX alignment (Altschul et al., 1997; Cameron et al., 2004) using a maximum E-value of 10^-5 as cut-off. After searching the unigenes against the NCBI non-redundant (Nr), Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG) database, and Cluster of Orthologous Groups (COG) database, the sequence direction of the unigenes was determined based on the best search results (Xiao et al., 2013). ...
... This may account for the relatively low annotation rate. On the other hand, short sequences frequently fail to match database entries due to the significance level required for the BlastX similarity search, which depends partly on the length of the query sequence (Altschul et al., 1997; Cameron et al., 2004). In the present study, 8435 unigenes had a length of less than 300 bp, of which only 1273 (15.1%) had a Blast match. ...
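
The length dependence mentioned here can be illustrated with the standard Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length, n the database length, and S the raw alignment score. The sketch below is illustrative only: the function names are hypothetical, the default `lam` and `k` values are assumed example parameters, and the edge-effect length corrections applied by real BLAST implementations are ignored.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul E-value estimate; lam and k are assumed example values."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def min_score_for_cutoff(cutoff, query_len, db_len, lam=0.267, k=0.041):
    """Smallest raw score whose estimated E-value does not exceed the cutoff."""
    return math.log(k * query_len * db_len / cutoff) / lam
```

Under this estimate the score needed to pass a fixed cutoff shrinks only logarithmically as the query gets shorter, while the highest score a query can reach shrinks roughly in proportion to its length, which is consistent with short unigenes often failing to produce a significant match.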
Article
The transcriptome of Kappaphycus alvarezii was profiled using high-throughput Solexa paired-end sequencing technology. A total of 61 million sequencing reads was generated by filtering the low-quality reads, and 28 701 unigenes with a mean length of 901 bp were obtained based on de novo assembly. In similarity alignments against the NCBI non-redundant protein sequence (NR) database, Swissprot database, Cluster of Orthologous Groups (COG) database, Gene ontology (GO) database, and Kyoto Encyclopedia of Genes and Genomes (KEGG) database, 11 996 (41.79%) unigenes were identified with significant hits (E-value < 10^-5) against existing genes. Functional annotation with the KEGG pathway database identified 8975 unigenes and mapped to 125 pathways. Through functional enrichment analysis of the genes with a higher expression value than the average RPKM (reads per kilobase per million reads), we found that some important pathways were highly expressed. The substantial number of transcript sequences of K. alvarezii provides a valuable resource for potential gene identification and comparative genomic studies.
... In this thesis, we first present highly efficient methodologies to parallelize the Smith-Waterman pairwise sequence alignment algorithm within a Cell chip [1], where we achieve near-constant efficiency for up to 16 SPEs on the dual-Cell QS20 blades, and our approach is highly scalable to more cores, if available. However, by using only one Cell processor for aligning a pair of sequences, we limit the problem space to aligning sequences smaller than 8KB on the QS20 Cell blade and smaller than 3.5KB on a PS3 console, due to the inherent memory constraints of the algorithm. ...
... BLAST misses several optimal sequence alignments that the ideal Smith-Waterman algorithm would have found [6]. Several variations of the original BLAST algorithm have emerged to improve the sensitivity of the sequence search while maintaining the high speed [28,4,10,16]. ...
... When combined, our three improvements more than double the speed of the gapped alignment stages in blast, and we conclude that our techniques are important improvements to the algorithm. The results and discussions presented in this chapter are based on work published in Cameron et al. [2004]. ...
... Finally, Section 4.3 presents a summary of this work. A preliminary version of the results and discussions presented in this chapter appeared in Cameron et al. [2004]. ...
... Because query indexing usually contains a high percentage of empty slots due to few letters in a query, most of the optimizations of query indexing seek to reduce the sparsity of the index, e.g., the thick backbone and the position array in NCBI BLAST [26] and the deterministic finite automaton (DFA) in FSA-BLAST [14]. For database indexing, which is full of positions from millions of subject sequences from a database (e.g., about 6 million sequences in env_nr database, and over 85 million sequences in nr database), the major challenges differ substantially from query indexing. ...
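
The query-indexing idea can be sketched with a hash map from fixed-length words to their positions in the query. This is a simplified illustration, not the thick backbone, position array, or DFA structures named above: it indexes exact words only (no neighborhood words or score threshold), and the function names and `word_len` default are assumptions.

```python
from collections import defaultdict


def build_query_index(query, word_len=3):
    """Map each word of length word_len to the query positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(query) - word_len + 1):
        index[query[pos:pos + word_len]].append(pos)
    return index


def find_hits(index, subject, word_len=3):
    """Scan a subject sequence and return (query_pos, subject_pos) hit pairs."""
    hits = []
    for s_pos in range(len(subject) - word_len + 1):
        word = subject[s_pos:s_pos + word_len]
        for q_pos in index.get(word, ()):
            hits.append((q_pos, s_pos))
    return hits
```

A five-residue query yields only three indexed words out of the 8,000 possible three-letter protein words, which shows why a dense table over all words would be mostly empty slots.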
... Moreover, the current version of muBLASTP can only produce results identical to NCBI BLAST when both use the default output format (i.e., "pairwise" format) and the default composition-based statistics method. As a result, our software can only generate results similar to NCBI BLAST if any other parameter is set. [Algorithm listing fragment interleaved in the extracted text: for each index block I_b in I, an OpenMP loop (#pragma omp parallel for schedule(dynamic)) over queries Q_i in Q; then a second OpenMP loop of the same form over queries Q_i in Q; output: Print_Result(R).] ...
Article
Background: The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Because query indexing and database indexing pose different challenges and have different characteristics, the existing techniques for query-indexed search cannot be applied to database-indexed search. Results: muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers hits identical to those returned by NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedup for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. Conclusions: With a newly designed index structure for protein databases and associated optimizations in the BLASTP algorithm, we refactored the BLASTP algorithm for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1302-4) contains supplementary material, which is available to authorized users.
... The unigenes were divided into either clusters or singletons. BLASTX (Cameron et al. 2004) alignments (applying an E-value of <1.0E-5) between each unigene sequence and those lodged in the non-redundant protein (Nr) database (NCBI), non-redundant nucleotide (Nt) database (NCBI), Swiss-Prot protein database (http://www.expasy.ch/sprot), the KEGG pathway database, GO (http://www.geneontology.org/) and clusters of orthologous groups (COG) databases (http://www.ncbi.nlm.nih.gov/COG) were performed, and the best alignments were used to infer the directionality of the unigene. When the outcome from the various databases conflicted, the priority order applied was Nr, Nt, Swiss-Prot, KEGG and COG. ...
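
The conflict-resolution rule described here amounts to taking the direction reported by the highest-priority database that produced a hit. A minimal sketch follows, assuming a hypothetical input format in which each database name maps to the strand of its best alignment ("+" or "-") and databases without a hit are absent or map to None.

```python
# Priority order as stated in the text above.
PRIORITY = ("Nr", "Nt", "Swiss-Prot", "KEGG", "COG")


def infer_direction(best_hit_strand):
    """Return the strand from the highest-priority database with a hit, else None.

    best_hit_strand is a hypothetical dict such as {"Nt": "-", "COG": "+"}.
    """
    for database in PRIORITY:
        strand = best_hit_strand.get(database)
        if strand is not None:
            return strand
    return None
```

For instance, a unigene with a "-" hit in Nt and a "+" hit in Swiss-Prot would be assigned "-", because Nt ranks above Swiss-Prot in the stated order.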
Article
Recessive genic male sterility (RGMS) is common in plants and has been widely applied as an effective and economic system for hybrid seed production in many crops. However, little is known regarding the molecular mechanisms of RGMS in cabbage (Brassica oleracea L. var. capitata) due to limited transcriptomic and genomic data. Comparative transcriptomic analyses were performed on the sterile (RGMS632-MS) and fertile plants (RGMS632-SF) of a RGMS line (RGMS632) using second-generation Illumina sequencing to identify critical genes and pathways associated with male sterility. A total of approximately 109 million sequencing reads was obtained using RNA-seq. Abundance analysis identified a total of 5107 unigenes that showed significant differences between the two kinds of plants. Among these, 1558 genes were upregulated while 3549 genes were downregulated by more than twofold in RGMS632-MS compared to RGMS632-SF. KEGG pathway enrichment analysis revealed changes in the transcript abundance of genes involved in the metabolism and signal transduction of various phytohormones. The majority of hormone signalling pathways were downregulated in RGMS632-MS. Furthermore, a set of potential candidate genes involved in the formation or abortion of pollen were investigated. These results increased our understanding of the molecular mechanisms and biological processes in RGMS plants.
... A putative or possible function is sometimes assigned when the function of a protein is unknown. These assignments are often incorrect due to the simple bioinformatics analyses including sequence and structural comparisons using programs such as BLAST (Altschul et al. 1997; Cameron et al. 2004) and Dali (Holm et al. 2006; Holm and Rosenstrom 2010), respectively. With the growing challenges of protein function prediction, the development of reliable computational methods is vital for assigning accurate function to proteins with confidence (Caitlyn et al. 2015). ...
Article
As an extended gamut of integral membrane (extrinsic) proteins, and based on their transporting specificities, P-type ATPases include five subfamilies in Arabidopsis, inter alia, P4ATPases (phospholipid-transporting ATPase), P3AATPases (plasma membrane H+ pumps), P2A and P2BATPases (Ca2+ pumps) and P1B ATPases (heavy metal pumps). Although many different computational methods have been developed to predict substrate specificity of unknown proteins, further investigation is needed to improve the efficiency and performance of the predictors. In this study, various attribute weighting and supervised clustering algorithms were employed to identify the main amino acid composition attributes, which can influence the substrate specificity of ATPase pumps, classify protein pumps and predict the substrate specificity of uncharacterized ATPase pumps. The results of this study indicate that both non-reduced coefficients pertaining to absorption and Cys extinction within 280 nm, the frequencies of hydrogen, Ala, Val, carbon, hydrophilic residues, the counts of Val, Asn, Ser, Arg, Phe, Tyr, hydrophilic residues, Phe-Phe, Ala-Ile, Phe-Leu, Val-Ala and length are specified as the most important amino acid attributes through applying the whole attribute weighting models. Here, learning algorithms engineered in a predictive machine (Naive Bayes) are proposed to foresee the Q9LVV1 and O22180 substrate specificities (P-type ATPase like proteins) with 100 % prediction confidence. For the first time, our analysis demonstrated promising application of bioinformatics algorithms in classifying ATPases pumps. Moreover, we suggest the predictive systems that can assist towards the prediction of the substrate specificity of any new ATPase pumps with the maximum possible prediction confidence.
... We performed de novo transcriptome assembly using Trinity software (Campton, NH, USA) [35]. Sequence direction was judged by BLASTX alignment using the non-redundant protein (Nr), non-redundant nucleotide (Nt), Swiss-Prot, Gene Ontology (GO), and Cluster of Orthologous Groups (COG) databases (in priority order). ...
Article
Japanese red pine (Pinus densiflora) is extensively cultivated in Japan, Korea, China, and Russia and is harvested for timber, pulpwood, garden, and paper markets. However, genetic information and molecular markers were very scarce for this species. In this study, over 51 million sequencing clean reads from P. densiflora mRNA were produced using Illumina paired-end sequencing technology. It yielded 83,913 unigenes with a mean length of 751 bp, of which 54,530 (64.98%) unigenes showed similarity to sequences in the NCBI database. Among these, the best matches in the NCBI Nr database were Picea sitchensis (41.60%), Amborella trichopoda (9.83%), and Pinus taeda (4.15%). A total of 1953 putative microsatellites were identified in 1784 unigenes using MISA (MicroSAtellite) software, of which the tri-nucleotide repeats were most abundant (50.18%) and 629 EST-SSR (expressed sequence tag-simple sequence repeats) primer pairs were successfully designed. Among 20 EST-SSR primer pairs randomly chosen, 17 markers yielded amplification products of the expected size in P. densiflora. Our results will provide a valuable resource for gene-function analysis, germplasm identification, molecular marker-assisted breeding and resistance-related gene(s) mapping for P. densiflora.
... To annotate the assembled unigenes, a basic local alignment search tool (BLAST) alignment with an E-value threshold of 10^-5 was conducted in Nr, Nt, Swiss-Prot, KEGG, and COG (Altschul et al., 1997; Cameron et al., 2004). The results indicated that of the 126,402 unigenes, 99,712 (78.88%) were annotated, and 93,307 (73.82%), 86,638 (68.54%), 55,955 (44.27%), 49,247 (38.96%), and 28,925 (22.88%) of which were matched to Nr, Nt, Swiss-Prot, KEGG, and COG, respectively. ...
Article
Verticillium wilt is one of the main diseases in cotton (Gossypium hirsutum), severely reduces yield and fiber quality, and is difficult to control effectively. At present, the molecular mechanism that confers resistance to this disease is unclear. Transcriptome sequencing is an important method to detect resistance genes, explore metabolic pathways, and study resistance mechanisms. In this study, the transcriptome of a disease-resistant inbred cotton line inoculated with Verticillium dahliae was sequenced. A total of 126,402 unigenes were obtained using de novo assembly and data analysis, 99,712 (78.88%) of which were annotated into the Nr, Nt, Swiss-Prot, KEGG, COG, and GO databases. The expression patterns of 16 candidate disease-resistance genes showed that some genes were upregulated soon after V. dahliae inoculation and others were upregulated later, which may indicate instantaneous basal defense and lagged specific defense, respectively. We conducted a preliminary analysis of the transcriptome database, which will contribute to further research regarding the cloning of disease-resistance genes.
... Functional annotation was assigned using the protein (Nr and Swiss-Prot), Clusters of Orthologous Groups (COG) and Gene Ontology (GO) databases. BLASTX was employed to identify related sequences in the protein databases based on E-values of less than 10^-5 [66]. In addition, all transcriptome sequences were reannotated using the NCBI genome databases or the EST database. ...
Article
Whole genome duplication, associated with the induction of widespread genetic changes, has played an important role in the evolution of many plant taxa. All extant angiosperm species have undergone at least one polyploidization event, forming either an auto- or allopolyploid organism. Compared with allopolyploidization, however, few studies have examined autopolyploidization, and few studies have focused on the response of genetic changes to autopolyploidy. In the present study, newly synthesized C. nankingense autotetraploids (Asteraceae) were employed to characterize the genome shock following autopolyploidization. Available evidence suggested that the genetic changes primarily involved the loss of old fragments and the gain of novel fragments, and some novel sequences were potential long terminal repeat (LTR) retrotransposons. As Ty1-copia and Ty3-gypsy elements represent the two main superfamilies of LTR retrotransposons, the dynamics of Ty1-copia and Ty3-gypsy were evaluated using RT-PCR, transcriptome sequencing, and LTR retrotransposon-based molecular marker techniques. Additionally, fluorescence in situ hybridization (FISH) results suggest that autopolyploidization might also be accompanied by perturbations of LTR retrotransposons, and emergence retrotransposon insertions might show more rapid divergence, resulting in diploid-like behaviour, potentially accelerating the evolutionary process among progenies. Our results strongly suggest a need to expand the current evolutionary framework to include a genetic dimension when seeking to understand genomic shock following autopolyploidization in Asteraceae.
... The unigenes were aligned with BLASTX against the NCBI non-redundant (NR) proteins [17,18]. The proteins with highest sequence similarity were retrieved and annotated to each unigene. ...
Article
The adzuki bean (Vigna angularis (Ohwi) Ohwi and Ohashi) is an important grain legume of Asia. It is cultivated mainly in China, Japan and Korea. Despite its importance, few genomic resources are available for molecular genetic research of adzuki bean. In this study, we developed EST-SSR markers for the adzuki bean through next-generation sequencing. More than 112 million high-quality cDNA sequence reads were obtained from adzuki bean using Illumina paired-end sequencing technology, and the sequences were de novo assembled into 65,950 unigenes. The average length of the unigenes was 1,213 bp. Among the unigenes, 14,547 sequences contained a unique simple sequence repeat (SSR) and 3,350 sequences contained more than one SSR. A total of 7,947 EST-SSRs were identified as potential molecular markers, with mono-nucleotide A/T repeats (99.0%) as the most abundant motif class, followed by AG/CT (68.4%), AAG/CTT (30.0%), AAAG/CTTT (26.2%), AAAAG/CTTTT (16.1%), and AACGGG/CCCGTT (6.0%). A total of 500 SSR markers were randomly selected for validation, of which 296 markers produced reproducible amplicons with 38 polymorphic markers among the 32 adzuki bean genotypes selected from diverse geographical locations across China. The large number of SSR-containing sequences and EST-SSR markers will be valuable for genetic analysis of the adzuki bean and related Vigna species.
... In addition, paired-end Illumina reads obtained from the RNA-seq analysis (described below) were incorporated to error-correct homopolymers in the initial 454 assembly using iCORN v. 1.0 (Otto et al. 2010). Transcriptome annotation was conducted using BLASTx v. 2.2.26+ (Altschul et al. 1997; Cameron et al. 2004) against the following databases: UniRef90 (Sept. 2012), Drosophila melanogaster (FlyBase release 5.47), Tribolium castaneum (BeetleBase OGS3), and the Arthropoda subset of the nonredundant (nr) protein database (Sept. ...
Article
The western corn rootworm (WCR, Diabrotica virgifera virgifera LeConte) is an important pest of corn. Annual crop rotation between corn and soybean disrupts the corn-dependent WCR lifecycle and is widely adopted to manage this pest. This strategy selected for rotation-resistant (RR) WCR with reduced ovipositional fidelity to corn. Previous studies revealed that RR-WCR adults exhibit greater tolerance of soybean diets, different gut physiology and host-microbe interactions compared to rotation-susceptible wild-types (WT). To identify genetic mechanisms underlying these phenotypic changes, a de novo assembly of the WCR adult gut transcriptome was constructed and used for RNA-sequencing analyses of RNA libraries from different WCR phenotypes fed with corn or soybean diets. Global gene expression profiles of WT- and RR-WCR were similar when feeding on corn diets, but different when feeding on soybean. Using network-based methods, we identified gene modules transcriptionally correlated with the RR phenotype. Gene Ontology enrichment analyses indicated that the functions of these modules were related to metabolic processes, immune responses, biological adhesion, and other functions/processes that appear to correlate to documented traits in RR populations. These results suggest that gut transcriptomic divergence correlated with brief soybean feeding and other physiological traits may exist between RR and WT-WCR adults.
... The unigenes are divided into either clusters or singletons. BLASTX [23] alignment (applying an E-value of less than 10^-5) between each unigene sequence and those lodged in Nr (non-redundant protein database, NCBI), Nt (non-redundant nucleotide database, NCBI), Swiss-Prot, GO (gene ontology, http://www.geneontology.org/) and COG (clusters of orthologous groups) databases were performed, and the best alignments used to infer the unigene's directionality. Where the outcome from the various databases conflicted with one another, the priority order applied was Nr, Swiss-Prot, COG. ...
Article
MicroRNAs (miRNAs) play important roles in plant responses to environmental stress. In this work, we used high-throughput sequencing to analyze transcriptome and small RNAs (sRNAs) in Typha angustifolia under cadmium (Cd) stress. 57,608,230 raw reads were obtained from deep sequencing of a pooled cDNA library. Sequence assembly and analysis yielded 102,473 unigenes. We subsequently sequenced two sRNA libraries from T. angustifolia with or without Cd exposure respectively. Based on transcriptome data of T. angustifolia, we catalogued and analyzed the sRNAs, resulting in the identification of 114 conserved miRNAs and 41 novel candidate miRNAs in both small RNA libraries. In silico analysis revealed 764 targets for 89 conserved miRNAs and 21 novel miRNAs. Statistical analysis on sequencing reads abundance and experimental validation revealed that 4 conserved and 6 novel miRNAs showed specific expression. Combined with function of target genes, these results suggested that miRNAs might play a role in plant Cd stress response. This study provided the first transcriptome-based analysis of miRNAs and their targets responsive to Cd stress in T. angustifolia, which provide a framework for further analysis of miRNAs and their role in regulating plant responses to Cd stress.
... After gene family clustering, the final obtained unigenes were divided into either clusters (shared more than 70% similarity) or singletons. Finally, a Blastx [19] alignment (E-value < 10⁻⁵) was performed between the unigenes and various protein databases, such as the non-redundant protein (nr) database (http://www.ncbi.nlm.nih.gov), the Swiss-Prot protein database (http://www.expasy.ch/sprot), the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database (http://www.genome.jp/kegg) and the Cluster of Orthologous Groups (COG) database (http://www.ncbi.nlm.nih.gov/COG). ...
Article
Full-text available
Background: Stipa grandis (Poaceae) is one of the dominant species in a typical steppe of the Inner Mongolian Plateau. However, primarily due to heavy grazing, the grasslands have become seriously degraded, and S. grandis has developed a special growth-inhibition phenotype against the stressful habitat. Because of the lack of transcriptomic and genomic information, the understanding of the molecular mechanisms underlying the grazing response of S. grandis has been prohibited. Results: Using the Illumina HiSeq 2000 platform, two libraries prepared from non-grazing (FS) and overgrazing samples (OS) were sequenced. De novo assembly produced 94,674 unigenes, of which 65,047 unigenes had BLAST hits in the National Center for Biotechnology Information (NCBI) non-redundant (nr) database (E-value < 10⁻⁵). In total, 47,747, 26,156 and 40,842 unigenes were assigned to the Gene Ontology (GO), Clusters of Orthologous Group (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, respectively. A total of 13,221 unigenes showed significant differences in expression under the overgrazing condition, with a threshold false discovery rate ≤ 0.001 and an absolute value of log2Ratio ≥ 1. These differentially expressed genes (DEGs) were assigned to 43,257 GO terms and were significantly enriched in 32 KEGG pathways (q-value ≤ 0.05). The alterations in the wound-, drought- and defense-related genes indicate that stressors have an additive effect on the growth inhibition of this species. Conclusions: This first large-scale transcriptome study will provide important information for further gene expression and functional genomics studies, and it facilitated our investigation of the molecular mechanisms of the S. grandis grazing response and the associated morphological and physiological characteristics.
... The annotation of unigenes was performed using various bioinformatics procedures. The unigenes were aligned with BLASTX to four protein databases (NCBI non-redundant or Nr proteins, Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes or KEGG and euKaryotic Ortholog Groups or KOG) and one nucleotide database (NCBI nucleotide or Nt sequences) with an E-value threshold of 1.0E-5 for all except KOG with a threshold of 1.0E-3 [25,26]. The proteins with highest sequence similarity were retrieved and annotated to each unigene. ...
Article
Full-text available
Mung bean (Vigna radiata (L.) Wilczek) is an important traditional food legume crop, with high economic and nutritional value. It is widely grown in China and other Asian countries. Despite its importance, genomic information is currently unavailable for this crop plant species or some of its close relatives in the Vigna genus. In this study, more than 103 million high quality cDNA sequence reads were obtained from mung bean using Illumina paired-end sequencing technology. The processed reads were assembled into 48,693 unigenes with an average length of 874 bp. Of these unigenes, 25,820 (53.0%) and 23,235 (47.7%) showed significant similarity to proteins in the NCBI non-redundant protein and nucleotide sequence databases, respectively. Furthermore, 19,242 (39.5%) could be classified into gene ontology categories, 18,316 (37.6%) into Swiss-Prot categories and 10,918 (22.4%) into KOG database categories (E-value < 1.0E-5). A total of 6,585 (8.3%) were mapped onto 244 pathways using the Kyoto Encyclopedia of Genes and Genome (KEGG) pathway database. Among the unigenes, 10,053 sequences contained a unique simple sequence repeat (SSR), and 2,303 sequences contained more than one SSR together in the same expressed sequence tag (EST). A total of 13,134 EST-SSRs were identified as potential molecular markers, with mono-nucleotide A/T repeats being the most abundant motif class and G/C repeats being rare. In this SSR analysis, we found five main repeat motifs: AG/CT (30.8%), GAA/TTC (12.6%), AAAT/ATTT (6.8%), AAAAT/ATTTT (6.2%) and AAAAAT/ATTTTT (1.9%). A total of 200 SSR loci were randomly selected for validation by PCR amplification as EST-SSR markers. Of these, 66 marker primer pairs produced reproducible amplicons that were polymorphic among 31 mung bean accessions selected from diverse geographical locations.
The large number of SSR-containing sequences found in this study will be valuable for the construction of high-resolution genetic linkage maps, association or comparative mapping and genetic analyses of various Vigna species.
... Protein-coding sequences were found from the different contigs. Blastp was used to compare the predicted protein-coding sequences in GenBank NR, GO (gene ontology), KEGG (Kyoto Encyclopedia of Genes and Genomes), and KOG (The Eukaryotic Clusters of Orthologous Groups) database with an E value threshold of 10⁻⁵ (Altschul et al., 1997; Cameron et al., 2004). The best matching one was chosen as the annotation of each contig. ...
Article
Full-text available
Taro (Colocasia esculenta) is an important crop in Africa, Southeast Asia and the subtropics, as it is used for food and medicine. The aims of this study were to access a large expression datasets of taro and reveal the candidate genes of starch synthesis. As a result, approximately 2.2 Gb sequence data of taro transcriptome were obtained by using Illumina HiSeq 2000 platform and assembled into 52,935 contigs with an average length of 588.5 bp. Sequence similarity analyses against four public databases (NR, GO, KEGG, KOG) found 17,047 contigs that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Among the important metabolic pathways, 26 genes related to starch synthesis were validated by RT-PCR. This transcriptome dataset can serve as an important public information platform for further studies in gene expression, genomics, and functional genomic studies in C. esculenta.
... Sequence divergence analysis was based on gene orthologues. Putative orthologous gene families were identified from all-against-all protein similarities using BLASTP [67] concatenated protein alignment of 729 orthologue families from 15 species was created for phylogenetic reconstruction and molecular dating. Gblocks [68] was used to remove the less-conserved sites. ...
Article
Full-text available
Vertebrates diverged from other chordates ~500 Myr ago and experienced successful innovations and adaptations, but the genomic basis underlying vertebrate origins are not fully understood. Here we suggest, through comparison with multiple lancelet (amphioxus) genomes, that ancient vertebrates experienced high rates of protein evolution, genome rearrangement and domain shuffling and that these rates greatly slowed down after the divergence of jawed and jawless vertebrates. Compared with lancelets, modern vertebrates retain, at least relatively, less protein diversity, fewer nucleotide polymorphisms, domain combinations and conserved non-coding elements (CNE). Modern vertebrates also lost substantial transposable element (TE) diversity, whereas lancelets preserve high TE diversity that includes even the long-sought RAG transposon. Lancelets also exhibit rapid gene turnover, pervasive transcription, fastest exon shuffling in metazoans and substantial TE methylation not observed in other invertebrates. These new lancelet genome sequences provide new insights into the chordate ancestral state and the vertebrate evolution.
... Carbohydrate utilization enzymes were identified from UniProt (Apweiler et al., 2004; Consortium, 2013, 2014), and BLASTp analysis (Cameron et al., 2004) was used to identify orthologs in the genomes. Neighborhood analysis was performed using IMG tools (Markowitz et al., 2012) to determine clusters and manually curate the electronic annotations. ...
Article
Full-text available
In this work we report the whole genome sequences of six new Geobacillus xylanolytic strains along with the genomic analysis of their capability to degrade carbohydrates. The six sequenced Geobacillus strains described here have a range of GC contents from 43.9% to 52.5% and clade with named Geobacillus species throughout the entire genus. We have identified a ~200 kb unique super-cluster in all six strains, containing five to eight distinct carbohydrate degradation clusters in a single genomic region, a feature not seen in other genera. The Geobacillus strains rely on a small number of secreted enzymes located within distinct clusters for carbohydrate utilization, in contrast to most biomass-degrading organisms which contain numerous secreted enzymes located randomly throughout the genomes. All six strains are able to utilize fructose, arabinose, xylose, mannitol, gluconate, xylan, and α-1,6-glucosides. The gene clusters for utilization of these seven substrates have identical organization and the individual proteins have a high percent identity to their homologs. The strains show significant differences in their ability to utilize inositol, sucrose, lactose, α-mannosides, α-1,4-glucosides and arabinan.
... So how to securely and effectively use gene information is a meaningful and challenging work. In the plaintext, there are many algorithms to realize the match of gene sequence, such as Smith-Waterman [1], FAST [2], FASTA [3], and BLAST [4][5][6][7]. However, they are usually not suitable in the encrypted domain, because they usually need to communicate with the server, which is against the requirement of privacy in the cloud computing. ...
Article
Human genome project is a grand scale scientific work, which aims at measuring three billion base pairs in human chromosomes (haploid). It brings a great challenging task to store and utilize these gene sequences (GS) securely and effectively. With the development of the cloud computing, the storage of gene information can be out of consideration. However, their secure utilization still puzzles data owners and data users. One popular way is to encrypt these GS and construct searchable indexes for secure retrieval. In this paper, we first define and solve the problem of privacy-preserving outsourced gene data search in encryption domain. We transfer GS into numerical vectors by reasonable mapping for ease of similarity calculation. We employ secure KNN algorithm to encrypt the query, index, and gene data and compute relevance scores securely. We test our scheme through a real-world dataset: plant GS from National Center of Biotechnology Information. Extensive experiments are conducted to demonstrate the efficiency of the proposed scheme.
... According to the annotation results against the NCBI nonredundant (Nr) database using the BLASTx algorithm [19,20] with a cutoff E-value of 10⁻⁵, many putative proteins were identified, such as WRKY DNA-binding protein 1 (Bra023983), cytochrome P450 (Bra010598), expansin-like A2 (Bra033563), 3-ketoacyl-CoA synthase 12 (Bra035683), PLAT/LH2 family protein (Bra030871), pathogenesis-related thaumatin-like protein (Bra015659), and alpha-dioxygenase 1 (Bra039120). In addition, we found 24 unannotated ...
Article
Full-text available
Due to the visual appearance and high carotenoid content, orange inner leaves are a desirable trait for the Chinese cabbage. To understand the molecular mechanism underlying the formation of orange inner leaves, the BrCRTISO (Bra031539) gene, as the Br-or candidate gene, was analyzed among the white and orange varieties, and 7 single nucleotide polymorphisms (SNPs) were identified. However, only one SNP (C952 to T952) altered the amino acid sequence, resulting in a mutation from Leu318 to Phe318 in the orange varieties. Additionally, we analyzed differentially expressed genes (DEGs) between the orange and white F2 individuals (14-401 × 14-490) and found four downregulated genes were involved in the carotenoid biosynthesis pathway, which may lead to the accumulation of prolycopene and other carotenoid pigments in the orange inner leaves. In addition, we developed a novel InDel marker in the first intron, which cosegregates with the phenotypes of orange color inner leaves. In conclusion, these findings enhance our understanding of the underlying mechanism of pigment accumulation in the inner leaves of the Chinese cabbage. Additionally, the SNP (C952 to T952) and the InDel marker will facilitate the marker-assisted selection during Chinese cabbage breeding.
... The others were singletons (which have the prefix "Unigene"). BLASTX [20] alignment between each unigene sequence and those registered in the Nr (non-redundant protein database, NCBI), Nt (non-redundant nucleotide database, NCBI), Swiss-Prot, and GO (gene ontology, http://www.geneontology.org/) ...
Article
Full-text available
Chrysanthemum crassum is a decaploid species of Chrysanthemum with high stress tolerance that allows survival under salinity stress while maintaining a relatively ideal growth rate. We previously recorded morphological changes after salt treatment, such as the expansion of leaf cells. To explore the underlying salinity tolerance mechanisms, we used an Illumina platform and obtained three sequencing libraries from samples collected after 0 h, 12 h and 24 h of salt treatment. Following de novo assembly, 154,944 transcripts were generated, and 97,833 (63.14%) transcripts were annotated, including 55 Gene Ontology (GO) terms and 128 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The expression profile of C. crassum was globally altered after salt treatment. We selected functional genes and pathways that may contribute to salinity tolerance and identified some factors involved in the salinity tolerance strategies of C. crassum, such as signal transduction, transcription factors and plant hormone regulation, enhancement of energy metabolism, functional proteins and osmolyte synthesis, reactive oxygen species (ROS) scavenging, photosystem protection and recovery, and cell wall protein modifications. Forty-six genes were selected for quantitative real-time polymerase chain reaction detection, and their expression patterns were shown to be consistent with the changes in their transcript abundance determined by RNA sequencing.
... Often short sequences also struggle to match database entries due to the degree of specificity needed in the BlastX similarity search, which relies partially on the size of the known sequence (BLAST, 2009) (Cameron et al., 2004). In the current analysis, 55,965 unigenes are below 500 bp in size for infected K. alvarezii, while healthy K. alvarezii has a high number of unigenes over 900 bp in length (Zhang et al., 2015). ...
Article
This transcriptomic study of ice-ice diseased Kappaphycus alvarezii was done using Illumina sequencing technology. As many as 46 million raw reads were generated and, after filtering the data, 44 million clean reads were obtained. Further, 59,942 unigenes were obtained through de novo assembly with a mean length of 304 bp. Gene functional annotations were generated with respect to the non-redundant database (Nr), nucleotide sequence (Nt), Protein family (Pfam), Kyoto Encyclopedia of Genes and Genomes database (KEGG), SwissProt, Eukaryotic Orthologous Groups (KOG) database and the Gene Ontology database (GO). Of the unigenes, 47,725 (79.61%) were described in at least one database and 3351 unigenes were described in all databases. Functional annotation with GO assigned 24,903 unigenes to three main domains. Gene interaction was scored using the KEGG pathway database with 125 pathways in which 8394 unigenes were recognized. The study of enrichment of functional gene expression with a value greater than the FPKM average can reveal several essential pathways. Ice-ice diseased K. alvarezii mRNA transcripts were studied by Illumina sequencing technology for the potential changes in the level of genes and gene expressions.
... The KEGG classification was performed using the Path_finder with default set and the KEGG Automatic Annotation Server (Moriya et al., 2007). The Blastx alignment (Altschul et al., 1997; Cameron et al., 2004) (E-value < 0.00001) between Unigenes and protein databases, like NR, Swiss-Prot, KEGG, and COG, were performed, and the best aligning results were used to decide the sequence direction of Unigenes by blast. If the results of different databases conflicted with each other, a priority order of NR, Swiss-Prot, KEGG, and COG was followed when deciding the sequence direction of Unigenes. ...
Article
The deep-sea hydrothermal vent is a special ecosystem, which is different from terrestrial or coastal ecosystems. Rimicaris exoculata, which adapts well to several deep-sea hydrothermal vent environments, is the ideal model for studying hydrothermal vent fauna. In the present study, we obtained R. exoculata from a newly found hydrothermal vent in the south Mid-Atlantic Ridge, and the Illumina next-generation sequencing and de novo assembly were performed by Beijing Genomics Institution. A total of 17,258 annotated Unigenes were obtained. Several Unigenes associated with sulfide metabolism, which might contribute to well adaptation to high concentration of sulfide for R. exoculata, were annotated. This study is the first report on the high-throughput sequencing of R. exoculata. Our data can allow for further studies on the ability of R. exoculata adaptation to harsh conditions and provide abundant gene resources for research and development.
... Functional annotation was assigned using the protein (Nr and Swiss-Prot), Clusters of Orthologous Groups (COG) and Gene Ontology (GO) databases. BLASTX was employed to identify related sequences in the protein databases based on E-values of less than 10⁻⁵ [66]. In addition, all transcriptome sequences were re-annotated using the rice genome databases or the EST database. ...
Preprint
Full-text available
Background Whole genome duplication, associated with the induction of widespread genetic changes, has played an important role in the evolution of many plant taxa. The majority of extant angiosperm species have undergone at least one polyploidization event, forming either an auto- or allopolyploid organism. Compared with allopolyploidization, however, few studies have examined autopolyploidization, and almost no studies have focused on the response of genetic changes to autopolyploidy. Results In the present study, newly synthesized C. nankingense autotetraploids (Asteraceae) were employed to characterize the genome shock following autopolyploidization. Available evidence suggested that the genetic changes primarily involved the loss of old fragments and the gain of novel fragments, and some novel sequences were potential long terminal repeat (LTR) retrotransposons. As Ty1-copia and Ty3-gypsy elements represent the two main superfamilies of LTR retrotransposons, the dynamics of Ty1-copia and Ty3-gypsy were evaluated using RT-PCR, transcriptome sequencing and LTR retrotransposon-based molecular marker techniques. These results suggest that autopolyploidization might also be accompanied by perturbations of LTR retrotransposons, and the emergence retrotransposon insertions might show more rapid homologue divergence, resulting in diploid-like behaviour, potentially accelerating the evolutionary process among progenies. Conclusions Our results strongly suggest a need to expand current evolutionary framework to encompass a genetic dimension when seeking to understand genomic shock following autopolyploidization in Asteraceae.
... Cluster of Orthologous Groups of proteins (COG) database using blastx (Michael Cameron & Cannane, 2004;Stephen et al., 1997), with an E-value threshold of 1e−5. ...
Article
Full-text available
The edible silver carp (Hypophthalmichthys molitrix) and bighead carp (H. nobilis), which are two of the “Four Domesticated Fish” of China, are cultivated intensively worldwide. Here, we constructed 837 Mb and 845 Mb draft genome assemblies for the silver carp and the bighead carp, respectively, including 24,571 and 24,229 annotated protein‐coding genes. Genetic maps, anchoring 71.7% and 83.8% of all scaffolds, were obtained for the silver and bighead carp, respectively. Phylogenetic analysis showed that the bighead carp formed a clade with the silver carp, with an estimated divergence time of 3.6 MYA; the time of divergence between the silver carp and zebrafish was 50.7 MYA. An East Asian cyprinid genome‐specific chromosome fusion took place approximately 9.2 million years after this clade diverged from the clade containing the common carp and Sinocyclocheilus. KEGG and GO analyses indicated that the expanded gene families in the silver and bighead carp were associated with diseases, the immune system, and environmental adaptations. Genomic regions differentiating the silver and bighead carp populations were detected based on the whole‐genome sequences of 42 individuals. Genes associated with the divergent regions were associated with reproductive system development and the development of primary female sexual characteristics. Thus, our results provided a novel systematic genomic analysis of the East Asian cyprinids, as well as the evolution and speciation of the silver carp and bighead carp.
... The unigenes were searched for in the Kyoto Encyclopedia of Genes and Genomes (KEGG), NCBI Nucleotide (NT), Pfam, Gene Ontology (GO), NCBI non-redundant Protein (NR), UniProt and Evolutionary genealogy of genes: Non-supervised Orthologous Groups (EggNOG) using BLASTN of NCBI BLAST [69] and BLASTX of the software DIAMOND [70] with an E-value default cut-off of 1.0E-5. ...
Article
Full-text available
Jojoba is one of the two main known plant sources of natural liquid wax ester for use in various applications, including cosmetics, pharmaceuticals, and biofuel. Due to the lack of transcriptomic and genomic data on lipid biosynthesis and accumulation, molecular marker breeding has been used to improve jojoba oil production and quality. In the current study, the transcriptome of developing jojoba seeds was investigated using the Illumina NovaSeq 6000 system, 100 × 10⁶ paired-end reads, an average length of 100 bp, and a sequence depth of 12 Gb per sample. A total of 176,106 unigenes were detected with an average contig length of 201 bp. Gene Ontology (GO) showed that the detected unigenes were distributed in the three GO groups biological processes (BP, 5.53%), cellular component (CC, 6.06%), and molecular functions (MF, 5.88%) and distributed in 67 functional groups. The lipid biosynthesis pathway was established based on the expression of lipid biosynthesis genes, fatty acid (FA) biosynthesis, FA desaturation, FA elongation, fatty alcohol biosynthesis, triacylglycerol (TAG) biosynthesis, phospholipid metabolism, wax ester biosynthesis, and lipid transfer and storage genes. The detection of these categories of genes confirms the presence of an efficient lipid biosynthesis and accumulation system in developing jojoba seeds. The results of this study will significantly enhance the current understanding of wax ester biology in jojoba seeds and open new routes for the improvement of jojoba oil production and quality through biotechnology applications.
... This variant gives identical results to Algorithm I (assuming that a_D ≤ b_D and a_I ≤ b_I), but is more efficient (fewer operations). There are even more efficient variants (Cameron et al., 2004; Farrar, 2007; Rognes, 2011; Suzuki and Kasahara, 2018; Zhang et al., 1997), discussed in the Supplementary Material. ...
Article
Full-text available
Motivation: Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions, and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts, and measure sequence similarity integrated over possible alignments. Results: This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a "temperature" parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias towards either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. Supplementary information: Supplementary data are available at Bioinformatics online.
... Several different public protein databases were used to validate and annotate the assembled unigenes for assigning gene names, coding sequences (CDS) and predicting protein annotations. The sequence-based alignments were mapped against the NCBI NR protein database, Swiss-Prot protein database, KEGG and COG using the BLASTx algorithm [34,35] with an E-value threshold of 1e−5. The priority order of NR, Swiss-Prot, KEGG, and COG was set. ...
Article
Full-text available
Chinese sturgeon (Acipenser sinensis), a critically endangered Acipenseridae family member, is one of the largest anadromous, native fish in China. Numerous research programmes and protection agencies have focused on breeding and preserving this endangered species. However, available information is limited on the different stages of sex development, especially on the reproductive regulation of the hypothalamus-pituitary-gonad (HPG) axis of A. sinensis. To unravel the mechanism of gene interactions during sex differentiation and gonad development of A. sinensis, we performed transcriptome sequencing using HPG samples from male and female A. sinensis in two developmental stages. In this study, 271.19 Gb high-quality transcriptome data were obtained from 45 samples belonging to 15 individuals (six in stage I, six males and three females in stage II). These transcriptomic data will help us understand the reproductive regulation of the HPG axis in the development stages of A. sinensis and provide important reference data for genomic and genetic studies in A. sinensis and related species.
... This variant gives identical results to Algorithm I (assuming that a_D ≤ b_D and a_I ≤ b_I), but is more efficient (fewer operations). There are even more efficient variants [29,5,10,21,23], discussed in the Supplement. ...
Preprint
Full-text available
Sequence alignment remains fundamental in bioinformatics. Pairwise alignment is traditionally based on ad hoc scores for substitutions, insertions, and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts, and measure sequence similarity integrated over possible alignments. This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a “temperature” parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias towards either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment.
... All the SLAF sequences carrying the trait-associated SNP loci were aligned with the available chrysanthemum 'Jinba' transcriptome using BLASTX (Cameron et al. 2004) based on an E-value of less than 1.0E−5 to identify potential candidate genes. The library of 'Jinba' transcriptome can be accessed by the accession number PRJNA312892 in the NCBI repository. ...
Article
Full-text available
Key message 81 SNPs were identified for three inflorescence-related traits, in which 15 were highly favorable. Two dCAPS markers were developed for future MAS breeding, and six candidate genes were predicted. Abstract Chrysanthemum is a leading ornamental species worldwide and demonstrates a wealth of morphological variation. Knowledge about the genetic basis of its phenotypic variation for key horticultural traits can contribute to its effective management and genetic improvement. In this study, we conducted a genome-wide association study (GWAS) based on two years of phenotype data and a set of 92,617 single nucleotide polymorphisms (SNPs) using a panel of 107 diverse cut chrysanthemums to dissect the genetic control of three inflorescence-related traits. A total of 81 SNPs were significantly associated with the three inflorescence-related traits (capitulum diameter, number of ray florets and flowering time) in at least one environment, with an individual allele explaining 22.72–38.67% of the phenotypic variation. Fifteen highly favorable alleles were identified for the three target traits by computing the phenotypic effect values for the stable associations detected in 2 year-long trials at each locus. Dosage pyramiding effects of the highly favorable SNP alleles and significant linear correlations between highly favorable allele numbers and corresponding phenotypic performance were observed. Two highly favorable SNP alleles correlating to flowering time and capitulum diameter were converted to derived cleaved amplified polymorphic sequence (dCAPS) markers to facilitate future breeding. Finally, six putative candidate genes were identified that contribute to flowering time and capitulum diameter. These results serve as a foundation for analyzing the genetic mechanisms underlying important horticultural traits and provide valuable insights into molecular marker-assisted selection (MAS) in chrysanthemum breeding programs.
... To annotate the A. tenuissima transcriptome, unigenes were searched against various databases, including NCBI non-redundant protein (Nr protein), Swiss-Prot, euKaryotic Orthologous Groups (KOG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Altschul et al., 1997; Cameron et al., 2004). Blast2GO software (Conesa et al., 2005) was used to assign the Gene Ontology (GO) terms to the unigenes. ...
Article
Full-text available
A total of 32,284 unigenes were obtained from the transcriptome of Alternaria tenuissima, a pathogenic fungus causing foliar disease in tomato, using next-generation sequencing (NGS) technology. In total, 24,670 unigenes were annotated using five databases, including NCBI non-redundant protein, Swiss-Prot, euKaryotic Orthologous Groups, Kyoto Encyclopedia of Genes and Genomes, and the Gene Ontology. A total of 1,140 simple sequence repeats were also identified for use as molecular markers. Sixteen of the simple sequence repeat loci were selected to study the population structure of A. tenuissima. A population genetic analysis of 191 A. tenuissima isolates, sampled from four geographic regions in China, indicated that A. tenuissima had a high level of genetic diversity, and that the selected simple sequence repeat markers could reliably capture the genetic variation. The null hypothesis of random mating was rejected for all four geographic regions in China. Isolation by distance was observed for the entire data set, but not within clusters, which is indicative of barriers to gene flow among geographic regions. The analyses of Bayesian and principal coordinates, however, did not separate four geographic regions into four separate genetic clusters. The different levels of historical migration rates suggest that isolation by distance did not represent a major biological obstacle to the spread of A. tenuissima. The potential epidemic spread of A. tenuissima in China may occur through the transport of plant products or other factors. The presented results provide a basis for a comprehensive understanding of the population genetics of A. tenuissima in China.
... Thresholds are used in similarity functions and clustering algorithms. BLAST uses thresholds in order to determine similarity among sequences [Cameron et al. 2004]. The d2 cluster algorithm developed by Davison [2001] and the Sequence Search Tree algorithm developed by Giladi et al. [2002] among other examples, use thresholds in determining cut-off points on the level of similarity among sequences. ...
... The unigenes were divided into either clusters or singletons. BLASTX [34] alignment between each unigene sequence and those lodged in the Nr, Nt (Nucleotide database, NCBI), Swiss-Prot, GO (http://www.geneontology.org/), and COG (clusters of orthologous groups) databases were performed, and the best alignments were used to infer the directionality of the unigene. Where the outcomes from the various databases conflicted with one another, the priority order applied was: Nr, Swiss-Prot, and COG. ...
Article
Background: Pines are widely distributed in the Northern Hemisphere and have a long evolutionary history. The availability of transcriptome data has facilitated comparative transcriptomics for studying the evolutionary patterns associated with the different geographical distributions of species in the Pinus phylogeny. Results: The transcriptome of Pinus kesiya var. langbianensis was sequenced using the Illumina HiSeq 2000 platform, and a total of 68,881 unigenes were assembled by Trinity. Transcriptome sequences of another 12 conifer species were downloaded from public databases. All of the pairwise orthologues were identified by comparative transcriptome analysis in 13 conifer species, from which the rate of diversification was calculated and a phylogenetic tree inferred. All of the fast-evolving positive selection sequences were identified, and some salt-, drought-, and abscisic acid-resistance genes were discovered. Conclusions: mRNA sequences of P. kesiya var. langbianensis were obtained by transcriptome sequencing, and a large number of simple sequence repeat and short nucleotide polymorphism loci were detected. These data can be used in molecular marker-assisted selection in pine breeding. Divergence times were estimated in the 13 conifer species using comparative transcriptomic analysis. A number of positive selection genes were found to be related to environmental factors. Salt- and abscisic acid-related genes exhibited different selection patterns between coastal and inland Pinus. Our findings help elucidate speciation patterns in the Pinus lineage.
... Functional annotation provided the information on biological function to reveal the metabolic pathway in the organism. The basic local alignment search tool (BLASTX) (Cameron et al. 2005) alignments (applying an E-value of less than 10⁻⁵) between each unigene sequence and those lodged in non-redundant protein database (NR; NCBI), non-redundant nucleotide database (NT; NCBI), Swiss-Prot, gene ontology (GO), clusters of orthologous groups (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases were performed, and the best alignments were used to infer the unigene's directionality. Where the outcomes from the various databases conflicted with one another, priority was given in the following order: NR, Swiss-Prot, and COG. ...
Article
Pinus kesiya var. langbianensis is an important resin resource tree that belongs to the Pinaceae family. It produces a higher yield of resin per year compared to the rest of the pine trees from the same habitat. To identify genes that may be involved in this high resin yield production, the bark transcriptomes of P. kesiya var. langbianensis and a P. kesiya that produce a normal volume of resin were sequenced using RNA-Seq, and their gene expression profiles were compared in regards to specific interest in the resin synthetic metabolism pathways. The results showed that a total of 68,881 transcripts were assembled, 180 of which were involved in terpene metabolism. Surprisingly, in both the transcriptome analysis and the quantitative fluorescent polymerase chain reaction (QFPCR), nine genes involved in resin biosynthesis were shown to be significantly down-regulated in P. kesiya. In addition, this study provided numerous gene candidates for the further study of resin production in pine trees.
... Four types of noncoding RNAs were annotated using tRNAscan-SE 1.23 and RFAM database 9.1 (Griffiths-Jones et al., 2005). The final consensus gene set was obtained by integrating all predicted gene structures using GLEAN (Elsik et al., 2007), and the predicted genes were functionally annotated using the BLASTX algorithm (Altschul et al., 1997;Michael Cameron & Adam, 2004). ...
... Additionally, the Swiss-Prot protein, Protein family (Pfam), Eukaryotic Orthologous Groups of proteins (KOG), Gene Ontology (GO), and the Kyoto Encyclopaedia of Genes and Genomes (KEGG). All databases are used to align assembled unigenes using Blast [145][146][147] (https://blast.ncbi.nlm.nih.gov/Blast.cgi) to obtain the annotated functions of each unigene. With the NR annotation, gene ontology annotations of the unigenes can be acquired using Blast2GO [148] or AmiGO [149]. ...
Article
Abstract: Microsatellites, or simple sequence repeats (SSRs), are one of the most informative and multi-purpose genetic markers exploited in plant functional genomics. However, the discovery of SSRs and development using traditional methods are laborious, time-consuming, and costly. Recently, the availability of high-throughput sequencing technologies has enabled researchers to identify a substantial number of microsatellites at less cost and effort than traditional approaches. Illumina is a noteworthy transcriptome sequencing technology that is currently used in SSR marker development. Although 454 pyrosequencing datasets can be used for SSR development, this type of sequencing is no longer supported. This review aims to present an overview of the next generation sequencing, with a focus on the efficient use of de novo transcriptome sequencing (RNA-Seq) and related tools for mining and development of microsatellites in plants.
... T. brucei prozyme protein sequence (XP_845564.1) was used to query the RefSeq_protein database with PSI-BLAST (Cameron et al., 2004) (default settings, 1000 maximum hits, 3 iterations) to identify AdoMetDC representatives (947 sequences). Identified sequences were submitted to batch CDsearch (Marchler-Bauer et al., 2015) against the PFAM database to confirm the presence of an AdoMetDC domain (pfam01536) and were analyzed according to taxonomic groups using batch Entrez on the NCBI server. ...
Article
Catalytically inactive enzyme paralogs occur in many genomes. Some regulate their active counterparts but the structural principles of this regulation remain largely unknown. We report X-ray structures of Trypanosoma brucei S-adenosylmethionine decarboxylase alone and in functional complex with its catalytically dead paralogous partner, prozyme. We show monomeric TbAdoMetDC is inactive because of autoinhibition by its N-terminal sequence. Heterodimerization with prozyme displaces this sequence from the active site through a complex mechanism involving a cis-to-trans proline isomerization, reorganization of a β-sheet, and insertion of the N-terminal α-helix into the heterodimer interface, leading to enzyme activation. We propose that the evolution of this intricate regulatory mechanism was facilitated by the acquisition of the dimerization domain, a single step that can in principle account for the divergence of regulatory schemes in the AdoMetDC enzyme family. These studies elucidate an allosteric mechanism in an enzyme and a plausible scheme by which such complex cooperativity evolved.
... Data mining and identification of unigenes were performed using the BLASTX software (Cameron et al., 2004). Biosynthetic pathway analyses were completed using the KEGG database. ...
Article
Green tea (Camellia sinensis, Cs) abundantly produces a diverse array of phenylpropanoid compounds benefiting human health. To date, the regulation of the phenylpropanoid biosynthesis in tea remains to be investigated. Here, we report a cDNA isolated from leaf tissues, which encodes a R2R3-MYB transcription factor. Amino acid sequence alignment and phylogenetic analysis indicate that it is a member of the MYB4-subgroup and named as CsMYB4a. Transcriptional and metabolic analyses show that the expression profile of CsMYB4a is negatively correlated to the accumulation of six flavan-3-ols and other phenolic acids. GFP fusion analysis shows CsMYB4a’s localization in the nucleus. Promoters of five tea phenylpropanoid pathway genes are isolated and characterized to contain four types of AC-elements, which are targets of MYB4 members. Interaction of CsMYB4a and five promoters shows that CsMYB4a decreases all five promoters’ activity. To further characterize its function, CsMYB4a is overexpressed in tobacco plants. The resulting transgenic plants show dwarf, shrinking and yellowish leaf, and early senescence phenotypes. A further genome-wide transcriptomic analysis reveals that the expression levels of 20 tobacco genes involved in the shikimate and the phenylpropanoid pathways are significantly downregulated in transgenic tobacco plants. UPLC-MS and HPLC based metabolic profiling reveals significant reduction of total lignin content, rutin, chlorogenic acid, and phenylalanine in CsMYB4a transgenic tobacco plants. Promoter sequence analysis of the 20 tobacco genes characterizes four types of AC-elements. Further CsMYB4a-AC element and CsMYB4a-promoter interaction analyses indicate that the negative regulation of CsMYB4a on the shikimate and phenylpropanoid pathways in tobacco is via reducing promoter activity. Taken together, all data indicate that CsMYB4a negatively regulates the phenylpropanoid and shikimate pathways.
... Current genome alignment software tools are able to align two genomes very efficiently and with only a small sacrifice in sensitivity [1][2][3][4][5][6][7][8][9]. Yet, it becomes very slow if the extra sensitivity is needed. ...
Article
Background: The recent advancement of whole genome alignment software has made it possible to align two genomes very efficiently and with only a small sacrifice in sensitivity. Yet it becomes very slow if extra sensitivity is needed. This paper proposes a simple but effective method to improve the sensitivity of existing whole-genome alignment software without paying much extra running time. Results and conclusions: We have applied our method to a popular whole genome alignment tool, LAST, and we called the resulting tool LASTM. Experimental results showed that LASTM could find more high quality alignments with a little extra running time. For example, when comparing human and mouse genomes, to produce a similar number of alignments with similar average length and similarity, LASTM was about three times faster than LAST. We conclude that our method can be used to improve sensitivity, that the extra time it takes is small, and that it is therefore worth implementing in existing tools.
Article
Winter wheat can regrow after surviving the cold winter in extremely cold areas, and this type of cold resistance is a very attractive quality. Very little is known regarding the molecular mechanism of cold resistance in winter wheat, particularly the identity of the cold resistance genes. In this study, RNA was extracted from the crown of winter wheat varieties with different cold resistances subjected to various low-temperature treatments. Using the Solexa/Illumina sequencing platform, 60 million sequencing reads were obtained and these reads were assembled into 80,704 unigenes. Based on the method of known protein similarity search, we acquired 51,929 sequences that were consistent with the standard E-value cut-off of 10⁻⁵. Additionally, 22,724 sequences were annotated by gene ontology (GO) term, 31,964 sequences were annotated by Swiss-Prot, 18,764 sequences were clustered into 43 types by Clusters of Orthologous Groups classifications (COG) and 29,553 sequences were assigned to 125 pathways by the Kyoto Encyclopedia of Genes and Genomes pathways (KEGG). Furthermore, transcription factor genes involved in cold and dehydration resistance are more highly expressed in Dongnongdongmai 1, which is cold resistant, compared to Jimai 22, which is cold sensitive. The expression of genes in the Phenylalanine metabolism, Alpha-linolenic acid metabolism, Glutathione metabolism, and Starch and sucrose metabolism pathways was triggered by low temperature (LT). In addition, these pathways differed between the two winter wheat varieties. We performed cluster analysis for the differentially expressed genes. Eight genes were randomly selected for expression quantity validation by quantitative RT-PCR. The results of the gene expression pattern analysis under three low temperature treatments are essentially the same as the DGE data.
In summary, we obtained an extensive transcriptome dataset from winter wheat, a non-model species, which helps identify winter wheat cold resistance genes under low temperature conditions.
Chapter
Sequence alignment serves as the basis for comparing two sequences. The score obtained from the comparison of two sequences, the query and the candidate sequence in the database, is in turn used to retrieve the candidate sequences that are related to the query as evidenced by the value of their similarity score. In real-time searches, the sheer size of the database often precludes obvious and direct approaches and necessitates the development of algorithms that can yield approximate scores in a reasonable amount of turnaround time.
Article
Introduction: The intrinsic disorder prediction field develops, assesses and deploys computational predictors of disorder in protein sequences and constructs and disseminates databases of these predictions. Over 40 years of research resulted in the release of numerous resources. Areas covered: We identify and briefly summarize the most comprehensive collection to date of over 100 disorder predictors. We focus on their predictive models, availability and predictive performance. We categorize and study them from a historical point of view to highlight informative trends. Expert opinion: We find a consistent trend of improvements in predictive quality as newer and more advanced predictors are developed. The original focus on machine learning methods shifted to meta-predictors in the early 2010s, followed by a recent transition to deep learning. The use of deep learners will continue in the foreseeable future given the recent and convincing success of these methods. Moreover, a broad range of resources that facilitate convenient collection of accurate disorder predictions is available to users. They include web servers and standalone programs for disorder prediction, servers that combine prediction of disorder and disorder functions, and large databases of pre-computed predictions. We also point to the need to address the shortage of accurate methods that predict disordered binding regions.
Article
Lily is an important cut-flower and bulb crop in the commercial market. Here, transcriptome profiling of Lilium 'Sorbonne' was conducted through de novo sequencing based on the Illumina platform. This research aims at revealing basic information and data that can be used for applied purposes, especially the molecular regulatory information on flower color formation in lily. In total, 36,920,680 short reads, which corresponded to 3.32 GB of total nucleotides, were produced through transcriptome sequencing. These reads were assembled into 39,636 Unigenes, of which 30,986 were annotated in Nr, Nt, Swiss-Prot, KEGG, COG, GO databases. Based on the three public protein databases, a total of 32,601 coding sequences were obtained. Meanwhile, 19,242 Unigenes were assigned to 128 KEGG pathways. Those with the greatest representation by unique sequences were for "metabolic pathways" (5,406 counts, 28.09 %). Our transcriptome revealed 156 Unigenes that encode key enzymes in the flavonoid biosynthesis pathway including CHS, CHI, F3H, FLS, DFR, etc. MISA software identified 2,762 simple sequence repeats, from which 1,975 primer pairs were designed. Over 2,762 motifs were identified, of which the most frequent was AG/CT (659, 23.86 %), followed by A/T (615, 22.27 %) and CCG/CGG (416, 15.06 %). Based on the results, we believe that the color formation of the Lilium 'Sorbonne' flower was mainly controlled by the flavonoid biosynthesis pathway. Additionally, this research provides initial genetic resources that will be valuable to the lily community for other molecular biology research, and the SSRs will facilitate marker-assisted selection in lily breeding.
Article
Leymus chinensis (Trin.) Tzvel. and Stipa grandis P. Smirn. are dominant species in grassland on the typical steppe of Inner Mongolia. Long-term overgrazing, which is considered to represent multiple stresses, reduces the growth of L. chinensis and S. grandis. To gain an understanding of the molecular mechanisms underlying the responses of these plants to overgrazing, we explored the gene expression profiles of L. chinensis and S. grandis to discover the common features of these dominant plants in response to overgrazing. Using the Illumina RNA-Seq platform, two sequencing libraries prepared from non-grazed (Lc-NG) and overgrazed samples (Lc-OG) of L. chinensis were sequenced. Using Trinity software assembly, 129,087 unigenes with a mean length of 693 bp and an N50 of 1,093 bp were obtained by combining the Lc-NG and Lc-OG data. By comparing differentially expressed genes (DEGs) of L. chinensis with those of S. grandis, we identified 16 Kyoto Encyclopedia of Genes and Genomes pathways and 15 Gene Ontology terms that were significantly enriched with DEGs in both species. Most of these DEG-enriched pathways, for example, phenylpropanoid biosynthesis and flavonoid biosynthesis, were related to stress responses. The results suggest that stress plays an important role in plants’ responsiveness to long-term overgrazing and associated reductions in plant growth. The DEGs shared by these two species will be valuable for further research on key genes and molecular mechanisms involved in plants’ adaptation to overgrazing and associated stresses.
Article
Sequence comparison is one of the main bioinformatics tasks. New sequencing technologies have led to a rapid increase in genomic data and strengthen the need for fast and efficient tools to perform this task. In this thesis, a new algorithm for intensive sequence comparison is proposed. It has been specifically designed to exploit all forms of parallelism of today's microprocessors (SIMD instructions, multi-core architecture). This algorithm is also well suited for hardware accelerators such as FPGA or GPU boards. The algorithm has been implemented in the PLAST software (Parallel Local Alignment Search Tool). Different versions are available according to the data to process (protein and/or DNA). An MPI version has also been developed. Depending on the nature of the data and the type of technology, speedups from 3 to 20 have been measured compared with the reference software, BLAST, with the same level of quality.
Article
Based on the observation that a single mutational event can delete or insert multiple residues, affine gap costs for sequence alignment charge a penalty for the existence of a gap, and a further length-dependent penalty. From structural or multiple alignments of distantly related proteins, it has been observed that conserved residues frequently fall into ungapped blocks separated by relatively nonconserved regions. To take advantage of this structure, a simple generalization of affine gap costs is proposed that allows nonconserved regions to be effectively ignored. The distribution of scores from local alignments using these generalized gap costs is shown empirically to follow an extreme value distribution. Examples are presented for which generalized affine gap costs yield superior alignments from the standpoints both of statistical significance and of alignment accuracy. Guidelines for selecting generalized affine gap costs are discussed, as is their possible application to multiple alignment. Proteins 32:88–96, 1998. Published 1998 Wiley-Liss, Inc.
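As a minimal sketch of the ordinary affine gap costs that this abstract generalizes, the following Python function computes a best local alignment score in which a gap of length k costs `gap_open + k * gap_extend`. It does not implement the generalized costs proposed in the paper. The function name, the match/mismatch scores, and the gap parameters are illustrative assumptions, not values taken from the paper.

```python
def local_affine_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score under affine gap costs (score only).

    Assumed cost model: a gap of length k costs gap_open + k * gap_extend.
    Only two rows are kept, so memory is linear in len(b).
    """
    neg = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)      # best score ending at (i-1, j)
    f_prev = [neg] * (n + 1)    # best score ending at (i-1, j) with a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg] * (n + 1)
        e = neg                 # best score ending at (i, j) with a horizontal gap
        for j in range(1, n + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f_cur[j] = max(f_prev[j] - gap_extend,
                           h_prev[j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# Eight matches (16) minus one gap of length 2 (3 + 2 * 1 = 5).
print(local_affine_score("ACGTACGT", "ACGTTTACGT"))
```

Because a gap's opening cost is paid once, a single two-residue gap is cheaper here than two separate one-residue gaps, which is the behaviour the abstract attributes to affine costs.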
Article
Publisher Summary This chapter discusses the study of local alignment statistics, the distribution of optimal gapped subalignment scores, and the evidence that two parameters are sufficient to describe both the form of this distribution and its dependence on sequence length. Using a random protein model, the relevant statistical parameters are calculated for a variety of substitution matrices and gap costs. An analysis of these parameters elucidates the relative effectiveness of affine as opposed to length-proportional gap costs. Thus, sum statistics provide a method for evaluating sequence similarity that treats short and long gaps differently. By example, the chapter shows how this method has the potential to increase search sensitivity. The statistics described can be applied to the results of fast alignment (FASTA) searches or to those from a variation of the basic local alignment search tool (BLAST) programs.
Article
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic, and statistical refinements permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is described for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
Article
We develop several algorithms for the problem of aligning a DNA sequence with a protein sequence. Our methods account for frameshift errors, but not for introns in the DNA sequence. Thus, they are particularly appropriate for comparing a cDNA sequence that contains sequencing errors with an amino acid sequence or a protein sequence database. We describe techniques for efficient implementation, verify sufficient conditions for equivalence of several definitions of alignment, and discuss experience with these ideas in a new release of the fasta suite of database-searching programs.
Article
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the ‘baseline’ version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence’s amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
Article
We describe an algorithm for aligning two sequences within a diagonal band that requires only O(NW) computation time and O(N) space, where N is the length of the shorter of the two sequences and W is the width of the band. The basic algorithm can be used to calculate either local or global alignment scores. Local alignments are produced by finding the beginning and end of a best local alignment in the band, and then applying the global alignment algorithm between those points. This algorithm has been incorporated into the FASTA program package, where it has decreased the amount of memory required to calculate local alignments from O(NW) to O(N) and decreased the time required to calculate optimized scores for every sequence in a protein sequence database by 40%. On computers with limited memory, such as the IBM-PC, this improvement both allows longer sequences to be aligned and allows optimization within wider bands, which can include longer gaps.
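The idea of restricting the dynamic programme to a diagonal band can be sketched as follows. This is an illustrative Python version, not the FASTA implementation: it assumes a simple linear gap cost, returns only the local score, and keeps one band-wide row at a time. The names `half_width` and `diag` and the scoring values are assumptions made for the example.

```python
def banded_local_score(a, b, half_width, diag=0, match=2, mismatch=-1, gap=2):
    """Best local alignment score using only cells with |(j - i) - diag| <= half_width.

    Each row touches 2 * half_width + 1 cells, so the work is roughly
    len(a) * band width and the working memory is one band-wide row.
    """
    neg = float("-inf")
    width = 2 * half_width + 1
    prev = [neg] * width        # prev[k] is the cell in row i-1, column (i-1)+diag-half_width+k
    best = 0
    for i in range(len(a) + 1):
        cur = [neg] * width
        for k in range(width):
            j = i + diag - half_width + k
            if j < 0 or j > len(b):
                continue        # this band cell falls outside the matrix
            score = 0           # a local alignment may start at any cell
            if i > 0 and j > 0:
                # Diagonal neighbour (i-1, j-1) has the same band offset k.
                s = match if a[i - 1] == b[j - 1] else mismatch
                score = max(score, prev[k] + s)
            if i > 0 and k + 1 < width:
                # Cell above (i-1, j) sits one offset to the right in the previous row.
                score = max(score, prev[k + 1] - gap)
            if j > 0 and k > 0:
                # Cell to the left (i, j-1) sits one offset to the left in this row.
                score = max(score, cur[k - 1] - gap)
            cur[k] = score
            best = max(best, score)
        prev = cur
    return best


a, b = "ACGTACGT", "ACGTTTACGT"
print(banded_local_score(a, b, half_width=0))  # main diagonal only
print(banded_local_score(a, b, half_width=2))  # wide enough to admit a 2-residue gap
```

Widening the band lets the alignment cross more diagonals and therefore include longer gaps, at a cost proportional to the band width, which mirrors the trade-off described in the abstract.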
Article
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
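For readers unfamiliar with the maximal segment pair (MSP) score, the sketch below computes it exhaustively: the highest-scoring pair of equal-length, ungapped segments from two sequences. BLAST itself approximates this quantity heuristically rather than scanning every diagonal, so this Python function only illustrates the definition. The function name and the +1/-1 scoring are assumptions for the example.

```python
def msp_score(a, b, match=1, mismatch=-1):
    """Exhaustive maximal segment pair score (ungapped local alignment).

    Every diagonal d = j - i is scanned with a running sum that resets
    to zero whenever it goes negative, so the best contiguous segment
    on each diagonal is found in a single pass.
    """
    best = 0
    for d in range(-(len(a) - 1), len(b)):
        # Starting cell of diagonal d on the top or left edge of the matrix.
        i, j = (-d, 0) if d < 0 else (0, d)
        run = 0
        while i < len(a) and j < len(b):
            run = max(0, run + (match if a[i] == b[j] else mismatch))
            best = max(best, run)
            i += 1
            j += 1
    return best


print(msp_score("GGACGTGG", "TTACGTTT"))  # the shared segment "ACGT" scores 4
```

The exhaustive scan takes time proportional to the product of the sequence lengths, which is the cost that a word-hit heuristic is meant to avoid.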
Article
An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
Article
An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases. Because of the algorithm's efficiency on many microcomputers, sensitive protein database searches may now become a routine procedure for molecular biologists. The method efficiently identifies regions of similar sequence and then scores the aligned identical and differing residues in those regions by means of an amino acid replaceability matrix. This matrix increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution. The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
Article
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Article
Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention had been paid to the precise algorithms these programs employ and to their relative speeds, there is a constellation of associated issues that are equally important to realize the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases and network access to similarity search services.
Article
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Article
We develop several algorithms for the problem of aligning DNA sequence with a protein sequence. Our methods account for frameshift errors, but not for introns in the DNA sequence. Thus, they are particularly appropriate for comparing a cDNA sequence that suffers from sequencing errors with an amino acid sequence or a protein sequence database. We describe algorithms for computing optimal alignments for several definitions of DNA-protein alignment, verify sufficient conditions for equivalence of certain definitions, describe techniques for efficient implementation, and discuss experience with these ideas in a new release of the FASTA suite of database-searching programs.
Article
Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.
Article
Based on the observation that a single mutational event can delete or insert multiple residues, affine gap costs for sequence alignment charge a penalty for the existence of a gap, and a further length-dependent penalty. From structural or multiple alignments of distantly related proteins, it has been observed that conserved residues frequently fall into ungapped blocks separated by relatively nonconserved regions. To take advantage of this structure, a simple generalization of affine gap costs is proposed that allows nonconserved regions to be effectively ignored. The distribution of scores from local alignments using these generalized gap costs is shown empirically to follow an extreme value distribution. Examples are presented for which generalized affine gap costs yield superior alignments from the standpoints both of statistical significance and of alignment accuracy. Guidelines for selecting generalized affine gap costs are discussed, as is their possible application to multiple alignment.
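To make the ordinary affine scheme concrete, the following is a minimal Python sketch, not the generalized costs that the abstract proposes. A gap of length n is charged one existence penalty plus a per-residue penalty, and a Gotoh-style recursion uses that cost to score a local alignment. The function names, the match/mismatch scores, and the penalties of 11 and 1 are illustrative assumptions rather than values taken from the paper.

```python
def affine_gap_cost(length, gap_open=11, gap_extend=1):
    """Cost of one gap: a fixed existence penalty plus a length-dependent penalty.

    The default penalties are illustrative assumptions.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def local_align_score(a, b, match=2, mismatch=-1, gap_open=11, gap_extend=1):
    """Best local alignment score of a and b under affine gap costs.

    Score-only Gotoh-style recursion that keeps two rows in memory.
    The simple match/mismatch scoring stands in for a substitution matrix.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    prev_h = [0] * cols        # best score of an alignment ending at (i-1, j)
    prev_e = [neg_inf] * cols  # same, but ending in a gap that skips residues of a
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * cols
        cur_e = [neg_inf] * cols
        f = neg_inf            # best score ending in a gap that skips residues of b
        for j in range(1, cols):
            # Either open a new gap (existence + first residue) or extend one.
            cur_e[j] = max(prev_h[j] - gap_open - gap_extend,
                           prev_e[j] - gap_extend)
            f = max(cur_h[j - 1] - gap_open - gap_extend, f - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never scores below zero.
            cur_h[j] = max(0, prev_h[j - 1] + s, cur_e[j], f)
            best = max(best, cur_h[j])
        prev_h, prev_e = cur_h, cur_e
    return best
```

Under these assumed penalties a three-residue gap costs 14 while three separate one-residue gaps cost 36, which reflects the observation that a single mutational event can insert or delete several residues at once.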
Article
The distribution of optimal local alignment scores of random sequences plays a vital role in evaluating the statistical significance of sequence alignments. These scores can be well described by an extreme-value distribution. The distribution’s parameters depend upon the scoring system employed and the random letter frequencies; in general they cannot be derived analytically, but must be estimated by curve fitting. For obtaining accurate parameter estimates, a form of the recently described ‘island’ method has several advantages. We describe this method in detail, and use it to investigate the functional dependence of these parameters on finite-length edge effects.
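The abstract gives no formulas, so the sketch below assumes the standard Karlin-Altschul form of the extreme-value statistics: the expected number of chance alignments scoring at least S is E = K·m·n·exp(-λS), and the probability of seeing at least one is P = 1 - exp(-E). The function names are hypothetical, and λ and K are passed in by the caller because, as the abstract notes, they generally have to be estimated by curve fitting for each scoring system.

```python
import math


def evalue(score, m, n, lam, k):
    """Expected number of chance local alignments with score >= `score`.

    m and n are the query and database lengths; lam and k are the
    extreme-value parameters, supplied by the caller (assumed known).
    """
    return k * m * n * math.exp(-lam * score)


def pvalue(score, m, n, lam, k):
    """Probability of at least one chance alignment with score >= `score`."""
    # 1 - exp(-E), computed with expm1 for accuracy when E is very small.
    return -math.expm1(-evalue(score, m, n, lam, k))
```

For small E the two quantities are nearly equal, which is why search tools can report either one for strong hits.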
Article
The GenBank sequence database incorporates publicly available DNA sequences of more than 105 000 different organisms, primarily through direct submission of sequence data from individual laboratories and large-scale sequencing projects. Most submissions are made using the BankIt (web) or Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical literature via PubMed. Sequence similarity searching is provided by the BLAST family of programs. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. NCBI also offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the NCBI home page at http://www.ncbi.nlm.nih.gov.
Article
Genomics and proteomics studies routinely depend on homology searches based on the strategy of finding short seed matches which are then extended. The exploding genomic data growth presents a dilemma for DNA homology search techniques: increasing seed size decreases sensitivity whereas decreasing seed size slows down computation. We present a new homology search algorithm 'PatternHunter' that uses a novel seed model for increased sensitivity and new hit-processing techniques for significantly increased speed. At Blast levels of sensitivity, PatternHunter is able to find homologies between sequences as large as human chromosomes, in mere hours on a desktop. PatternHunter is available at http://www.bioinformaticssolutions.com, as a commercial package. It runs on all platforms that support Java. PatternHunter technology is being patented; commercial use requires a license from BSI, while non-commercial use will be free.
Article
The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are hierarchically classified into families, superfamilies, folds and classes. The continual accumulation of sequence and structural data allows more rigorous analysis and provides important information for understanding the protein world and its evolutionary repertoire. SCOP participates in a project that aims to rationalize and integrate the data on proteins held in several sequence and structure databases. As part of this project, starting with release 1.63, we have initiated a refinement of the SCOP classification, which introduces a number of changes mostly at the levels below superfamily. The pending SCOP reclassification will be carried out gradually through a number of future releases. In addition to the expanded set of static links to external resources, available at the level of domain entries, we have started modernization of the interface capabilities of SCOP allowing more dynamic links with other databases. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
Article
The ASTRAL Compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54 745 domains, more than three times as many as the initial release 4 years ago. ASTRAL has undergone major transformations in the past 2 years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database, as well as integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods. ASTRAL may be accessed at http://astral.stanford.edu/.
Article
Extending the single optimized spaced seed of PatternHunter(20) to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman, for homology search. At Blastn speed, PatternHunter II approaches Smith-Waterman sensitivity, bringing homology search methodology research back to a full circle.
Article
Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
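The indexing idea in this abstract can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (exact matches only, a plain dictionary, hypothetical function names), not BLAT's actual data structure. The genome is cut into nonoverlapping K-mers to keep the index small, and every overlapping K-mer of the query is then looked up in it.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each nonoverlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    # Step by k so that indexed k-mers do not overlap.
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(index, query, k):
    """Return (query_pos, genome_pos) pairs for every exact k-mer match.

    The query is scanned with overlapping k-mers so that a match is found
    whatever its offset relative to the genome's k-mer boundaries.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, gpos))
    return hits
```

Indexing only every k-th position shrinks the index by roughly a factor of k compared with indexing every position, which is consistent with the abstract's remark that the index fits in the RAM of inexpensive computers.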
Article
In the eight years since we last examined the amino acid exchanges seen in closely related proteins, the information has doubled in quantity and comes from a much wider variety of protein types. The matrices derived from these data that describe the amino acid replacement probabilities between two sequences at various evolutionary distances are more accurate and the scoring matrix that is derived is more sensitive in detecting distant relationships than the one that we previously derived. The method used in this chapter is essentially the same as that described in the Atlas, Volume 3 and Volume 5. Accepted Point Mutations: An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. It is the result of two distinct processes: the
Article
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.
Article
To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http://scop.mrc-lmb.cam.ac.uk/scop/. scop: an old English poet or minstrel (Oxford English Dictionary); ckon: pile, accumulation (Russian Dictionary).
Article
In this paper, we borrow the idea of the receiver operating characteristic (ROC) from clinical medicine and demonstrate its application to sequence comparison. The ROC includes elements of both sensitivity and specificity, and is a quantitative measure of the usefulness of a diagnostic. The ROC is used in this work to investigate the effects of scoring table and gap penalties on database searches. Studies on three families of proteins, 4Fe-4S ferredoxins, lysR bacterial regulatory proteins, and bacterial RNA polymerase σ-factors, lead to the following conclusions: sequence families are quite idiosyncratic, but the best PAM distance for database searches using the Smith-Waterman method is somewhat larger than predicted by theoretical methods, about 200 PAM. The length-independent gap penalty (gap initiation penalty) is quite important, but shows a broad peak at values of about 20–24. The length-dependent gap penalty (gap extension penalty) is almost irrelevant, suggesting that successful database searches rely only to a limited degree on gapped alignments. Taken together, these observations lead to the conclusion that the optimal conditions for alignments and database searches are not, and should not be expected to be, the same.
Book
Probabilistic models are becoming increasingly important in analyzing the huge amount of data being produced by large-scale DNA-sequencing efforts such as the Human Genome Project. For example, hidden Markov models are used for analyzing biological sequences, linguistic-grammar-based probabilistic models for identifying RNA secondary structure, and probabilistic evolutionary models for inferring phylogenies of sequences from different organisms. This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally to probabilistic methods of sequence analysis. Written by an interdisciplinary team of authors, it is accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of the other fields, and at the same time presents the state of the art in this new and important field.
Article
This issue's expert guest column is by Eric Allender, who has just taken over the Structural Complexity Column in the Bulletin of the EATCS. Regarding "Journals to Die For" (SIGACT News Complexity Theory Column 16), Joachim von zur Gathen, ...
Article
Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.