Basic Local Alignment Search Tool

Abstract and Figures

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
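To make the MSP score in the abstract concrete, the sketch below finds the highest-scoring pair of equal-length, ungapped segments from two sequences by scanning every diagonal of the comparison matrix. It is a brute-force reference computation, not the BLAST heuristic itself. The function name `msp_score` and the match/mismatch scores (+5/-4) are assumptions chosen for this example.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Illustrative only: the scoring values are assumed, and every diagonal is
    scanned exhaustively, which BLAST is designed to avoid.
    """
    best = (0, 0, 0, 0)
    # A diagonal is identified by the offset d = j - i between positions in b and a.
    for d in range(-(len(a) - 1), len(b)):
        i = max(0, -d)
        j = i + d
        run = 0          # score of the best segment ending at the current position
        run_start = i    # where that segment starts in a
        while i < len(a) and j < len(b):
            if run <= 0:
                # A non-positive prefix can never help, so restart the segment here.
                run = 0
                run_start = i
            run += match if a[i] == b[j] else mismatch
            if run > best[0]:
                best = (run, run_start, run_start + d, i - run_start + 1)
            i += 1
            j += 1
    return best
```

With these assumed scores, `msp_score("ACGTACGT", "TTACGTAA")` reports a score of 25, corresponding to five consecutive matching residues.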
... Protein−ligand complexes were prepared with the Protein Preparation Wizard [45] to fix protonation states of amino acids, add hydrogens, and fix missing side-chain atoms. The template structures for homology modeling were identified using BLAST [46], and microtubule affinity ...
Preprint
Salt-inducible kinases 1-3 (SIK1-3) are key regulators of the LKB1-AMPK pathway and play an important role in cellular homeostasis. Dysregulation of any of the three isoforms has been associated with tumorigenesis in liver, breast, and ovarian cancers. We have recently developed the dual pan-SIK/group I p21-activated kinase (PAK) chemical probe MRIA9. However, inhibition of p21-activated kinases has been associated with cardiotoxicity in vivo, which complicates the use of MRIA9 as a tool compound. Here, we present a structure-based approach involving the back-pocket and gatekeeper residues, for narrowing the selectivity of pyrido[2,3-d]pyrimidin-7(8H)-one-based inhibitors towards SIK kinases, eliminating PAK activity. Optimization was guided by high-resolution crystal structure analysis and computational methods, resulting in a pan-SIK inhibitor, MR22, which no longer exhibited activity on STE group kinases and displayed excellent selectivity in a representative kinase panel. MR22-dependent SIK inhibition led to centrosome dissociation and subsequent cell-cycle arrest in ovarian cancer cells, as observed with MRIA9, conclusively linking these phenotypic effects to SIK inhibition. Taken together, MR22 represents a valuable tool compound for studying SIK kinase function in cells.
... Taxonomy was assigned using BLAST and HITdb. The abundance matrices were first filtered and then normalized in R/Bioconductor at each classification level [40][41][42][43]. Blood samples were centrifuged at 4 °C for 15 min to obtain plasma and isolate the buffy coat fraction. ...
Article
Changes in gut microbiota composition and in epigenetic mechanisms have been proposed to play important roles in energy homeostasis, and the onset and development of obesity. However, the crosstalk between epigenetic markers and the gut microbiome in obesity remains unclear. The main objective of this study was to establish a link between the gut microbiota and DNA methylation patterns in subjects with obesity by identifying differentially methylated DNA regions (DMRs) that could be potentially regulated by the gut microbiota. DNA methylation and bacterial DNA sequencing analysis were performed on 342 subjects with a BMI between 18 and 40 kg/m². DNA methylation analyses identified a total of 2648 DMRs associated with BMI, while ten bacterial genera were associated with BMI. Interestingly, only the abundance of Ruminococcus was associated with one BMI-related DMR, which is located between the MACROD2/SEL1L2 genes. The Ruminococcus abundance negatively correlated with BMI, while the hypermethylated DMR was associated with reduced MACROD2 protein levels in serum. Additionally, the mediation test showed that 19% of the effect of Ruminococcus abundance on BMI is mediated by the methylation of the MACROD2/SEL1L2 DMR. These findings support the hypothesis that a crosstalk between gut microbiota and epigenetic markers may be contributing to obesity development.
... Iterative sequence similarity searches using PSI-BLAST were executed for each NUP subunit type (24,43). PSI-BLAST runs reported herein were performed using the non-redundant protein database (nr) and the Entrez Query 'NOT fragment NOT partial', to exclude protein fragments or partial proteins and keep only complete sequences, avoiding low quality or ambiguous information. ...
Article
The nuclear pore complex exhibits different manifestations across eukaryotes, with certain components being restricted to specific clades. Several studies have been conducted to delineate the nuclear pore complex composition in various model organisms. Due to its pivotal role in cell viability, traditional lab experiments, such as gene knockdowns, can prove inconclusive and need to be complemented by a high-quality computational process. Here, using an extensive data collection, we create a robust library of nucleoporin protein sequences and their respective family-specific position-specific scoring matrices. By extensively validating each profile in different settings, we propose that the created profiles can be used to detect nucleoporins in proteomes with high sensitivity and specificity compared to existing methods. This library of profiles and the underlying sequence data can be used for the detection of nucleoporins in target proteomes.
... To identify a template structure that we could use as a reference to generate a 3D model of AQP3, we ran a BLAST [55] on UniProt with the blosum62 matrix. The PDB structure showing the highest amino acid sequence identity with AQP3 (50.2%) was the one with PDB id equal to 6F7H from AQP10 [56]. ...
Article
The natural polyphenolic compound Rottlerin (RoT) showed anticancer properties in a variety of human cancers through the inhibition of several target molecules implicated in tumorigenesis, revealing its potential as an anticancer agent. Aquaporins (AQPs) are found overexpressed in different types of cancers and have recently emerged as promising pharmacological targets. Increasing evidence suggests that the water/glycerol channel aquaporin-3 (AQP3) plays a key role in cancer and metastasis. Here, we report the ability of RoT to inhibit human AQP3 activity with an IC50 in the micromolar range (22.8 ± 5.82 µM for water and 6.7 ± 2.97 µM for glycerol permeability inhibition). Moreover, we have used molecular docking and molecular dynamics simulations to understand the structural determinants of RoT that explain its ability to inhibit AQP3. Our results show that RoT blocks AQP3-glycerol permeation by establishing strong and stable interactions at the extracellular region of AQP3 pores interacting with residues essential for glycerol permeation. Altogether, our multidisciplinary approach unveiled RoT as an anticancer drug against tumors where AQP3 is highly expressed providing new information to aquaporin research that may boost future drug design.
... Two or three different plasmid DNAs of the same Influenza A virus isolate were sequenced to determine the accuracy of the gene sequencing. The gene sequences were then assembled using the Basic Local Alignment Search Tool from the National Center for Biotechnology Information [18]. ...
Article
The rapid identification of Influenza A virus and its variants, which cause severe respiratory diseases, is imperative to providing timely treatment and improving patient outcomes. Conventionally, two separate assays (total test duration of up to 6 h) are required to initially differentiate Influenza A and B viruses and subsequently distinguish the pdm H1N1 and H3N2 serotypes of Influenza A virus. In this study, we developed a multiplex real-time RT-PCR method for simultaneously detecting Influenza A and B viruses and subtyping Influenza A virus, with a substantially reduced test duration. Clinical specimens from hospitalized patients and outpatients with influenza-like symptoms in Eastern Taiwan were collected between 2011 and 2015, transported to Hualien Tzu Chi Hospital, and analyzed. Conventional RT-PCR was used to subtype the isolated Influenza A viruses. Thereafter, for rapid identification, the multiplex real-time RT-PCR method was developed and applied to identify the conserved regions that aligned with the available primers and probes. Accordingly, a multiplex RT-PCR assay with three groups of primers and probes (MAF and MAR primers and MA probe; InfAF and InfAR primers and InfA probe; and MBF and MBR primers and MB probe) was established to distinguish these viruses in the same reaction. Thus, with this multiplex RT-PCR assay, Influenza B, Influenza A pdm H1N1, and Influenza A H3N2 viruses were accurately detected and differentiated within only 2.5 h. This multiplex RT-PCR assay showed similar analytical sensitivity to the conventional singleplex assay. Further, the phylogenetic analyses of our samples revealed that the characteristics of these viruses were different from those reported previously using samples collected during 2012-2013. 
In conclusion, we developed a multiplex real-time RT-PCR method for highly efficient and accurate detection and differentiation of Influenza A and B viruses and subtyping Influenza A virus with a substantially reduced test duration for diagnosis.
... Sequences identified as telomeres were discarded, and sequences identified as ORFs, structural RNAs and nonconserved sequences; target motifs along with 300 bp of flanking sequences on each side were recovered and saved in a local database set. Each sequence was then aligned to all genomes of the Ustilaginales taxon group available on April 30, 2019, using blastn option of BLAST [40] software of NCBI. The highest-scoring single loci conserved in all organisms were sought, and 21 candidates were obtained. ...
Article
The RNA subunit of telomerase is an essential component whose primary sequence and length are poorly conserved among eukaryotic organisms. The phytopathogen Ustilago maydis is a dimorphic fungus of the order Ustilaginales. We analyzed several species of Ustilaginales to computationally identify the TElomere RNA (TER) gene ter1. To confirm the identity of the TER gene, we disrupted the gene and characterized telomerase-negative mutants. Similar to catalytic TERT mutants, ter1Δ mutants exhibit phenotypes of growth delay, telomere shortening and low replicative potential. ter1-disrupted mutants were unable to infect maize seedlings in heterozygous crosses and showed defects such as cell cycle arrest and segregation failure. We concluded that ter1, which encodes the TER subunit of the telomerase of U. maydis, has functions similar to, and perhaps more extensive than, those of trt1.
Preprint
Most microbes evolve faster than their hosts and should therefore drive evolution of host-microbe interactions [1–3]. However, relatively little is known about the characteristics that define the adaptive path of microbes to host-association. In this study we have identified microbial traits that mediate adaptation to hosts by experimentally evolving the bacterium Pseudomonas lurida with the nematode Caenorhabditis elegans. We repeatedly observed the evolution of beneficial host-specialist bacteria with improved persistence in the nematode, achieved by mutations that uniformly upregulate the universal second messenger c-di-GMP. We subsequently upregulated c-di-GMP in different Pseudomonas species, consistently causing increased host-association. Comparison of Pseudomonad genomes from various environments revealed that c-di-GMP underlies adaptation to a variety of hosts, from plants to humans, suggesting that it is fundamental for establishing host-association.
Preprint
Though the phylogenetic signal of loci on sex chromosomes can differ from those on autosomes, chromosomal-level genome assemblies for non-vertebrates are still relatively scarce and conservation of chromosomal gene content across deep phylogenetic scales has therefore remained largely unexplored. We here assemble a uniquely large and diverse set of samples (17 Anchored Hybrid Enrichment [AHE], 24 RNA-Seq, and 70 whole-genome sequencing [WGS] samples of variable depth) for the medically important assassin bugs (Reduvioidea). We assess the performance of genes based on multiple features (e.g., nucleotide vs. amino acid, nuclear vs. mitochondrial, and autosomal vs. X chromosomal) and employ different methods (concatenation and coalescence analyses) to reconstruct the unresolved phylogeny of this diverse (~7,000 spp.) and old (>180 MYA) group. Our results show that genes on the X chromosome are more likely to have discordant phylogenies than those on autosomes. We find that the X chromosome conflict is driven by high gene substitution rates that impact accuracy of phylogenetic inference. However, gene tree clustering showed strong conflict even after discounting variable third codon positions. Alternative topologies were not particularly enriched for sex chromosome loci, but spread across the genome. We conclude that binning genes to autosomal or sex chromosomes may result in a more accurate picture of the complex evolutionary history of a clade.
Chapter
The analysis of the relationship between sequence and structure similarities during the evolution of a protein family has revealed a limit of sequence divergence for which structural conservation can be confidently assumed and homology modeling is reliable. Below this limit, the twilight zone corresponds to sequence divergence for which homology modeling becomes increasingly difficult and requires specific methods. Either with conventional threading methods or with recent deep learning methods, such as AlphaFold, the challenge relies on the identification of a template that shares not only a common ancestor (homology) but also a conserved structure with the query. As both homology and structural conservation are transitive properties, mining of sequence databases followed by multidimensional scaling (MDS) of the query sequence space can reveal intermediary sequences to infer homology and structural conservation between the query and the template. Here, as a case study, we studied the plethodontid receptivity factor isoform 1 (PRF1) from Plethodon jordani, a member of a pheromone protein family present only in lungless salamanders and weakly related to cytokines of the IL6 family. A variety of conventional threading methods led to the cytokine CNTF as a template. Sequence mining, followed by phylogenetic and MDS analysis, provided missing links between PRF1 and CNTF and allowed reliable homology modeling. In addition, we compared automated models obtained from web servers to a customized model to show how modeling can be improved by expert information.
Key words: Molecular modeling; Threading; Twilight zone; Profile-profile mining; Cytokine; Plethodontid receptivity factor
Preprint
Purpose: Sophora flavescens is a medicinal plant in the genus Sophora of the Fabaceae family. The root of S. flavescens is known in China as Kushen and has a long history of wide use in multiple formulations of Traditional Chinese Medicine (TCM). However, there is little genomic information available for S. flavescens. Methods: In this study, we used third-generation Nanopore long-read sequencing technology combined with Hi-C scaffolding technology to de novo assemble the S. flavescens genome. Results: We obtained a chromosomal level high-quality S. flavescens draft genome. The draft genome size is approximately 2.08 Gb, with more than 80% annotated as Transposable Elements (TEs), which have recently and rapidly proliferated. This genome size is ~5x larger than its closest sequenced relative Lupinus albus L. We annotated 60,485 genes and examined their expression profiles in leaf, stem and root tissues, and also characterised the genes and pathways involved in the biosynthesis of major bioactive compounds, including alkaloids, flavonoids and isoflavonoids. Conclusion: The assembled genome highlights the very different evolutionary trajectories that have occurred in recently diverged Fabaceae, leading to smaller duplicated genomes vs larger genomes resulting from TE expansion. Our assembly provides valuable resources for conservation, genetic research and breeding of S. flavescens.
Article
Sequence analysis of protein and nucleic acid databases by exhaustive string-matching algorithms is effectively implemented on large processor-array machines, such as the I.C.L. DAP. An improved method of assessing the significance of the best alignments for proteins is described. Examples involving the cystic fibrosis antigen and Drosophila vitellogenins illustrate the usefulness of this approach.
Article
We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
Article
This paper gives a formal definition of the biological concept of evolutionary distance and an algorithm to compute it. For any set S of finite sequences of varying lengths this distance is a real-valued function on $S \times S$, and it is shown to be a metric under conditions which are wide enough to include the biological application. The algorithm, introduced here, lends itself to computer programming and provides a method to compute evolutionary distance which is shorter than the other methods currently in use.
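A distance of this kind can be illustrated with a small dynamic-programming routine. The sketch below computes a unit-cost edit distance (one unit per substitution, insertion, or deletion), which is one simple instance of a sequence metric; the unit costs and the name `edit_distance` are assumptions for illustration, since the abstract does not state the weights used in the paper.

```python
def edit_distance(s, t):
    """Unit-cost edit distance between sequences s and t (assumed cost model)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t: i deletions
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if cs == ct else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table holds memory proportional to the length of `t`, while the running time remains proportional to the product of the two lengths.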
Article
Dynamic Monte Carlo studies have been performed on various diamond lattice models of β-proteins. Unlike previous work, no bias toward the native state is introduced; instead, the protein is allowed to freely hunt through all of phase space to find the equilibrium conformation. Thus, these systems may aid in the elucidation of the rules governing protein folding from a given primary sequence; in particular, the interplay of short- vs long-range interaction can be explored. Three distinct models (A–C) were examined. In model A, in addition to the preference for trans (t) over gauche states (g+ and g−) (thereby perhaps favoring β-sheet formation), attractive interactions are allowed between all nonbonded, nearest neighbor pairs of segments. If the molecules possess a relatively large fraction of t states in the denatured form, on cooling spontaneous collapse to a well-defined β-barrel is observed. Unfortunately, in model A the denatured state exhibits too much secondary structure to correctly model the globular protein collapse transition. Thus in models B and C, the local stiffness is reduced. In model B, in the absence of long-range interactions, t and g states are equally weighted, and cooperativity is introduced by favoring formation of adjacent pairs of nonbonded (but not necessarily parallel) t states. While the denatured state of these systems behaves like a random coil, their native globular structure is poorly defined. Model C retains the cooperativity of model B but allows for a slight preference of t over g states in the short-range interactions. Here, the denatured state is indistinguishable from a random coil, and the globular state is a well-defined β-barrel. Over a range of chain lengths, the collapse is well represented by an all-or-none model. Hence, model C possesses the essential qualitative features observed in real globular proteins.
These studies strongly suggest that the uniqueness of the globular conformation requires some residual secondary structure to be present in the denatured state.
Article
The theoretical basis of sequential circuit synthesis is developed, with particular reference to the work of D. A. Huffman and E. F. Moore. A new method of synthesis is developed which emphasizes formal procedures rather than the more familiar intuitive ones. Familiarity is assumed with the use of switching algebra in the synthesis of combinational circuits.
Article
Mathematical methods for comparison of nucleic acid sequences are reviewed. There are two major methods of sequence comparison: dynamic programming and a method referred to here as the regions method. The problem types discussed are comparison of two sequences, location of long matching segments, efficient database searches and comparison of several sequences.
Article
A new development is introduced here in the use of dynamic programming in finding pattern similarities in genetic sequences, as was first done by Needleman and Wunsch (1969). A condition of pattern similarity is defined and an algorithm is given which scans any set of similarities and screens out those which fail to meet the condition. When the set to be scanned contains every pair of segments, one from each of two given sequences of lengths m and n (i.e. every possible location for a pattern similarity), then it completes the scan in a number of computational steps proportional to m·n, leaving those pairs of segments which satisfy the similarity condition. The algorithm is based on the concept of match density, as suggested by Goad and Kanehisa (1982).
Article
Homology and distance measures have been routinely used to compare two biological sequences, such as proteins or nucleic acids. The homology measure of Needleman and Wunsch is shown, under general conditions, to be equivalent to the distance measure of Sellers. A new algorithm is given to find similar pairs of segments, one segment from each sequence. The new algorithm, based on homology measures, is compared to an earlier one due to Sellers.
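The idea of finding the most similar pair of segments, one from each sequence, can be sketched as a local-alignment dynamic program in which any cell may restart from zero. The version below returns only the best score and uses a linear gap penalty; the name `local_alignment_score` and the scoring values (match 2, mismatch -1, gap -2) are assumptions for illustration and are not taken from the abstract.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b under an assumed scoring scheme."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # Clamping at zero lets an alignment begin anywhere in either sequence.
            cell = max(0, diagonal, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Recovering the aligned segments themselves would require keeping the full table (or traceback pointers) rather than two rows, which this sketch omits for brevity.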