This paper describes a computer program designed to look for similarities between pairs of nucleic or amino acid sequences.
The program looks either for segments of perfect identity or for regions where, using a scoring matrix, a minimum value is exceeded.
The results of comparisons are presented as a matrix which is displayed on a simple graphics terminal. Use of a graphics terminal
allows the user to display the whole of the two sequences in one screenful or to home in on regions of interest to examine
them in more detail. The program is interactive and so the user can easily see the effect of changes to variables and can
use inbuilt editing functions to make insertions to produce alignments of the two sequences. These aligned sequences can then
be saved on disk files for further processing.
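The comparison described in this abstract can be sketched as follows. This is a minimal illustration, not Staden's implementation: the function name `dot_matrix`, the default identity scoring, and the use of "greater than or equal to" as the threshold test are all assumptions made for the example.

```python
def dot_matrix(seq_a, seq_b, span=11, min_score=8, score=None):
    """Return the set of (i, j) window centres whose summed score reaches min_score.

    `score` is a hypothetical callable score(x, y) standing in for a scoring
    matrix; by default it counts identities (1 for a match, 0 otherwise).
    `span` is assumed to be odd so that each window has a centre.
    """
    if score is None:
        score = lambda x, y: 1 if x == y else 0
    half = span // 2
    dots = set()
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            # Sum the pairwise scores along the diagonal window centred on (i, j).
            total = sum(score(seq_a[i + k], seq_b[j + k])
                        for k in range(-half, half + 1))
            if total >= min_score:
                dots.add((i, j))
    return dots
```

Setting `min_score` equal to `span` with identity scoring reports only segments of perfect identity, while a lower threshold or a residue-similarity scoring function reports approximate matches.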
... The Dot-Plot method (Gibbs & McIntyre, 1970; Staden, 1982) can be used to produce illustrative two-dimensional representations of 'long-range' homologies. In addition, these plots can be used to investigate the occurrence of internal repeats. ...
... This peptide spans a putative protein kinase A site and is found near the N terminus of gelsolin (Kwiatkowski et al, 1986), in transgelin, NP25, mp20, three times in calponin-a, and seven FIGURE 21. The diagonal dot-plot method (Gibbs & McIntyre, 1970; Staden, 1982) was used to graphically compare proteins using the Gene Jockey program. A window size of eight residues (filter function 4-7) was used throughout, with C4^ residues 1-199 shown along the x-axis in all cases. ...
... Dot matrix plots are an effective way of analysing amino acid sequences and showing a relationship between two proteins (Gibbs & McIntyre, 1970; Staden, 1982). ...
The actin network is crucial for many cellular events such as cell movement, phagocytosis, cell division and movement of cell surface receptors. Control of the actin network lies with a large number of actin-associated proteins that regulate the polymerisation status, interaction and geometry of actin. Protein C4 is an actin-associated polypeptide doublet described by Shapland et al (1988; 1993). C41 is the only protein C4 isoform present in motile cells such as lymphocytes and transformed mesenchymal cells, and the objectives of this work were to see if C41 is associated with actin filament bundles in transformed cells and to characterise the molecule at the level of the gene. My initial studies have shown that C41 is associated with vestigial actin filament bundles found in transformed mesenchymal cells. To further investigate C41 I have purified the molecule to more than 90% homogeneity from human T cell lymphoma (HTCL), and then used degenerate oligonucleotides to obtain internal C41 sequence from reverse-transcribed HTCL mRNA (RT-PCR). C41 gene-specific oligonucleotides were used to obtain both the 5' end (anchor-PCR) and 3' end (3' rapid amplification of cDNA ends) from reverse-transcribed HTCL mRNA. Cloning and sequencing of these overlapping PCR products gives the full-length coding sequence for C41 from HTCL. The translated product of the C41 open reading frame is 199 amino acids in length, with a calculated molecular weight of 22,381 Da and an estimated pI of 8.09. Northern blotting reveals that C41 is expressed as a single message of 1.44 kb which is apparently up-regulated in oncogenically transformed lymphocytes. Database searches indicate that C41 is a previously undescribed molecule.
Regions of homology, including a potential actin-binding domain and phosphorylation sites, which are shared between C41 and other proteins such as rat transgelin (Prinjha et al, 1994), chicken SM22α (Pearlstone et al, 1987), rat NP25 (unpublished), chicken calponins alpha and beta (Takahashi and Nadal-Ginard, 1991), and Drosophila mp20 (Ayme-Southgate et al, 1989) suggest that all of these proteins may be classified as members of a new transgelin-like multigene family.
... The principal advantages of the second approach are that there is no need to solve the problem of estimating the penalty for introducing gaps into the aligned sequences, and that repeats in the structure of biopolymers can be detected. At the same time, the program most widely used among researchers, DIAGON, written by Staden, and its modifications, which are used to build maps of local similarity, are not free of substantial shortcomings. These stem from two comparison parameters whose values are fixed in advance: the size of the compared fragments (the so-called "comparison window" size) and the level (magnitude) of fragment similarity. ...
... It is more objective, although more laborious, to measure the similarity of fragments not by its power but by the probability that such similarity arises by chance in a pair of random, mutually independent Bernoulli sequences of the same length as the compared biopolymers and with the same letter (monomer) frequencies; the smaller this probability (P), the greater the similarity. For example, the quantity lg P can be taken as the measure of similarity [1,2]. For such Bernoulli sequences, the similarity of the compared fragments measured by power is a random variable, and as the fragment length L increases, its distribution function approaches the standard normal distribution function N(0, 1). ...
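As a rough illustration of the probability-based measure in this excerpt, the sketch below assumes that the "power" of a fragment pair is simply the number of identical positions, so that under the Bernoulli model it follows a binomial distribution. The function names and this choice of score are assumptions for the example and are not taken from the cited work.

```python
import math
from collections import Counter


def match_probability(seq_a, seq_b):
    """Chance that two independent random letters match, given each sequence's letter frequencies."""
    freq_a, freq_b = Counter(seq_a), Counter(seq_b)
    return sum((freq_a[x] / len(seq_a)) * (freq_b[x] / len(seq_b)) for x in freq_a)


def log10_chance_probability(k, length, p):
    """lg P, where P is the binomial probability of at least k matches in a window of `length`."""
    tail = sum(math.comb(length, m) * p**m * (1 - p) ** (length - m)
               for m in range(k, length + 1))
    return math.log10(tail)


def z_score(k, length, p):
    """Normalised match count, approximately N(0, 1) for large window lengths."""
    return (k - length * p) / math.sqrt(length * p * (1 - p))
```

A strongly negative `log10_chance_probability` corresponds to a small P and therefore to greater similarity, which matches the ordering described in the excerpt.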
... However, the running times apparently differ sharply. Because the Altschul-Erickson algorithm exhaustively compares all possible pairs of fragments of equal length, its work is essentially equivalent to the total work of Staden's algorithm over all possible "comparison window" lengths. Its running time can be estimated on this basis. ...
... DNA sequence data was compiled and analyzed on a VAX 11-785 computer (Digital Equipment Corp., Marlboro, MA) using the programs ANALYSEQ (Staden, 1984) and DIAGON (Staden, 1982). Secondary structure analyses were obtained using the PREDICT program, which is a modified version of the joint prediction program of Eliopoulos et al. (1982). ...
... The predicted amino acid sequence of ~.sTrl was analyzed for similar internal sequences using the computer program DIAGON (Staden, 1982) with a window length of 23. The output of this comparison was then plotted. ...
Trichohyalin is a highly expressed protein within the inner root sheath of hair follicles and is similar, or identical, to a protein present in the hair medulla. In situ hybridization studies have shown that trichohyalin is a very early differentiation marker in both tissues and that in each case the trichohyalin mRNA is expressed from the same single copy gene. A partial cDNA clone for sheep trichohyalin has been isolated and represents approximately 40% of the full-length trichohyalin mRNA. The carboxy-terminal 458 amino acids of trichohyalin are encoded, and the first 429 amino acids consist of full- or partial-length tandem repeats of a 23 amino acid sequence. These repeats are characterized by a high proportion of charged amino acids. Secondary structure analyses predict that the majority of the encoded protein could form alpha-helical structures that might form filamentous aggregates of intermediate filament dimensions, even though the heptad motif obligatory for the intermediate filament structure itself is absent. The alternative structural role of trichohyalin could be as an intermediate filament-associated protein, as proposed from other evidence.
... Normal β contains the sequence EPDPGM repeated at residues 17-22 and 36-41 (see Fig. 4 below). Since repeated sequences in the DNA could give rise to internal deletions, the sequence of β mRNA was examined for regions of internal homology by using the DIAGON method of Staden (1982). Fig. 3(a) shows the presence of three regions of extended homology in the DNA sequence. ...
... Regions of internal homology in β. (a) DIAGON plot (Staden, 1982) of a portion of β cDNA against itself. Horizontal and vertical axes show nucleotide numbers of part of the coding region of sialoglycoprotein β cDNA. ...
We have studied the DNA of individuals who express an altered sialoglycoprotein beta on their red cells by using Southern blotting with sialoglycoprotein-beta cDNA probes. Individuals of the Leach phenotype do not express any beta (sialoglycoprotein beta) or gamma (sialoglycoprotein gamma) on their red cells, and we show that about 7 kb of DNA, including the 3′ end of the beta gene, is deleted in this DNA. Any protein product of this gene is likely to lack the membrane-associating domain of beta. We have also examined the DNA of two types of other individuals (Yus-type and Gerbich-type) who have red cells that lack beta and gamma, but contain abnormal sialoglycoproteins related to beta. These two types of DNA contain different internal deletions of about 6 kb in the beta gene. We suggest that these deletions result from the presence of two different sets of internal homology in the beta gene, and on this basis we propose structures for the abnormal Yus-type and Gerbich-type sialoglycoproteins which are consistent with the other evidence that is available. We provide evidence that beta and gamma are products of the same gene and suggest a possible mechanism for the origin of gamma based on leaky initiation of translation of beta mRNA.
... Regions of homology between NrcA protein (vertical axis) and related cyclins (horizontal axis) were compared by the DIAGON analysis program using a span length of 21 and a score of 230 (Staden, 1982). The number of amino acid residues of each protein is indicated and the axes are scaled in proportion to the size of each protein. ...
I expressed Xenopus laevis cyclin A1 in the cellular slime mould Dictyostelium discoideum to investigate the currently unclear role of A-type cyclins. Expression of this cyclin in the budding yeast Saccharomyces cerevisiae and the fission yeast Schizosaccharomyces pombe causes a cell cycle arrest. However, overexpression of wild-type or an 'indestructible' form of the protein had no effect upon the growth of Dictyostelium vegetative cells or their subsequent development when starved. This may be explained by the inability of Xenopus cyclin A1 to bind to the two cloned Dictyostelium cyclin dependent kinases (CDKs) in vitro. A single B-type cyclin has been previously cloned in Dictyostelium. In this thesis, I describe the identification of two new cyclins from this organism. I constructed a cDNA library from growing Dictyostelium amoebae and then isolated clones able to complement the defect of the (cln1, cln2, cln3) strain of S. cerevisiae. The rescuing clones fell into three classes. One encoded the already characterised B-type cyclin. The second showed extensive homology to mitotic cyclins from other species and also contained large runs of asparagine repeats in non-conserved regions of the protein. The final class of clone showed weak homology to the S. cerevisiae Cln3, Pho80 and Pell cyclin-like proteins. Antibodies were raised against the mitotic cyclin protein overexpressed in bacteria. A polypeptide of the correct size was recognised in Dictyostelium extracts. The cyclin had no effect upon growth or development when overexpressed in Dictyostelium cells in vivo in either a wild-type or an N-terminally truncated form.
... Sequence comparisons. RNA and protein sequences were compared using the Diagon program (Staden, 1982). A scan window of 15 or 21 was used with a record registered for proportional matching scores of 11 or 236 for nucleotide and amino acid sequences respectively. ...
... As graphic displays were becoming affordable during the 1980s, interactive graphical visualisation tools started proliferating into bioinformatics, such as within the Staden (Staden 1982, 1984; Gleeson and Staden 1991) and WHAT IF (Vriend 1990) toolkits. While at the time of the first publication GCG offered graphics only as output printed by plotters (Devereux et al. 1984), graphical output on displays became available soon after. ...
The aim of the presented work was contributing to making scientific computing more accessible, reliable, and thus more efficient for researchers, primarily computational biologists and molecular biologists. Many approaches are possible and necessary towards these goals, and many layers need to be tackled, in collaborative community efforts with well-defined scope. As diverse components are necessary for the accessible and reliable bioinformatics scenario, our work focussed in particular on the following:
In the BioXSD project, we aimed at developing an XML-Schema-based data format compatible with Web services and programmatic libraries, that is expressive enough to be usable as a common, canonical data model that serves tools, libraries, and users with convenient data interoperability.
The EDAM ontology aimed at enumerating and organising concepts within bioinformatics, including operations and types of data. EDAM can be helpful in documenting and categorising bioinformatics resources using a standard “vocabulary”, enabling users to find respective resources and choose the right tools.
The eSysbio project explored ways of developing a workbench for collaborative data analysis, accessible in various ways for users with various tasks and expertise. We aimed at utilising the World-Wide-Web and industrial standards, in order to increase compatibility and maintainability, and foster shared effort.
In addition to these three main contributions that I have been involved in, I present a comprehensive but non-exhaustive research into the various previous and contemporary efforts and approaches to the broad topic of integrative bioinformatics, in particular with respect to bioinformatics software and services. I also mention some closely related efforts that I have been involved in.
The thesis is organised as follows: In the Background chapter, the field is presented, with various approaches and existing efforts. Summary of results summarises the contributions of my enclosed projects – the BioXSD data format, the EDAM ontology, and the eSysbio workbench prototype – to the broad topics of the thesis. The Discussion chapter presents further considerations and current work, and concludes the discussed contributions with alternative and future perspectives.
In the printed version, the three articles that are part of this thesis are attached after the Discussion and References. In the electronic version (in PDF), the main thesis is optimised for reading on a screen, with clickable cross-references (e.g. from citations in the text to the list of References) and hyperlinks (e.g. for URLs and most References). A PDF viewer with a “back” function is recommended.
... Hence the adjacent restriction fragment was subcloned and sequenced to give the whole open reading frame (Fig. 1). Although this has plausible transcription and translation signals that match the E. coli consensus (see the legend to Fig. 1), the N-terminal sequence does not correspond to that determined for the purified D-xylose isomerase, and no similarity was observed when the DNA sequence from nucleotide residue -36 to 561 was compared with six different D-xylose isomerase gene sequences (Actinoplanes missouriensis, Ampullariella sp., Bacillus subtilis, E. coli, Streptomyces violaceoniger and Arthrobacter sp.) by the proportional algorithm of the Staden program (Staden, 1982) with a span length of 11 and a proportional score of 8. ...
Arthrobacter strain N.R.R.L. B3728 superproduces a D-xylose isomerase that is also a useful industrial D-glucose isomerase. The gene (xylA) that encodes it has been cloned by complementing a xylA mutant of the ancestral strain, with the use of a shuttle vector. The 5' region shows strong sequence similarity to Escherichia coli consensus promoters and ribosome-binding sequences and allows high levels of expression in E. coli. The coding sequence shows similarity to those for other D-xylose isomerases and is followed by 22 nucleotide residues with stop codons in each reading frame, a good 'consensus' ribosome-binding site and an open reading frame showing similarity to those of known D-xylulokinases (xylB). Studies on the expression of the cloned gene in Arthrobacter and in E. coli suggest that the two genes are part of a xyl operon regulated by a repressor that is defective in strain B3728. Codon usage in these two genes, and in another open reading frame (nxi) that was adventitiously isolated during early cloning attempts, shows some characteristic omissions and a strong G + C preference in redundant positions.
... Reducing the computation time is therefore a pressing problem. The cited work proposes an algorithm (we call it algorithm 0) that traverses the matrix along its diagonals and counts the nucleotides entering and leaving the window. The complexity of this algorithm is $\sim L_1 L_2$. ...
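The diagonal traversal mentioned in this excerpt can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the cited algorithm itself: the function name `diagonal_window_counts` is hypothetical, the score is assumed to be a plain identity count, and windows are indexed by their start positions.

```python
def diagonal_window_counts(seq_a, seq_b, window):
    """Map each window start (i, j) to the number of identical positions
    between seq_a[i:i+window] and seq_b[j:j+window].

    Each diagonal is scored once in full and then updated incrementally:
    the pair leaving the window is subtracted and the pair entering it is
    added, so the total work is proportional to len(seq_a) * len(seq_b).
    """
    n, m = len(seq_a), len(seq_b)
    counts = {}
    for d in range(-(n - window), m - window + 1):  # d = j - i labels a diagonal
        i, j = (-d, 0) if d < 0 else (0, d)
        if i + window > n or j + window > m:
            continue  # a sequence is shorter than the window
        c = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
        counts[(i, j)] = c
        while i + window < n and j + window < m:
            c -= seq_a[i] == seq_b[j]                    # pair leaving the window
            c += seq_a[i + window] == seq_b[j + window]  # pair entering the window
            i += 1
            j += 1
            counts[(i, j)] = c
    return counts
```

Compared with rescoring every window from scratch, the incremental update removes the factor of the window length from the running time, which is consistent with the $L_1 L_2$ cost quoted in the excerpt.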
... Research into computational methods for DNA sequencing has been ongoing since the advent of sequencing methods. Starting as early as the late 1970s, Rodger Staden published a series of papers describing software programs designed to analyze and manipulate sequence data (Staden 1986, 1984a, 1984b, 1982a, 1982b, 1980, 1979, 1978, 1977). Recent work by various researchers has progressed into methods that use artificial intelligence, genetic algorithms, and examination of fluorescent trace data. ...
... Many filters are possible (it only takes some imagination), but the most common one places a dot in cell (i, j) if a window of 10 bases centred on (i, j) contains more than 6 positive matches. Another way of filtering the results is to weight them according to their chemical similarity (Staden 1982). Regardless of the filter, this method requires the construction of an m×n matrix and therefore grows with the product of the sequence lengths (O(N²)). ...
... Nucleotide sequences on material subcloned in bacteriophages M13mp18 and M13mp19 were determined with the dideoxy chain terminator method of Sanger et al. (18) as modified by Biggin et al. (19). The nucleotide sequences were aligned and analyzed by computer using programs provided by Staden (20,21). ...
The predominant protein in human semen, semenogelin, was characterized by λgt11 clones isolated from a seminal vesicular cDNA library. One clone, carrying a cDNA insert of 1606 nucleotides and a polyadenylated tail, coded for the entire semenogelin precursor. An open reading frame of 1386 nucleotides encodes a signal peptide and the mature protein of 439 amino acid residues, in which residues 85–136 are identical with a previously characterized semenogelin fragment. The polypeptide chain displays a most conspicuous region of internal sequence homology where 46 of the 58 amino acid residues at positions 259–316 are repeated at positions 319–376. An abundant seminal vesicular transcript of 1.8 kilobases (kb) codes for semenogelin. Two additional transcripts, one seminal vesicular 2.2-kb species and one epididymal 2.0-kb species, code for related proteins that have a close structural relationship as well as antigenic epitopes in common with semenogelin. Semenogelin and the semenogelin-related proteins are the major proteins involved in the gelatinous entrapment of ejaculated spermatozoa. Antigenic epitopes common to these proteins are localized to the parts of the spermatozoa involved in locomotion. The spermatozoa become progressively motile as the gel-forming proteins are fragmented by the kallikrein-like protease, prostate-specific antigen, and the gel dissolves.
... The comparison was perprogram BESTFIT (41). (57). The odd span length parameter was 25, the score parameter was 300, and the percent score mode was used. ...
Six alpha 2-macroglobulin cDNA clones were isolated from two liver cDNA libraries produced from rats undergoing acute inflammation. The coding sequence for rat alpha 2-macroglobulin including its 27-residue signal peptide and the 3' and part of the 5' nontranslated regions were determined. The mature protein consisting of 1445 amino acids is coded for by a 4790 +/- 40 nucleotide messenger RNA. It contains a typical internal thiol ester region and 25 cysteine residues which are conserved between rat and human alpha 2-macroglobulin. Although the amino acid sequences of rat and human alpha 2-macroglobulin share 73% identity, two small divergent areas of 17 and 38 residues were found, corresponding to 29 and 11% identity, respectively. These areas are located in the bait region and, therefore, may confer specific proteinase recognition capabilities on rat alpha 2-macroglobulin. Following an inflammatory stimulation, rat alpha 2-macroglobulin mRNA levels increased 214-fold over control values and reached a maximum at 18 h. By 24 h the levels had decreased to less than 30% of the maximum value. Transcription rates from the alpha 2-macroglobulin gene as measured in nuclear run-on experiments showed a less than 3-fold increase in nuclei from acutely inflamed rats as compared to controls. These results suggest that the accumulation of alpha 2M mRNA is due to the combined effects of increased transcription rates and post-transcriptional processing.
... We use global pooling to allow variable-sized input of the convolutional neural network to lead to a fixed-sized class prediction. The key of our approach is to encode the secondary structure in the pairing matrix format (also known as dot plot, see [8,12,30,41]). In the pairing matrix, we specify the minimum free energy of the sequence and the interactions between nucleotides directly. ...
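To make the pairing matrix idea concrete, the sketch below builds a binary matrix from a secondary structure written in dot-bracket notation. It is only an assumed simplification: the excerpt says the authors also encode the minimum free energy, which is not modelled here, and the function name `pairing_matrix` is hypothetical.

```python
def pairing_matrix(dot_bracket):
    """Return an n x n matrix with 1 at (i, j) and (j, i) when positions i and j are paired."""
    n = len(dot_bracket)
    matrix = [[0] * n for _ in range(n)]
    open_positions = []  # stack of unmatched "(" positions
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: unexpected ')'")
            i = open_positions.pop()
            matrix[i][j] = matrix[j][i] = 1
    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")
    return matrix
```

Because the matrix side equals the sequence length, inputs of different lengths give matrices of different sizes, which is why a pooling step is needed before a fixed-size prediction.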
Precursor microRNA (pre-miRNA) identification is the basis for identifying microRNAs (miRNAs), which have important roles in post-transcriptional regulation of gene expression. In this paper, we propose a deep learning method to identify whether a small non-coding RNA sequence is a pre-miRNA or not. We outperform state-of-the-art methods on three benchmark datasets, namely the human, cross-species, and new datasets. The key of our method is to use a matrix representation of predicted secondary structure as input to a 2D convolutional network. The neural network extracts optimized features automatically instead of using the enormous number of handcrafted features as most existing methods do. Code and results are available at https://github.com/peace195/miRNA-identification-conv2D.
The tenet of structural biology that function follows form had its seeds in the monograph by C. B. Anfinsen, The Molecular Basis of Evolution (Anfinsen, 1959), wherein he stated “Protein chemists naturally feel that the most likely approach to the understanding of cellular behavior lies in the study of structure and function of protein molecules.” The achievement of protein crystallography over the past 30 years has confirmed this view whereby the description of the structure and function of proteins is now frequently understood at the atomic level.
“Here you are sitting in front of a computer terminal, probably attached to some VAX machine or PC. You have just entered into a file your newly determined nucleotide sequence and are now wondering which theoretical methods and associated computer programs you should use to glean structural and functional information from the mere sequence of letters.”
Since the beginning of big genome sequencing, initiated by the work on the nematode Caenorhabditis elegans, the Staden group has concentrated on developing methods to increase the efficiency of these large-scale projects. In the course of this, we have designed and implemented a sophisticated and intuitive graphical user interface for use in our programs GAP4 and PREGAP4. This interface has also been used in our sequence analysis program SPIN, but as it has not been the main focus of our efforts, SPIN is still limited in the number and variety of the functions it contains. The EMBOSS project was initiated to provide a comprehensive set of sequence analysis tools that would be available free to all and has made rapid progress towards this goal. However, it did not have a graphical user interface and this limited its usefulness. It was felt that the combination of SPIN and EMBOSS would provide a powerful package.
We have cloned the human genomic DNA and the corresponding cDNA for the gene which complements the mutation of tsBN51, a temperature-sensitive (Ts) cell cycle mutant of BHK cells which is blocked in G1 at the nonpermissive temperature. After transfecting human DNA into TsBN51 cells and selecting for growth at 39.5 degrees C, Ts+ transformants were identified by their content of human AluI repetitive DNA sequences. Following two additional rounds of transfection, a genomic library was constructed from a tertiary Ts+ transformant and a recombinant phage containing the complementing gene isolated by screening for human AluI sequences. A genomic probe from this clone recognized a 2-kilobase mRNA in human and tertiary transformant cell lines, and this probe was used to isolate a biologically active cDNA from the Okayama-Berg cDNA expression library. Sequencing of this cDNA revealed a single open reading frame encoding a polypeptide of 395 amino acids. The deduced BN51 gene product has a high proportion of acidic and basic amino acids which are clustered in four hydrophilic domains spaced at 60- to 80-amino-acid intervals. These domains have strong sequence homology to each other. Thus, the tsBN51 protein consists of periodic repetitive clusters of acidic and basic amino acids.
Before we can discuss comparative gene finding in this chapter we go through some of the basic theory behind sequence alignment. The chapter is divided into two parts. In the first part we describe the basic concepts of pairwise alignments, including substitution schemes and gap models, and move on to the application of dynamic programming to global and local alignments. We finish off by giving an overview of heuristic database searches and the statistical foundation they rest upon. The extension of dynamic programming to multiple alignments is complicated by the increased computational complexity. As a result, a flora of heuristic alignment algorithms have evolved. In the second part of the chapter we give an account of the most common of these models, including progressive alignments, iterative methods, hidden Markov models, genetic algorithms, simulated annealing, and alignment profiles.
In both clam oocytes and sea urchin eggs, fertilization triggers the synthesis of a set of proteins specified by stored maternal mRNAs. One of the most abundant of these (p41) has a molecular weight of 41,000. This paper describes the identification of p41 as the small subunit of ribonucleotide reductase, the enzyme that provides the precursors necessary for DNA synthesis. This identification is based mainly on the amino acid sequence deduced from cDNA clones corresponding to p41, which shows homology with a gene in Herpes Simplex virus that is thought to encode the small subunit of viral ribonucleotide reductase. Comparison with the B2 (small) subunit of Escherichia coli ribonucleotide reductase also shows striking homology in certain conserved regions of the molecule. However, our attention was originally drawn to protein p41 because it was specifically retained by an affinity column bearing the monoclonal antibody YL 1/2, which reacts with alpha-tubulin (Kilmartin, J. V., B. Wright, and C. Milstein, 1982, J. Cell Biol., 93:576-582). The finding that this antibody inhibits the activity of sea urchin embryo ribonucleotide reductase confirmed the identity of p41 as the small subunit. The unexpected binding of the small subunit of ribonucleotide reductase can be accounted for by its carboxy-terminal sequence, which matches the specificity requirements of YL 1/2 as determined by Wehland et al. (Wehland, J., H. C. Schroeder, and K. Weber, 1984, EMBO [Eur. Mol. Biol. Organ.] J., 3:1295-1300). Unlike the small subunit, there is no sign of synthesis of a corresponding large subunit of ribonucleotide reductase after fertilization. Since most enzymes of this type require two subunits for activity, we suspect that the unfertilized oocytes contain a stockpile of large subunits ready for combination with newly made small subunits. 
Thus, synthesis of the small subunit of ribonucleotide reductase represents a very clear example of the developmental regulation of enzyme activity by control of gene expression at the level of translation.
Structural analysis of oligomycin sensitivity-conferring protein (OSCP) revealed repeating sequences (residues 1-89, 105-190) suggesting an evolution of the protein by gene duplication. In addition to the reported homology with the δ-subunit of Escherichia coli F1ATPase, OSCP also shows a certain homology with the b-subunit of E. coli F0 and the ADP/ATP carrier of mitochondria.
We consider a novel 1-dimensional representation of DNA, which is based on graphical representations of the 64 triplets of nucleic acids on the periphery of the unit circle. By using the polar coordinates of the 64 codons (expressed in radians), a four-letter DNA sequence is transformed into a numerical sequence with no more than 64 different entries. By depicting the 1-dimensional representation (plotted on the z-coordinate) and using the x-coordinate as a running index, one obtains a "spectrum-like" graphical representation of DNA. The novel representation of DNA has some advantages over other spectrum-like 1-dimensional and 2-dimensional representations in using the same coordinates for the same codons, thus avoiding the computation of coordinates that is characteristic of Jeffrey's algorithm and of graphical representations of DNA based on its modification.
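The transformation described in this abstract might be sketched as follows. The abstract does not say how the 64 codons are ordered around the circle or which reading frame is used, so the alphabetical ordering over A, C, G, T, the equal angular spacing, and the non-overlapping triplets starting at the first base are all assumptions of this example, as are the names `CODON_ANGLE` and `codon_spectrum`.

```python
import math
from itertools import product

# Assumed layout: codons in alphabetical order, equally spaced on the unit circle.
CODON_ANGLE = {
    "".join(codon): 2 * math.pi * k / 64
    for k, codon in enumerate(product("ACGT", repeat=3))
}


def codon_spectrum(dna):
    """Turn a DNA string into a list of polar angles (radians), one per complete triplet."""
    dna = dna.upper()
    return [CODON_ANGLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)]
```

Plotting the returned values against their index gives a spectrum-like trace in which a given codon always maps to the same height, so no coordinates have to be recomputed along the sequence.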
Bioinformatics is facing a post-genomic era characterized by the release of large amounts of data boosted by the scientific revolution in high throughput technologies. This document presents an approach to deal with such a massive data processing problem in a paradigmatic application from which interesting lessons can be learned. The design of an out-of-core and modular implementation of traditional High-scoring Segment Pairs (HSPs) applications removes the limits of genome size and performs the work in linear time and with controlled computational requirements. Regardless of the expected huge I/O operations, the full system performs faster than state-of-the-art references providing additional advantages such as monitoring and interactive analysis, the exploitation of important intermediate results, and giving the specific nature of the modules, instead of monolithic software, enabling the plugging of external components to squeeze results.
The nucleotide sequences of the chloroplast genes for the alpha, beta and epsilon subunits of wheat chloroplast ATP synthase have been determined. Open reading frames of 1512 bp, 1494 bp and 411 bp are deduced to code for polypeptides of molecular weights 55201, 53796 and 15200, identified as the alpha, beta and epsilon subunits respectively by homology with the subunits from other sources and by amino acid sequencing of the epsilon subunit. The genes for the beta and epsilon subunits overlap by 4 bp. The gene for methionine tRNA is located 118 bp downstream from the epsilon subunit gene. Comparisons of the deduced amino acid sequences of the alpha and beta subunits with those from other species suggest regions of the proteins involved in adenine nucleotide binding.
Complement component C3, when activated on a cell surface via either the classical or the alternative pathway, becomes covalently bound on the cell surface1 as C3b. C3b serves as a cofactor for the activation of the lytic complement components C5-C9 as well as a ligand for the complement receptor type 1 (CR1). Both activities are terminated by factor I which cleaves C3b at two sites to yield iC3b and C3f. iC3b is subsequently cleaved to C3c and C3dg, which is further degraded to C3d, by serum proteases. iC3b and C3dg (and C3d) remain surface bound and can be recognized by at least two other leukocyte receptors, CR3 and CR2 respectively. iC3b was identified later but is probably the most stable C3 fragment bound to cell surfaces. iC3b is also opsonic and its binding to CR3 is considered to play a major role in the clearance of complement-coated targets. (For a more detailed account on complement and C3 degradation, see Refs. 2 and 3.)
In the 1960s, the multifaceted role of cyclic nucleotides and their associated phosphorylation systems became apparent. Similar second messenger functions for Ca2+ were also suggested. In certain favorable instances, such as muscle contraction and neurotransmitter release, these were actually established. The role of Ca2+ in the control of neurotransmitter release has been investigated in great detail. In addition, the control of neuronal excitability and plasticity is of special interest. An effect of Ca2+ on the gating of voltage-gated ion channels has been suggested, but little is known about the initial Ca2+ receptor molecules along the stimulus-response pathway which decodes the information contained in a Ca2+ transient and which translates it into an effect on the gating of ion channels.
Previous work has defined the sequences sufficient to recapitulate the full expression pattern of the endogenous Hoxb-4 gene in transgenic mice. Several distinct regulatory regions have been identified which are responsible for particular aspects of Hoxb-4 expression. In this study I have begun a detailed analysis of the spatially-specific enhancer, region C, able to drive a major subset of Hoxb-4 expression in the mesoderm, central nervous system (CNS) and peripheral nervous system (PNS). Moreover, it is capable of imposing the correct anterior boundary of Hoxb-4 somitic expression on both the Hoxb-4 and hsp68 promoters. By a combination of sequence comparison and mutational analysis of a region C/hsp68-lacZ reporter gene, I have characterised two cis-regulatory elements that are critical for normal region C activity in transgenic mice. The first of these elements is located within an evolutionarily conserved region of the Hoxb-4 intron (CB1). Deletion of CB1 abolished lacZ reporter gene expression in the mesoderm, PNS and in the majority of the CNS of transgenic embryos. DNA electrophoretic-mobility shift assays revealed the presence of two overlapping binding sites for HoxTF and YY1 within CB1. Specific mutation of each binding-site showed that HoxTF is essential for efficient expression of Hoxb-4 in the embryo, whilst the role of YY1 is unclear. UV crosslinking studies suggest that HoxTF binds to DNA as a heterodimer. Reporter gene analysis has shown that an isolated HoxTF binding element is capable of driving a consistent pattern of expression within the nervous system. These results show that the HoxTF element, though necessary, is not sufficient to drive Hoxb-4 mesodermal expression and requires the cooperation of other elements to achieve this. I have identified a second regulatory element, G5, required to achieve proper levels of region C enhancer activity.
A transgenic reporter gene carrying an isolated fragment that contains the HoxTF/YY1 and G5 binding-sites drives a similar pattern of lacZ expression to that seen with the HoxTF site alone. This suggests that further elements are required to specify the Hoxb-4 pattern of mesodermal expression. 3' deletions have mapped these elements to a 269bp fragment that lies 3' to the HoxTF/YY1 site and within the 5' half of the intron. The 3' half of the intron is able to specify a limited pattern of expression in the PNS, implicating the role of a second HoxTF site that is involved in stabilising the activity of the region C enhancer.
We describe our characterization of kin-15 and kin-16, a tandem pair of homologous Caenorhabditis elegans genes encoding transmembrane protein tyrosine kinases (PTKs) with an unusual structure: the predicted extracellular domain of each putative gene product is only about 50 amino acids, and there are no potential autophosphorylation sites in the C-terminal domain. Using lacZ fusions, we found that kin-15 and kin-16 both appear to be expressed during postembryonic development in the large hypodermal syncytium (hyp7) around the time that specific hypodermal cells fuse with hyp7. kin-15 and kin-16 were positioned on the genetic and physical maps, but extrachromosomal arrays containing wild-type kin-15 and/or kin-16 genes were unable to complement candidate lethal mutations. The results suggest that kin-15 and kin-16 may be specifically involved in cell-cell interactions regulating cell fusions that generate the hypodermis during postembryonic development.
The complete nucleotide sequence of a mouse retro-element is presented. The cloned element is composed of 4,834 base pairs (bp) with long terminal repeats of 568 bp separated by an internal region of 3,698 bp. The element did not appear to have any open reading frames that would be capable of encoding the functional proteins that are normally produced by retro-elements. However, some regions of the genome showed some homology to retroviral gag and pol open reading frames. There was no region in VL30 corresponding to a retroviral env gene. This implies that VL30 is related to retrotransposons rather than to retroviruses. The sequence also contained regions that were homologous to known reverse transcriptase priming sites and viral packaging sites. These observations, combined with the known transcriptional capacity of the VL30 promoter, suggest that VL30 relies on protein functions of other retro-elements, such as murine leukemia virus, while maintaining highly conserved cis-active promoter, packaging, and priming sites necessary for its replication and cell-to-cell transmission.
The beta h3 pseudogene of the BALB/c mouse contains sequence defects which prevent transcription and translation to produce a beta-globin. Comparison with other globin gene sequences indicates that beta h3 arose by recombination between an adult beta-globin gene and some significantly diverged globin sequence. Analysis of noncoding sequences shows that the 3' end of mouse beta h3 and the human delta-globin gene are both descended from an ancestral gene, which we call proto-delta. The origin of proto-delta must predate the mammalian radiation. A member of the L1 family of interspersed repetitive elements is inserted into the 3' untranslated delta-homologous sequence in beta h3 from BALB/c. beta h3 is a widespread feature of the rodent beta-globin complex, which has been fixed in the genome for 35 million years. Independent inactivation events produced pseudogenes located between the adult and nonadult beta-globin genes in the rodent, primate, rabbit, and goat lineages. One model to explain the abundance and evolutionary persistence of pseudogenes postulates that the mammalian genome simply has no efficient mechanism for deleting nonessential sequences. Consequently, the genomes of higher eukaryotes have been growing, by the accumulation of duplications, with doubling times of 200 +/- 100 million years.
We report here the complete nucleotide and amino acid sequences for the α1-chain of mouse collagen IV which is 1669 amino acids in length, including a putative 27-residue signal peptide. In comparison with the amino acid sequence for the α2-chain (Saus, J., Quinones, S., MacKrell, A. J., Blumberg, B., Muthkumaran, G., Pihlajaniemi, J., and Kurkinen, M. (1989) J. Biol. Chem. 264, 6318–6324), the two chains of collagen IV are 43% identical. Most of the interruptions of the Gly-X-Y repeat are homologously placed but strikingly show no sequence similarity between the two chains. Availability of the amino acid sequences for human collagen IV allows a detailed comparison of the primary structure of collagen IV and reveals evolutionarily conserved domains of the protein. Between the two species, the α1(IV) chains are 90.6% and the α2(IV) chains are 83.5% identical in sequence. We discuss these data with respect to differential evolution between and within the collagen IV chain types.
We have isolated and sequenced a 2.1-kilobase cDNA encoding 86% of the sequence of alpha-actinin. The cDNA clone was isolated from a chick embryo fibroblast cDNA library constructed in the expression vector lambda gt11. Identification of this sequence as alpha-actinin was confirmed by immunological methods and by comparing the deduced protein sequence with the sequence of several CNBr fragments obtained from adult chicken smooth muscle (gizzard) alpha-actinin. The deduced protein sequence shows two distinct domains, one of which consists of four repeats of approximately 120 amino acids. This region corresponds to a previously identified 50-kDa tryptic peptide involved in formation of the alpha-actinin dimer. The last 19 residues of C-terminal sequence display an homology with the so-called E-F hand of Ca2+-binding proteins. Hybridization analysis reveals only one size of mRNA (approximately 3.5 kilobases) in fibroblasts, but multiple bands in genomic cDNA.
We have determined the primary structure of a phospholipid transfer protein (PLTP) isolated from maize seeds. This protein consists of 93 amino acids and shows internal homology originating in the repetition of (do)decapeptides. By using antibodies against maize PLTP, we have isolated from a cDNA library one positive clone (6B6) which corresponds to the incomplete nucleotide sequence. Another cDNA clone (9C2) was obtained by screening a size-selected library with 6B6. Clone 9C2 (822 base pairs) corresponds to the full-length cDNA of the phospholipid-transfer protein whose mRNA contains 0.8 kilobase. Southern blot analysis shows that the maize genome may contain several PLTP genes. In addition, the deduced amino acid sequence of clone 9C2 reveals the presence of a signal peptide. The significance of this signal peptide (27 amino acids) might be related to the function of the phospholipid-transfer protein. The amino acid sequence of maize PLTP was compared to those isolated from spinach leaves or castor bean seeds which exhibit physicochemical properties close to those of the maize protein. A high homology was observed between the three sequences. Three domains can be distinguished: a highly charged central core (around 40-60), a very hydrophobic N-terminal sequence characteristic of polypeptide-membrane interaction, and a hydrophilic C terminus. A model for plant phospholipid-transfer proteins is proposed in which the phospholipid molecule is embedded within the protein with its polar moiety interacting with the central hydrophilic core of the protein, whereas the N-terminal region plunges within the membrane in the transfer process.
The primary structure of the tetrameric plasma glycoprotein human alpha 2-macroglobulin has been determined. The identical subunits contain 1451 amino acid residues. Glucosamine-based oligosaccharide groups are attached to asparagine residues 32, 47, 224, 373, 387, 846, 968, and 1401. Eleven intrachain disulfide bridges have been placed (Cys25-Cys63, Cys228-Cys276, Cys246-Cys264, Cys255-Cys408, Cys572-Cys748, Cys619-Cys666, Cys798-Cys826, Cys824-Cys860, Cys898-Cys1298, Cys1056-Cys1104, and Cys1329-Cys1444). Cys-447 probably forms an interchain bridge with Cys-447 from another subunit. The beta-SH group of Cys-949 is thiol esterified to the gamma-carbonyl group of Glx-952, thus forming an activatable reactive site which can mediate covalent binding of nucleophiles. A putative transglutaminase cross-linking site is constituted by Gln-670 and Gln-671. The primary sites of proteolytic cleavage in the activation cleavage area (the “bait” region) are located in the sequence: -Arg681-Val-Gly-Phe-Tyr-Glu-. The molecular weight of the unmodified alpha 2-macroglobulin subunit is 160,837 and approximately 179,000, including the carbohydrate groups. The presence of possible internal homologies within the alpha 2-macroglobulin subunit is discussed. A comparison of stretches of sequences from alpha 2-macroglobulin with partial sequence data for complement components C3 and C4 indicates that these proteins are evolutionarily related. The properties of alpha 2-macroglobulin are discussed within the context of proteolytically regulated systems with particular reference to the complement components C3 and C4.
The central role of actin in crucial cellular activities including muscle contraction, locomotion, cytokinesis, maintenance of cell shape and movement of cell surface receptors has been widely studied. Controlled modulation of the actin cytoskeleton is mediated by an array of molecularly diverse actin associated proteins that variously regulate its polymerisation state, geometric organisation and interactions with other ligands. I have cloned cDNAs encoding the transformation-sensitive actin gelating higher molecular weight isoform of a 21kDa polypeptide doublet (protein C4) found uniformly distributed along stress fibres in normal mesenchymal cells. This isoform, designated transgelin, was found to be the product of a single gene, conserved at the nucleotide level in the H sapiens, R norvegicus, D melanogaster, and Aplysia genomes with a single strong band as far back as the fission yeast S pombe. Northern blotting identified a single mRNA that was abundantly expressed in smooth muscle tissues and cultured fibroblasts but was absent in skeletal muscle, thymus and liver tissues. SV40-transformation of 3T3 fibroblasts was found to down-regulate transgelin expression at the level of transcription or mRNA stability. The protein encoded by these cDNAs was found to be significantly related to a number of other proteins (C41, M Smith unpublished; NP25, unpublished EMBL M84725; chick calponin α and β, Takahashi & Nadal-Ginard 1991; and Drosophila mp20, Ayme-Southgate et al 1989) suggesting that they may be classified as members of a new transgelin multigene family.
It is well established that the inhibitory neurotransmitter γ-aminobutyric acid (GABA) mediates many of its effects by binding to the GABAA receptor, which is present on the majority of mammalian brain neurons (Enna, 1983), resulting in the opening of an integral chloride channel. This receptor has considerable pharmaceutical importance as it contains binding sites for anti-convulsant (barbiturates), anxiogenic (β-carbolines) and convulsant (picrotoxin) drugs (reviewed in Turner and Whittle, 1983; Olsen and Venter, 1986). Although it is known that anxiolytic agents such as benzodiazepines also bind to the GABAA receptor, there exists a population of benzodiazepine-insensitive receptors (Study and Barker, 1981; Unnerstall et al., 1981; de Blas et al., 1988). Our understanding of the molecular structure and function of this receptor complex has been greatly increased by the application of recombinant DNA methodology and the subsequent cloning and expression of cDNA clones that encode its constituent subunits. This chapter will review these advances and describe some recent results arising from the use of the cloned sequences.
The computer software used for genomic analysis has become a crucial component of the infrastructure for life sciences. However, genomic software is still typically developed in an ad hoc manner, with inadequate funding, and by academic researchers not trained in software development, at substantial costs to the research community. I examine the roots of the incongruity between the importance of and the degree of investment in genomic software, and I suggest several potential remedies for current problems. As genomics continues to grow, new strategies for funding and developing the software that powers the field will become increasingly essential.
The Essential Bioinformatics Web Services (EBWS) are implemented on a new PHP-based server that provides useful tools for analyses of DNA, RNA, and protein sequences applying a user-friendly interface. Nine Web-based applets are currently available on the Web server. They include reverse complementary DNA and random DNA/RNA/peptide oligomer generators, a pattern sequence searcher, a DNA restriction cutter, a prokaryotic ORF finder, a random DNA/RNA mutation generator. It also includes calculators of melting temperature (TM) of DNA/DNA, RNA/RNA, and DNA/RNA hybrids, a guide RNA (gRNA) generator for the CRISPR/Cas9 system and an annealing temperature calculator for multiplex PCR. The pattern-searching applet has no limitations in the number of motif inputs and applies a toolbox of Regex quantifiers that can be used for defining complex sequence queries of RNA, DNA, and protein sequences. The DNA enzyme digestion program utilizes a large database of 1502 restriction enzymes. The gRNA generator has a database of 25 bacterial genomes searchable for gRNA target sequences and has an option for searching in any genome sequence given by the user. All programs are permanently available online at http://penchovsky.atwebpages.com/applications.php without any restrictions.
Repair of randomly occurring DNA injury in mammalian cells must require sophisticated and elaborate systems in view of the wide spectrum of different types of lesions that have to be recognized and removed, the enormous size of the mammalian genome and the complex chromatin structure, which must undergo reversible alterations for repair to take place. The finding of preferential repair of expressed genes, as recently uncovered by Hanawalt and coworkers for the removal of pyrimidine dimers (Bohr et al., 1985; Mellon et al., 1986), elegantly illustrates how the cell deals with part of these problems: highest priority is given to repair of the most vital regions in the genome, namely those being used actively and of which transcription is hampered by damage in the template. The very recent discovery that the yeast repair gene RAD6 encodes a ubiquitin conjugating enzyme specific for histones 2A and 2B (Jentsch et al., 1987) and thought to be implicated in chromatin remodelling adds to the picture of tight interactions of repair events and chromatin structure and dynamics.
One of the proteins synthesized by the asexual blood stage of malaria which has been studied in some detail is the precursor to the major merozoite surface antigens (PMMSA), a glycoprotein that is synthesized throughout schizogony and present on the merozoite surface in processed form (Holder & Freeman, 1982, 1984; Howard et al., 1985). Complete or partial nucleotide sequences of the gene for this protein from a number of Plasmodium falciparum strains have been reported (Holder et al., 1985; Mackay et al., 1985; Cheung et al., 1985; Weber et al., 1986). Some features of the protein and its gene will be discussed here, in particular studies we have performed with a West African (Wellcome) strain of the parasite.
There are probably as many proteins involved in controlling the assembly, disassembly and interactions of microtubules in cells as there are for actin. But fewer have been studied in great detail and it is possible that some major categories remain to be discovered. In addition to the structural components described below, a number of minor enzymatic components are known to copurify with tubulin. These include a transphosphorylase that can convert bound GDP back to GTP, an enzyme that specifically removes tyrosine from the C-terminus of α-tubulin monomers, and another that can replace this terminal residue.
This conference is dedicated to examining new methods for the isolation and characterization of proteins. One extremely effective method for the characterization of a new protein involves the comparison of its amino acid sequence with the sequences of previously determined proteins. Although this method is not new (it dates back to the early days of protein sequencing methodology), the wealth of information available is only recently being fully appreciated. The rapid increase in the accumulation of sequence data, owing to recombinant DNA technology, has greatly heightened interest in this area and has made large database searching a much more fruitful enterprise. The primary structures of well over 3,000 proteins containing almost three quarters of a million residues are now known, more than double what was known just 5 years ago.
Our understanding of the molecular biology of viruses has always advanced through a combination of genetics, biochemistry, and structural analysis. For many viruses, genetic and biochemical approaches have been hampered by inadequate tissue culture systems for growth of the virus or hazard to the investigator of working with a particularly pathogenic virus. The advent of molecular cloning has permitted nucleotide sequence analysis of almost any virus. Even if only very small amounts of viral nucleic acid can be obtained, this can be cloned and produced in unlimited amounts. As a result of recent increases in the speed of nucleic acid sequencing, particularly the shotgun strategy using the M13 cloning/dideoxynucleotide chain-termination method, it is now quite common for the first detailed information on a particular viral genome to come from DNA sequencing. This technology is being applied to the larger, less well understood viruses such as the herpesviruses and has resulted in the sequences of large regions of herpes simplex virus, varicella zoster virus, human cytomegalovirus, and the complete sequence of Epstein-Barr virus (EBV). The complete sequences of all these human herpesviruses (genome sizes of the order of 100–240 kilobase pairs) will be known within the next few years. Because the DNA sequencing is breaking new ground with little or no previous genetics, it is useful and important to assess what information can be interpreted from a DNA sequence and how this knowledge can be used to design further experiments to understand these sequences.
RNA-binding proteins (RBPs) play a crucial role in key cellular processes, including RNA transport, splicing, polyadenylation and stability. Understanding the interaction between RBPs and RNA is key to improve our knowledge of RNA processing, localization and regulation in a global manner. Despite advances in recent years, a unified non-redundant resource that includes information on experimentally validated motifs, RBPs and integrated tools to exploit this information is lacking. Here, we developed a database named ATtRACT (available at http://attract.cnic.es) that compiles information on 370 RBPs and 1583 RBP consensus binding motifs, 192 of which are not present in any other database. To populate ATtRACT we (i) extracted and hand-curated experimentally validated data from CISBP-RNA, SpliceAid–F, RBPDB databases, (ii) integrated and updated the unavailable ASD database and (iii) extracted information from Protein-RNA complexes present in Protein Data Bank database through computational analyses. ATtRACT provides also efficient algorithms to search a specific motif and scan one or more RNA sequences at a time. It also allows discovering de novo motifs enriched in a set of related sequences and compare them with the motifs included in the database.
Database URL: http://attract.cnic.es
The Notch locus is essential for proper differentiation of the ectoderm in Drosophila melanogaster. Notch corresponds to a 37-kilobase transcription unit that codes for a major 10.4-kilobase polyadenylated RNA. The DNA sequence of this transcription unit is presented, except for portions of the two largest intervening sequences. DNA sequences also were obtained from three Notch cDNA clones, allowing the 5' and 3' ends of the gene to be mapped, and the structures and locations of nine RNA coding regions to be determined. The major Notch transcript encodes a protein of 2,703 amino acids. The protein is probably associated with cell surfaces and carries an extracellular domain composed of 36 cysteine-rich repeating units, each of about 38 amino acids. The gene appears to have evolved by repeated tandem duplications of the DNA coding for the 38-amino-acid-long protein segments, followed by insertion of intervening sequences. These repeating protein segments are quite homologous to portions of mammalian clotting factors IX and X and to the product of the Caenorhabditis elegans developmental gene lin-12. They are also similar to mammalian growth hormones, typified by epidermal growth factor.
The amino acid sequences of the putative polypeptides of maize streak virus (MSV) have been systematically compared with those of cassava latent virus (CLV) and tomato golden mosaic virus (TGMV) using the programme DIAGON (8).
Conserved sequences have been detected between peptides encoded by the complementary (-) sense of MSV and those of CLV and TGMV, viz; the 40 200 Mr polypeptide of CLV-1 (3) and the 40 285 Mr polypeptide of TGMV-A (4) show extensive homologies with the 17 768 Mr and 31 388 Mr polypeptides of MSV (6).
Distant and variable homologies have been detected between the putative coat protein of MSV when compared with those of CLV and TGMV. No other relationships between the potential gene products of MSV and those of CLV and TGMV have been detected.
The extensive homologies detected between the complementary sense encoded peptides suggest that they are derived from functional genes, and that the directly conserved sequences may contain amino acids essential to the function of these proteins. The less extensive homologies among the putative coat proteins are considered in relation to their possible structures and functions.
A simple method for detecting similarities in sequences is described. It is used (1) to provide similarity measures for classifying 25 cytochromes by their amino acid sequences, (2) to detect repetitions in the amino acid sequences of various proteins, and (3) to detect regions of possible base-pairing in the nucleotide sequence of a nucleic acid.
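The dot-matrix idea behind this method can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: residues are compared for exact identity, and the function names `dot_matrix` and `longest_diagonal_run` are hypothetical, not taken from the paper.

```python
def dot_matrix(seq_a, seq_b):
    """Return a matrix with 1 where seq_a[i] equals seq_b[j], else 0."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def longest_diagonal_run(matrix):
    """Length of the longest unbroken run of 1s along any diagonal.

    A long diagonal run corresponds to a stretch of identical residues
    shared by the two sequences (or a repeat, if a sequence is compared
    with itself).
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    # run[i + 1][j + 1] holds the length of the run ending at cell (i, j).
    run = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j]:
                run[i + 1][j + 1] = run[i][j] + 1
                best = max(best, run[i + 1][j + 1])
    return best
```

Printing each row of `dot_matrix("ACGT", "TACG")` shows the shared segment ACG as a diagonal of 1s, which is the visual signal a dot plot relies on.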
Given two finite sequences, we wish to find the longest common subsequences satisfying certain deletion/insertion constraints. Consider two successive terms in the desired subsequence. The distance between their positions must be the same in the two original sequences for all but a limited number of such pairs of successive terms. Needleman and Wunsch gave an algorithm for finding longest common subsequences without constraints. This is improved from the viewpoint of computational economy. An economical algorithm is then elaborated for finding subsequences satisfying deletion/insertion constraints. This result is useful in the study of genetic homology based on nucleotide or amino-acid sequences.
We present interactive computer programs for the analysis of nucleic acid sequences. Minimal computer experience is sufficient to use these programs. The nucleotide sequence of the human gamma globin gene complex is used as an example to illustrate the data analysis.
We describe a computer program designed to facilitate the pattern matching analysis of homologies between DNA sequences. It takes advantage of a two-dimensional plot in order to simplify the evaluation of significant structures inherited in the sequences. The program can be divided into three parts: (i) algorithm for search of homologies, (ii) two-dimensional graphic display of the result, (iii) further graphic treatment to enhance significant structures.
The power of the graphic display is presented by the following application of the program. We have conducted a search for
direct repeats in the mouse immunoglobulin K-chain genes. Both the five J DNA sequences and other shorter repeats were found.
We also found a longer stretch of homology that could indicate the presence of duplicated DNA in the J4, J5 region.
An interactive system for computer analysis of nucleic acid and protein sequences has been developed for the Los Alamos DNA
Sequence Database. It provides a convenient way to search or verify various sequence features, e.g., restriction enzyme sites,
protein coding frames, and properties of coded proteins. Further, the comprehensive analysis package on a large-scale database
can be used for comparative studies on sequence and structural homologies in order to find unnoted information stored in nucleic
We describe a computer program designed to facilitate the analysis of nucleic acid sequences. The program can search several nucleic acid sequences for oligonucleotides common to all of them. It can examine a DNA or RNA sequence for two kinds of homologous regions--repetitions and dyad symmetries. The homologies need not be perfect: mismatches and "looping out" of nucleotides are allowed. The program also finds (A+T)- and (G+C)-rich regions, locates restriction enzyme recognition sites, determines the distribution of di- and trinucleotides, and performs various other functions. We include two representative applications of the program. All published prokaryotic transcription termination sequences (June 1977) were found to share the following features: (i) a string of at least five T residues, (ii) the sequence CGGGC or a close analog immediately preceding the T cluster, (iii) a region of strong dyad symmetry preceding the Ts and including the CGGGC sequence. A sequence of 221 nucleotides consisting of the Escherichia coli trp promoter, operator, and leader was found to contain two strong dyad symmetries. These homologies both occur at known regulatory sites; no comparable homologies occur in regions without regulatory significance.
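As a rough illustration of what a search for dyad symmetries involves, the Python sketch below looks for a short segment whose reverse complement occurs a little further downstream. It is not the program described above: it accepts perfect matches only (no mismatches or looped-out nucleotides), and the names `reverse_complement` and `find_dyad_symmetries` and the parameters `arm` and `max_loop` are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_dyad_symmetries(seq, arm=4, max_loop=10):
    """Return (left_start, right_start) pairs of perfect inverted repeats.

    Each hit is a segment of length `arm` whose reverse complement begins
    between 0 and `max_loop` bases after the segment ends.
    """
    hits = []
    n = len(seq)
    for i in range(n - 2 * arm + 1):
        target = reverse_complement(seq[i:i + arm])
        for loop in range(max_loop + 1):
            j = i + arm + loop
            if j + arm > n:
                break  # the right arm would run past the end of the sequence
            if seq[j:j + arm] == target:
                hits.append((i, j))
    return hits
```

With `arm=5`, the toy sequence `CGGGCAAAAGCCCG` gives one hit whose left arm starts at position 0 and whose right arm starts at position 9.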
With modern fast sequencing techniques (1,2) and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps. This paper describes computer programs that can be used to order both sequence gel readings and clones. A method of coding for uncertainties in gel readings is described. These programs are available on request.
The speed of the new DNA sequencing techniques has created a need for computer programs to handle the data produced. This
paper describes simple programs designed specifically for use by people with little or no computer experience. The programs
are for use on small computers and provide facilities for storage, editing and analysis of both DNA and amino acid sequences.
A magnetic tape containing these programs is available on request.
An improved method for testing similarities or repeats in protein sequences is described. It includes three features: a measure of similarity for amino acids, based on observed substitutions in homologous proteins; a search procedure which compares all pairs of segments of two proteins; new statistical tests which estimate the probabilities that observed correlations could have occurred by chance. Calculations show that gene duplication has probably not occurred in plant ferredoxins; phage Qβ and f2 coat proteins may be homologous; and repeats in cytochrome c are not statistically significant. The method predicted an alignment of cytochrome c and c551 sequences which later appeared consistent with Dickerson's atomic model of horse cytochrome c.
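Comparing all pairs of segments can be pictured as sliding a fixed-length window along both sequences and scoring each pair of windows. The Python sketch below is a simplified illustration, not the published method: the default score is plain identity (1 for a match, 0 otherwise) rather than a substitution-based similarity measure, no statistical test is applied, and the function name and parameters are hypothetical.

```python
def windowed_dots(seq_a, seq_b, window=3, threshold=2, score=None):
    """Return (i, j) start positions of window pairs scoring >= threshold.

    `score` is any function giving a similarity value for two residues;
    by default it is simple identity.
    """
    if score is None:
        score = lambda a, b: 1 if a == b else 0
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in range(window))
            if total >= threshold:
                dots.append((i, j))
    return dots
```

Raising `threshold` towards `window` keeps only near-perfect segments, while lowering it admits weaker similarities at the cost of more background dots.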
A method for optimally locating gaps in the amino acid sequences of homologous proteins is presented. The method involves three steps: (1) demonstration that the sequences are indeed homologous, (2) location of regions where the homologous pairing is reasonably certain, and (3) location of gaps between these regions so as to minimize the total number of mutations required to account for the differences between the two sequences. The major virtues of this procedure are that the assertion of homology does not depend upon the prior introduction of gaps and that a genetic rather than a chemical test is the basis for asserting a genetic relationship.
A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed. From these findings it is possible to determine whether significant homology exists between the proteins. This information is used to trace their possible evolutionary development. The maximum match is a number dependent upon the similarity of the sequences. One of its definitions is the largest number of amino acids of one protein that can be matched with those of a second protein allowing for all possible interruptions in either of the sequences. While the interruptions give rise to a very large number of comparisons, the method efficiently excludes from consideration those comparisons that cannot contribute to the maximum match. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through the array. For this maximum match only certain of the possible pathways must be evaluated. A numerical value, one in this case, is assigned to every cell in the array representing like amino acids. The maximum match is the largest number that would result from summing the cell values of every pathway.
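The maximum match described here can be sketched with a small dynamic-programming table. The Python code below is a hedged illustration rather than the authors' procedure: it uses the now-common forward recurrence, assigns 1 to cells with like residues, and applies no penalty for interruptions; the function name `maximum_match` is an assumption made for this example.

```python
def maximum_match(seq_a, seq_b):
    """Largest number of residues of seq_a that can be matched, in order,
    with identical residues of seq_b, allowing interruptions in either."""
    rows, cols = len(seq_a), len(seq_b)
    # score[i][j] is the best pathway sum using the first i residues of
    # seq_a and the first j residues of seq_b.
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cell = 1 if seq_a[i - 1] == seq_b[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + cell,  # pair the two residues
                score[i - 1][j],             # interruption in seq_b
                score[i][j - 1],             # interruption in seq_a
            )
    return score[rows][cols]
```

Each table entry depends only on its three neighbours, so pathways that cannot contribute to the maximum are never enumerated explicitly; the work grows with the product of the two sequence lengths.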
The properties of a small internal deletion mutant, E675, have been exploited in the molecular cloning of the unc-54 gene. This mutation uniquely identifies unc-54 sequences as molecules of altered length in E675 and provides a genetic and physical marker for the active gene, its messenger RNA and its protein product.
The enhanced graphic matrix procedure analyzes nucleic acid and amino acid sequences for features of possible biological interest and reveals the spatial patterns of such features. When a sequence is compared to itself the technique shows regions of self-complementarity, direct repeats, and palindromic subsequences. Comparison of two different sequences, exemplified by immunoglobulin kappa light chain genes, by using colored graphic matrices showed domains of similarity, regions of divergence, and features explainable by transpositions. Analysis of mouse constant domain immunoglobulin sequences revealed self-complementary regions that can be used to fold the molecule into a structure consistent with electron microscopic observations. Computer translation of nucleic acid sequences into all possible amino acid sequences followed by graphic matrix analysis provides a way to detect the most likely protein encoding regions and can predict the correct reading frames in sequences in which splicing patterns are not defined. Application of this technique to regions of simian virus 40 and polyoma virus demonstrates the frames of translation and shows the agreement of sequences determined in separate laboratories with different virus isolates. The graphic matrix technique can also be used to assemble fragmentary sequences during determination, to display local variations in base composition, to detect distant evolutionary relationships, and to display intragenic variation in rates of evolution.
We present an algorithm--a generalization of the Needleman-Wunsch-Sellers algorithm--which finds within longer sequences all
subsequences that resemble one another locally. The probability that so close a resemblance would occur by chance alone is
calculated and used to classify these local homologies according to statistical significance. Repeats and inverted repeats
may also be found. Results for both random and biological nucleic acid sequences are presented. Fourteen complete genomes
are analyzed for dyad symmetries.
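As a rough illustration of searching for local resemblances, the sketch below uses a local-similarity recurrence in the style of Smith and Waterman. This is a swapped-in technique rather than the algorithm of the abstract above, and it omits the statistical significance step entirely. The scoring values (`match`, `mismatch`, `gap`) and the function name `best_local_score` are assumptions chosen for the example.

```python
def best_local_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (score, end_a, end_b) for the best-scoring local alignment.

    end_a and end_b are 1-based positions where the best alignment ends.
    Scores are floored at zero so an alignment can start anywhere.
    Assumption: linear gap penalty and simple match/mismatch scores.
    """
    prev = [0] * (len(seq_b) + 1)
    best = (0, 0, 0)
    for i in range(1, len(seq_a) + 1):
        curr = [0]
        for j in range(1, len(seq_b) + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score = max(
                0,                 # start a new local alignment here
                prev[j - 1] + pair,  # extend along the diagonal
                prev[j] + gap,     # gap in seq_b
                curr[j - 1] + gap,  # gap in seq_a
            )
            curr.append(score)
            if score > best[0]:
                best = (score, i, j)
        prev = curr
    return best
```

For example, `best_local_score("TTTACGTTT", "GGGACGGGG")` reports the shared `ACG` segment ending at position 6 in both sequences.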
A FORTRAN program to analyze homology of letter strings (nucleotide or amino acid sequences) and to display the result in
the form of a dot matrix is presented. The program is generally usable, user-friendly and has a number of options (filtering,
“fudging,” i.e., consideration of groups of homologous residues, and screening, i.e., display of only particular groups of
residues) which greatly potentiate its analytical power.
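A dot matrix with a simple filter can be sketched as follows. This is a minimal Python illustration, not the FORTRAN program described above: the `window` and `min_matches` parameters are assumed names for one common form of filtering, in which a dot is plotted only where enough residues agree along a short diagonal segment. Grouping of similar residues ("fudging") is not modeled.

```python
def dot_matrix(seq_a, seq_b, window=3, min_matches=3):
    """Return (i, j) start positions (0-based) where a diagonal segment of
    length `window` has at least `min_matches` identical residues."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


def render(dots, n_rows, n_cols):
    """Draw the dots as text: rows index seq_a, columns index seq_b."""
    marked = set(dots)
    return "\n".join(
        "".join("*" if (i, j) in marked else "." for j in range(n_cols))
        for i in range(n_rows)
    )
```

Setting `min_matches` equal to `window` keeps only segments of perfect identity; lowering it lets weaker similarities through at the cost of more background dots.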
This paper describes a computer method that uses codon preference to help find protein coding regions in long DNA sequences.
The method can distinguish between introns and exons and can help to detect sequencing errors.
This paper describes a new way of storing DNA gel reading data and an accompanying set of computer programs. These programs
will perform all the manipulations that are required on data gained by the so-called ‘shotgun’ method of DNA sequencing. This
system simplifies the computer processing involved with this sequencing method and also has the capability of being able at
any time during a project to display, lined up in register, all the gel readings covering any section of the sequence.
Macleod, A.R., Karn, J. and Brenner, S. (1981) Nature 291, 386-390.
Jagadeeswaran, P. and McGuire, P.M., Jr. (1982) Nucl. Acids Res. 10,
Gibbs, A.J. and McIntyre, G.A. (1970) Eur. J. Biochem. 16, 1-11.
Novotny, J. (1982) Nucl. Acids Res. 10, 127-131.
Sellers, P.H. (1974) J. Appl. Math. (SIAM) 26, 787-793.