Article

A new computer method for the storage and manipulation of DNA gel reading data

Author: Rodger Staden

Abstract

This paper describes a new way of storing DNA gel reading data and an accompanying set of computer programs. These programs will perform all the manipulations that are required on data gained by the so-called ‘shotgun’ method of DNA sequencing. This system simplifies the computer processing involved with this sequencing method and can also, at any time during a project, display, lined up in register, all the gel readings covering any section of the sequence.

... The main objective of these methods is to reconstruct the original DNA sequence from the small DNA fragments by maximising the overlap scores between consecutive sequences. Greedy methods were first applied to the DNA fragment assembly problem by Staden (1980); afterwards, metaheuristics were widely used to solve the same problem. Gheraibia et al (2013) proposed the Penguins Search Optimisation Algorithm (PeSOA), based on the collaborative hunting strategy of penguins. ...
... Greedy methods were the first work on the DNA assembly problem, and these kinds of methods can be easily implemented. The best-known and most widely used greedy method in DNA fragment assembly was proposed by Staden (1980) (see Figure 1). Genetic algorithms have also been widely used to solve the DNA fragment assembly problem. ...
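The greedy strategy described in these excerpts can be illustrated in a few lines of Python. This is a toy sketch under simplifying assumptions (exact suffix/prefix overlaps, no sequencing errors, no reverse complements); `overlap` and `greedy_assemble` are invented names, not code from any of the cited assemblers.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if i is None:  # no overlaps left: concatenate what remains
            return "".join(frags)
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0]

reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # → ATTAGACCTGCCGGAATAC
```

On these four toy reads the greedy merges recover the superstring in three steps; real gel readings contain errors, which is why practical assemblers score overlaps with alignment rather than exact matching.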
Article
DNA Fragment Assembly (DFA) is the process of finding the best order and orientation of a set of DNA fragments in order to reconstruct the original DNA sequence from them. As it has to consider all possible combinations among the DNA fragments, it is considered a combinatorial optimisation problem. This paper presents a method showing the use of the Penguins Search Optimisation Algorithm (PeSOA) for the DNA fragment assembly problem. Penguins search optimisation is a nature-inspired metaheuristic algorithm based on the collaborative hunting strategy of penguins. The approach starts by generating a random initial population. After that, the population is divided into several groups, and each group contains a set of active fragments on which the penguins concentrate their search. The search process of the penguin optimisation algorithm is controlled by the oxygen reserves of the penguins. During the search, each penguin shares its best found solution with the other penguins to converge quickly to the global optimum. In this paper, the authors adapt the original PeSOA algorithm to obtain a new algorithm structure for the DNA assembly problem. The effectiveness of the proposed approach has been verified by applying it to well-known benchmarks for the DNA assembly problem. The results show that the proposed method performs well compared with the most widely used DNA fragment assembly methods.
... The error rate is remarkably reduced in this encoding scheme. Staden (1980) [44] described a new way of storing DNA gel reading data. Shin and Pierce (2004) [41] described a DNA scaffold that supports a one-dimensional array of independently and reversibly addressable sites at 7 nm spacing. ...
Technical Report
This is about my current project, "Feature Extraction Method of Retinal Microvasculature", published in the annual research update magazine (Sustainable Community Transformation) of University Malaysia Sarawak (UNIMAS).
... The error rate is remarkably reduced in this encoding scheme. Staden (1980) [44] described a new way of storing DNA gel reading data. Shin and Pierce (2004) [41] described a DNA scaffold that supports a one-dimensional array of independently and reversibly addressable sites at 7 nm spacing. ...
Article
Full-text available
DNA (Deoxyribonucleic Acid) computing is a recent computing technique, also referred to as biomolecular computing or molecular computing. DNA computing is a new avenue for solving computational problems by manipulating distinct nanoscopic molecules, and nowadays its approaches are being employed to solve combinatorial problems by exploiting the parallelism and high-density storage characteristics of DNA. Besides, DNA is considered the most feasible substance for shaping nanoscopic materials, manufacturing distinct nanomechanical devices, and formulating large-scale nanostructures, owing to its expedient structural features and molecular recognition properties. This paper provides a concise discussion of the splendid advances in constructing nanoelectronics employing the DNA computing paradigm and of the challenges of DNA computing.
... In the 1980s, the advent of information technology laid the foundations of bioinformatics. Computing power made it possible to automate the general principle of overlapping sequences by similarity using dedicated computer programs [16,161], the first genome assemblers. In 1980, to describe the data obtained after assembly of shotgun sequencing reads, Staden coined the word 'contig'. ...
... "[...] The gel readings in a contig can be summed to form a contiguous consensus sequence and the length of this sequence is the length of the contig" [161]. This is probably the first outline of the general principle of Overlap-Layout-Consensus (OLC) used to infer the original sequence from a subset of sequences. ...
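The quoted principle, summing gel readings lined up in register into a consensus, can be sketched as a column-wise majority vote. This is a toy illustration that assumes read offsets are already known from the overlap/layout stage and that every column is covered by at least one read; `consensus` is a hypothetical helper, not the implementation in Staden's programs.

```python
from collections import Counter

def consensus(placed_reads):
    """Majority-vote consensus from reads 'lined up in register'.

    `placed_reads` is a list of (offset, sequence) pairs, with offsets
    assumed known from a prior overlap/layout stage.
    """
    length = max(off + len(seq) for off, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for off, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[off + i][base] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)

# Three overlapping readings covering one contig:
print(consensus([(0, "ATTAGACC"), (5, "ACCTGCCG"), (10, "CCGGAATAC")]))
# → ATTAGACCTGCCGGAATAC
```

The length of the returned string is the length of the contig, exactly as the quoted passage defines it.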
Article
Full-text available
Genomes represent the starting point of genetic studies. Since the discovery of DNA structure, scientists have devoted great efforts to determine their sequence in an exact way. In this review we provide a comprehensive historical background of the improvements in DNA sequencing technologies that have accompanied the major milestones in genome sequencing and assembly, ranging from early sequencing methods to Next-Generation Sequencing platforms. We then focus on the advantages and challenges of the current technologies and approaches, collectively known as Third Generation Sequencing. As these technical advancements have been accompanied by progress in analytical methods, we also review the bioinformatic tools currently employed in de novo genome assembly, as well as some applications of Third Generation Sequencing technologies and high-quality reference genomes.
... Alternatively, DNA was labelled at the 5' end by treatment with calf intestinal phosphatase and subsequent rephosphorylation with T4 polynucleotide kinase and [γ-32P]ATP (Maniatis et al., 1982). DNA sequences were stored and analysed using the DBUTIL and other computer programs of Staden (1980). S1 nuclease mapping: this was carried out according to Berk and Sharp (1977) and Weaver and Weissman (1979). ...
Article
The mRNA sequence of the human intrinsic clotting factor IX (Christmas factor) has been completed and is 2802 residues long, including a 29 residue long 5′ non-coding and a 1390 residue long 3′ non-coding region, but excluding the poly(A) tail. The factor IX gene is approximately 34 kb long and we define, by the sequencing of 5280 residues, the presumed promoter region, all eight exons, and some intron and flanking sequence. Introns account for 92% of the gene length and the longest is estimated to be 10 100 residues. Exons conform roughly to previously designated protein regions, but the catalytic region of the protein is coded by two separate exons. This differs from the arrangement in the other characterized serine protease genes which are further subdivided in this region.
... Complete or near-complete genome assemblies of noncultivable members of microbial communities have been sometimes obtained using iterative assembly procedures (Pelletier et al. 2008); however, this only works for members that are largely overrepresented in the community. Recently, the availability of new methods to "bin" metagenomes into sets of groups of contiguous sequences ("contigs" [Staden 1980]) hypothetically coming from different species has triggered a shift from gene-centric to "genome-centric" approaches, where the aim is now to assemble and characterize the genome of each member of the community in order to understand its metabolic activities (Waldor et al. 2015). The most popular binning approaches are based on the guanine + cytosine (GC) content and/or on coverage (i.e., how often a given stretch of DNA is represented among the sequence reads), assuming that the abundance of each species in the mix should endow it with a characteristic coverage signature (Albertsen et al. 2013;Alneberg et al. 2014). ...
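The GC-content/coverage binning idea described in this excerpt can be caricatured as a tiny single-linkage grouping on two features. This is an illustrative toy, not the method of Albertsen et al. or Alneberg et al., which cluster much richer feature vectors; all function names and thresholds here are invented.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

def bin_contigs(contigs, gc_tol=0.05, cov_tol=0.5):
    """Toy binning on (GC content, mean coverage).

    `contigs` maps name -> (sequence, mean read coverage).  A contig joins
    an existing bin when its GC differs by less than `gc_tol` and its
    coverage by less than `cov_tol` (relative) from the bin's founder.
    """
    bins = []
    for name, (seq, cov) in contigs.items():
        gc = gc_content(seq)
        for bgc, bcov, members in bins:
            if abs(gc - bgc) < gc_tol and abs(cov - bcov) / bcov < cov_tol:
                members.append(name)
                break
        else:
            bins.append((gc, cov, [name]))
    return [members for _, _, members in bins]
```

For example, two contigs with similar GC and coverage land in one bin, while a low-GC, low-coverage contig founds its own, mimicking the "characteristic coverage signature" assumption in the excerpt.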
... An overlap is typically defined as a maximally scoring alignment between two strings that allows arbitrary orientation and offset of the reads [74]. For two reads S1 and S2, both of length O(L), the fastest method for determining their optimal alignment is Smith-Waterman (SW) dynamic programming, which has a computational complexity of O(L²) [75]. ...
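The O(L²) dynamic program referred to here is the textbook Smith-Waterman recurrence. A minimal score-only sketch (illustrative scoring parameters, no traceback):

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-1):
    """Textbook O(len(s1) * len(s2)) Smith-Waterman local alignment score."""
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # → 8 (four matches at +2 each)
```

The double loop over both read lengths is exactly where the quadratic cost comes from, which is why long-read overlappers such as MHAP avoid all-pairs SW in favour of sketching.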
Preprint
Full-text available
We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
... Beyond assembly, a key challenge with metagenomics is grouping contigs into genome bins. We use "contig" in the way it was originally defined by Rodger Staden, where a contig is a set of overlapping segments of DNA from shotgun sequencing [12]. It is rare for a complete genome to be assembled into a single piece de novo from short reads, so contigs are grouped into "bins," often based on coverage and tetranucleotide frequencies. ...
Article
Full-text available
Metagenomics facilitates the study of the genetic information from uncultured microbes and complex microbial communities. Assembling complete genomes from metagenomics data is difficult because most samples have high organismal complexity and strain diversity. Some studies have attempted to extract complete bacterial, archaeal, and viral genomes and often focus on species with circular genomes so they can help confirm completeness with circularity. However, less than 100 circularized bacterial and archaeal genomes have been assembled and published from metagenomics data despite the thousands of datasets that are available. Circularized genomes are important for (1) building a reference collection as scaffolds for future assemblies, (2) providing complete gene content of a genome, (3) confirming little or no contamination of a genome, (4) studying the genomic context and synteny of genes, and (5) linking protein coding genes to ribosomal RNA genes to aid metabolic inference in 16S rRNA gene sequencing studies. We developed a semi-automated method called Jorg to help circularize small bacterial, archaeal, and viral genomes using iterative assembly, binning, and read mapping. In addition, this method exposes potential misassemblies from k-mer based assemblies. We chose species of the Candidate Phyla Radiation (CPR) to focus our initial efforts because they have small genomes and are only known to have one ribosomal RNA operon. In addition to 34 circular CPR genomes, we present one circular Margulisbacteria genome, one circular Chloroflexi genome, and two circular megaphage genomes from 19 public and published datasets. We demonstrate findings that would likely be difficult without circularizing genomes, including that ribosomal genes are likely not operonic in the majority of CPR, and that some CPR harbor diverged forms of RNase P RNA. 
Code and a tutorial for this method are available at https://github.com/lmlui/Jorg and on the DOE Systems Biology KnowledgeBase as a beta app.
... Genome assembly from high-throughput sequencing (HTS) reads is a fundamental yet challenging computational problem in genome research. The two major frameworks of assembly methods are the overlap-layout-consensus (OLC) paradigm (Staden 1980) and the de Bruijn graph (DBG) representation (Waterman 1995; Pevzner et al. 2001) of k-mers. ...
Article
Full-text available
Genome assembly from high-throughput sequencing (HTS) reads is a fundamental yet challenging computational problem. An intrinsic challenge is the uncertainty caused by widespread repetitive elements. Here we get around the uncertainty using the notion of uniquely mapped (UM) reads, which motivated the design of a new assembler, BAUM. It mainly consists of two types of iterations. The first type constructs initial contigs from a reference, say a genome of a species that could be quite distant, by adaptive read mapping, filtration by the reference's unique regions, and reference updating. A statistical test is proposed to split the layouts at possible structural variation sites. The second type of iterations includes mapping, scaffolding/contig extension, and contig merging. We extend each contig by locally assembling the reads whose mates are uniquely mapped to an end of the contig. Instead of the de Bruijn graph method, we take the overlap-layout-consensus (OLC) paradigm. The OLC is implemented by parallel computation and has linear complexity with respect to the number of contigs. Adjacent extended contigs are merged if their alignment is confirmed by the adjusted gap distance. Throughout the assembly, the mapping criterion is selected by probabilistic calculations. These innovations can be used as a complement to existing de novo assemblers. Applying this novel method to the assembly of the wild rice Oryza longistaminata genome, we achieved a much improved contig N50 of 18.8 kb compared with other assemblers. The assembly was further validated by contigs constructed from an independent library of long 454 reads.
Chapter
Biofuels, that is, nonconventional liquid and gaseous fuels derived from renewable sources, such as crop plants, forest products, algae, or waste materials, are widely promoted as a sustainable alternative to fossil fuels and a means to secure our energy supply (Tilman et al. 2009). Other potential benefits of biofuel production are the creation of new, local employment (e.g., in rural areas) and the reduction of emissions of greenhouse gases (Fargione et al. 2008; Duke et al. 2013). A broader definition of biofuels (that we will use in this chapter) includes fuels produced from other renewable sources, such as carbon dioxide (CO2), exploited by biodiesel- or hydrogen (H2)-producing phototrophic organisms.
... ESTs were clustered separately for each of the seven rectal gland libraries using CAP3 (Huang and Madan, 1999). We refer to the clustered ESTs as contigs, after the definition of Staden (1980), which comprised both clustered EST and un-clustered singletons. Contigs were subjected to BLAST analysis (NCBI BLAST; Altschul et al., 1997) against protein and sequence databases using ad hoc Perl scripts based on the BioPerl programming interface (Stajich et al., 2002) to run BLAST analysis. ...
Thesis
Full-text available
Elasmobranchs (sharks, skates, and rays) are a primarily carnivorous group of vertebrates that consume very few carbohydrates and have little reliance on glucose as an oxidative fuel, the one exception being the rectal gland. This has led to a dearth of information on glucose transport and metabolism in these fish, as well as the presumption of glucose intolerance. Given their location on the evolutionary tree however, understanding these aspects of their physiology could provide valuable insights into the evolution of glucose homeostasis in vertebrates. In this thesis, the presence of glucose transporters in an elasmobranch was determined and factors regulating their expression were investigated in the North Pacific spiny dogfish (Squalus suckleyi). In particular, the presence of a putative GLUT4 transporter, which was previously thought to have been lost in these fish, was established and its mRNA levels were shown to be upregulated by feeding (intestine, liver, and muscle), glucose injections (liver and muscle), and insulin injections (muscle). These findings, along with that of increases in muscle glycogen synthase mRNA levels and muscle and liver glycogen content, indicate a potentially conserved mechanism for glucose homeostasis in vertebrates, and argue against glucose intolerance in elasmobranchs. In contrast to the other tissues examined, there was a decrease in glut4 mRNA levels within the rectal gland in response to natural feeding, a factor known to activate the gland, suggesting mRNA storage for rapid protein synthesis upon activation. A similar trend was also shown for sglt1 in the rectal gland, and the ability of GLUT and SGLT inhibitors to prevent chloride secretion solidified the importance of glucose uptake for gland function. 
The exogenous factor of salinity was also investigated and high levels of glut mRNA were observed within the rectal glands of low salinity-acclimated fish relative to control and high salinity fish, reiterating the idea of mRNA storage when the gland is expected to be inactive. Taken together, the results of this thesis demonstrate that glucose is an important fuel in the dogfish (and likely other elasmobranchs) and that the dogfish is fully capable of regulating its storage and circulation, contrary to prior beliefs.
... The great majority of de novo assembly algorithms are based on the OLC paradigm or on the DBG, and both approaches exploit the overlap among sequenced reads to reconstruct the entire genome structure. The OLC is a computational procedure, introduced by Staden [82] and extended by many other scientists, made of three steps: first finding the overlaps (O) among all reads, then creating a layout (L) of all the reads and their overlaps in a graph, and finally inferring the consensus (C) sequence. In OLC, the overlap between pairs of reads is calculated explicitly by all-against-all pairwise read alignment, and in the resulting graph two nodes (reads) are linked when their overlap is longer than a length cutoff. ...
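The overlap (O) step described here, all-against-all comparison with a length cutoff, can be sketched as follows. This toy version uses exact suffix/prefix matches in place of alignment and invented helper names:

```python
def overlap_len(a, b, min_len):
    """Longest suffix of `a` equal to a prefix of `b` (exact-match sketch)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len=4):
    """All-against-all pairwise comparison; edge (i, j) when the suffix of
    read i overlaps the prefix of read j by at least `min_len` bases."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_len(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges

reads = ["ATTAGACCTG", "AGACCTGCCG", "CCTGCCGGAA"]
print(overlap_graph(reads))  # → {(0, 1): 7, (0, 2): 4, (1, 2): 7}
```

The layout step then orders reads along heavy paths of this graph, and the consensus step votes on each column, which is the three-phase structure the excerpt describes.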
Article
Full-text available
The nanopore sequencing process is based on the transit of a DNA molecule through a nanoscopic pore, and since the 1990s it has been considered one of the most promising approaches to detecting polymeric molecules. In 2014, Oxford Nanopore Technologies (ONT) launched a beta-testing program that supplied the scientific community with the first prototype of a nanopore sequencer: the MinION. Thanks to this program, several research groups had the opportunity to evaluate the performance of this novel instrument and develop novel computational approaches for analyzing this new generation of data. Despite the short time since the release of the MinION, a large number of algorithms and tools have been developed for base calling, data handling, read mapping, de novo assembly and variant discovery. Here, we address the main computational challenges related to the analysis of nanopore data, and we carry out a comprehensive and up-to-date survey of the algorithmic solutions adopted by the bioinformatics community, comparing performance and reporting the limits and advantages of using this new generation of sequences for genomic analyses. Our analyses demonstrate that the use of nanopore data dramatically improves the de novo assembly of genomes and allows for the exploration of structural variants with unprecedented accuracy and resolution. However, despite the impressive improvements achieved by ONT in the past 2 years, the use of these data for small-variant calling is still challenging, and at present it needs to be coupled with complementary short sequences to mitigate the intrinsic biases of nanopore sequencing technology.
... The PacBio long-read data were assembled using an overlap-layout-consensus method (Staden, 1980). First, the longer reads were selected and corrected, and these were then used to obtain a draft assembly. ...
Preprint
Cylas formicarius is one of the most important pests of sweet potato worldwide, causing considerable ecological and economic damage. To improve the effect of comprehensive management and understanding of genetic mechanisms, the genetic functions of C. formicarius have been the subject of intensive study. Using Illumina and PacBio sequencing, we obtained a chromosome-level genome assembly of adult weevils from lines inbred for 15 generations. The high-quality assembly obtained had a size of 338.84 Mb, with contig and scaffold N50 values of 14.97 Mb and 34.23 Mb, respectively. In total, 157.51 Mb of repeat sequences and 11,907 protein-coding genes were predicted. A total of 337.06 Mb of genomic sequences was located on the 11 chromosomes, and the sequence length that could be used to determine the sequence and direction accounted for 99.03% of the total length of the associated chromosome. Comparative genomic analysis showed that C. formicarius was sister to Dendroctonus ponderosae, and C. formicarius diverged from D. ponderosae approximately 138.89 million years ago (Mya). Many important gene families that were expanded in the C. formicarius genome were involved in the chemosensory system. In an in-depth study, the binding assay results indicated that CforOBP4-6 had strong binding affinities for sex pheromones and other ligands. Overall, the high-quality C. formicarius genome provides a valuable resource to reveal the molecular ecological basis, genetic mechanism and evolutionary process of major agricultural pests, deepen the understanding of environmental adaptability and apparent plasticity, and provide new ideas and new technologies for ecologically sustainable pest control.
... The number of short-read sequences produced from samples, like whole-genome DNA, RNA or metagenomic samples, can be on the order of billions. Construction of long contiguous sequences, also referred to as contigs (Staden, 1980), from the sets of reads is known as the fragment assembly problem, a central and long-standing problem in computational biology. Many short-read fragment assembly algorithms [BCALM (Chikhi et al., 2014), BCALM2 (Chikhi et al., 2016), Bruno (Pan et al., 2018), and deGSM (Guo et al., 2019)] use the de Bruijn graph to represent the input set of reads, and assemble fragments through graph compaction. ...
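The graph-compaction idea referenced here (as in the BCALM family) can be sketched in miniature: build a de Bruijn graph from k-mers, then merge maximal non-branching paths into unitigs. This toy version ignores reverse complements and isolated cycles, and the function names are invented:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Node = (k-1)-mer; a directed edge for every k-mer observed in a read."""
    succ, pred = defaultdict(set), defaultdict(set)
    for r in reads:
        for i in range(len(r) - k + 1):
            u, v = r[i:i + k - 1], r[i + 1:i + k]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred

def compact(succ, pred):
    """Merge maximal non-branching paths into unitigs."""
    nodes = set(succ) | set(pred)

    def simple(n):  # exactly one predecessor and one successor
        return len(pred[n]) == 1 and len(succ[n]) == 1

    unitigs = []
    for n in nodes:
        if not simple(n):  # paths can only start at branching/terminal nodes
            for v in succ[n]:
                path = n + v[-1]
                while simple(v):
                    v = next(iter(succ[v]))
                    path += v[-1]
                unitigs.append(path)
    return unitigs

succ, pred = de_bruijn(["ATGGCGTGCA"], 3)
print(sorted(compact(succ, pred)))
```

Each unitig spells a stretch of sequence that every consistent assembly must contain, which is why compacted graphs are reused as indices in tools such as Cuttlefish.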
Article
Full-text available
Motivation The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information Supplementary data are available at Bioinformatics online.
... The OLC model for Genome Sequence Assembly has received much attention since being proposed in 1980 [6,7]. We dwell only on the layout phase in this paper because it is at the heart of the DNA fragment assembly problem. Figure 1 shows how the OLC model for Genome Sequence Assembly works with an example. ...
Preprint
Full-text available
With the advent of Genome Sequencing, the field of Personalized Medicine has been revolutionized. From drug testing and studying diseases and mutations to clan genomics, studying the genome is required. However, genome sequence assembly is a very complex combinatorial optimization problem of computational biology. PSO is a popular metaheuristic swarm intelligence optimization algorithm used to solve combinatorial optimization problems. In this paper, we propose a new variant of PSO to address this permutation-optimization problem. PSO is integrated with Chaos and Levy Flight (a random-walk algorithm) to effectively balance the exploration and exploitation capabilities of the algorithm. Empirical experiments are conducted to evaluate the performance of the proposed method in comparison to other variants of PSO proposed in the literature. The analysis is conducted on four DNA coverage datasets and demonstrates that the proposed model attains better performance, with better reliability and consistency, than other competitive methods in all cases.
... As sequencing technologies are optimized for moderate- to high-coverage individual samples, metagenomic samples often result in different read coverage profiles across different genomes [53]. Due to these differences, contigs (a gapless stretch of nucleotide sequence generated by overlapping sequencing reads [54]) obtained from metagenomic samples are frequently short, resulting in genome assemblies that are fragmented and/or incomplete [55]. This is a non-negligible factor in the prediction accuracy of most tools, with short viral contigs (<10 kb) generally experiencing a significant drop in prediction accuracy [26,37,44]. ...
Article
Full-text available
Increased antibiotic resistance has prompted the development of bacteriophage agents for a multitude of applications in agriculture, biotechnology, and medicine. A key factor in the choice of agents for these applications is the host range of a bacteriophage, i.e., the bacterial genera, species, and strains a bacteriophage is able to infect. Although experimental explorations of host ranges remain the gold standard, such investigations are inherently limited to a small number of viruses and bacteria amenable to cultivation. Here, we review recently developed bioinformatic tools that offer a promising and high-throughput alternative by computationally predicting the putative host ranges of bacteriophages, including those challenging to grow in laboratory environments.
... The OLC algorithm is based on constructing an overlap graph by overlapping similar sequences. This approach was initially introduced in 1980 [7] and afterward extended and developed by many scientists. The first OLC assembler was introduced in 2000 [8] for Sanger data and was later updated for NGS data as well. ...
Article
Full-text available
De novo genome assemblers assume the reference genome is unavailable, incomplete, highly fragmented, or significantly altered as in cancer tissues. Algorithms for de novo assembly have been developed to deal with and assemble a large number of short sequence reads from genome sequencing. In this manuscript, we have provided an overview of the graph-theoretical side of de novo genome assembly algorithms. We have investigated the construction of fourteen graph data structures related to OLC-based and DBG-based algorithms in order to compare and discuss their application in different assemblers. In addition, the most significant and recent genome de novo assemblers are classified according to the extensive variety of original, generalized, and specialized versions of graph data structures.
... The word contig denotes a contiguous sequence, a term used in sequence assembly, whose task is to reconstruct a genome from sequencing data. It is the first appearance of what I call a tig sequence, from Staden [1980], which states: "In order to make it easier to talk about data gained by the shotgun method of sequencing, we have invented the word 'contig' [...]". Afterwards, other words with the tig suffix gradually appeared in the literature (although some of the sequences presented below do not have the tig suffix, they still fall within the scope). ...
Preprint
Full-text available
This manuscript is a tutorial on the tig sequences that emerged after the name "contig" and that serve diverse purposes in sequence bioinformatics. We review these different sequences (unitigs, simplitigs, monotigs, and omnitigs, to cite a few), give intuition about their construction and interest, and provide some examples of applications.
Thesis
This thesis concerns computational methods for processing DNA sequences produced by high-throughput sequencers. We focus mainly on the reconstruction of genomes from DNA fragments (genome assembly) and on related problems. These tasks combine very large amounts of data with combinatorial problems. Different graph structures are used to address these problems, presenting trade-offs between scalability and assembly quality. This document introduces several contributions to these problems. New representations of assembly graphs are proposed to allow better scaling. We also present new uses of these graphs beyond assembly, as well as tools for using them as references when a reference genome is not available. Finally, we show how to use these methods to produce a better assembly with reasonable resources.
Thesis
Amounts of data generated by Next Generation Sequencing technologies have increased exponentially in recent years. Storing, processing and transferring these data are increasingly challenging tasks, and to cope with them data scientists must develop ever more efficient approaches and techniques. In this thesis we present efficient data structures and algorithmic methods for the problems of approximate string matching, genome assembly, read compression and taxonomy-based metagenomic classification. Approximate string matching is an extensively studied problem with a countless number of published papers, both theoretical and practical. In bioinformatics, the read mapping problem can be regarded as approximate string matching. Here we study string matching strategies based on bidirectional indices. We define a framework, called search schemes, to work with search strategies of this type, provide a probabilistic measure for the efficiency of search schemes, prove several combinatorial properties of efficient search schemes, and provide experimental computations supporting the superiority of our strategies. Genome assembly is one of the basic problems of bioinformatics. Here we present the Cascading Bloom filter data structure, which improves on the standard Bloom filter and can be applied to several problems such as genome assembly. We provide theoretical and experimental results proving properties of the Cascading Bloom filter, and we also show how it can be used to solve another important problem, read compression. The last problem studied in this thesis is metagenomic classification. We present a BWT-based approach that improves the BWT-index for quick and memory-efficient k-mer search, focusing mainly on data structures that improve the speed and memory usage of the classical BWT-index for our application.
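The Cascading Bloom filter mentioned above builds on the standard Bloom filter. As a point of reference, a minimal sketch of the basic structure it extends is given below; this is illustrative only (the hash scheme and parameters are assumptions of the sketch, not the thesis implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_hashes hash functions over a bit array.

    A query may return a false positive but never a false negative --
    the property that cascading constructions exploit to represent
    k-mer sets compactly for assembly.
    """

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter(10_000, 4)
for kmer in ["ACGT", "CGTA", "GTAC"]:
    bf.add(kmer)
print("ACGT" in bf)  # True: inserted items are always reported present
```

A cascading variant layers several such filters so that the false positives of one level are recorded, and corrected, by the next.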
Thesis
Full-text available
In this thesis we investigate whether deep learning methods can also help solve problems in the fields of bioinformatics and neuroinformatics.
Chapter
Contents: Introduction; Statistical Analysis; Kinetic Analysis; Interfacing Computers with Analytical Equipment; Analysis of Macromolecular Structure; Computers and Microscopic Analysis; Computers and Electrophoresis; Computers and Analytical Ultracentrifugation; Summary
Article
We describe a program, tRNAscan-SE, which identifies 99-100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. Two previously described tRNA detection programs are used as fast, first-pass prefilters to identify candidate tRNAs, which are then analyzed by a highly selective tRNA covariance model. This work represents a practical application of RNA covariance models, which are general, probabilistic secondary structure profiles based on stochastic context-free grammars. tRNAscan-SE searches at approximately 30 000 bp/s. Additional extensions to tRNAscan-SE detect unusual tRNA homologues such as selenocysteine tRNAs, tRNA-derived repetitive elements and tRNA pseudogenes.
Chapter
A program package to analyse nucleotide sequences of DNA and RNA on a minicomputer is described. The package 'DNAW' is designed for rapid analysis of strictly known data, complementing the well-known programs of Rodger Staden for the analysis of partially known DNA sequences.
Chapter
The digestion of plant structural polysaccharides in the rumen is a major, but relatively inefficient, process in ruminant production which depends totally on the activity of microbial enzymes. The rate of polysaccharide digestion could be increased significantly by the genetic modification of rumen bacteria. The species chosen for such a genetic modification should maintain itself at a high population level after introduction into the rumen. Certain strains of Selenomonas ruminantium, which have been reintroduced into the rumen, can survive at a constant population level for relatively long periods of time, and are therefore suitable candidates for genetic modification. Cloning vectors will initially be developed from indigenous S. ruminantium plasmids.
Chapter
Eight years ago procedures such as pyrimidine tracts (1) and wandering spots (2) were the methods of choice for determining the sequence of short stretches of nucleic acids. The use of computers at that time to assist in the collection and analysis of these sequences would have seemed unnecessary. However, the development of more productive and technically simpler methods for DNA sequencing has caused a dramatic increase in the total volume of nucleic acid sequence data. For example, within a single journal (Nucleic Acids Research) published bi-weekly, newly determined sequence data, including more than 27,000 nucleotides, were published during the first five months of 1980. The accumulation of nucleotide sequence data in the near future will undoubtedly continue at an increased rate. Such large quantities of sequence data have provided the stimulus to establish centralized sequence storage banks.* The utilization of computers at these centers to store and catalog sequence data is one obvious but trivial way in which computer technology has been useful to those interested in nucleotide sequences.
Chapter
Among the 250 Type II restriction endonucleases now characterized, there are more than 70 different specificities, and yet there is no indication that the range of specificities is exhausted. Indeed, there is good reason to believe that hundreds, if not thousands, of different specificities would be found if a diligent search were carried out. One reason for this speculation is illustrated in Table 1, which shows the range of sequence patterns with which different Type II restriction endonucleases interact. Among the simple symmetric hexanucleotide sequences designated here as Class A, almost half of the possible sequence patterns are already represented by well-characterized enzymes. There is no reason to believe that a similar number of enzymes will not be found for the other patterns in Classes B through F. Similarly, it seems likely that enzymes recognizing degenerate patterns, like HgiAI and AccI, are not the sole representatives of the class. Within the last year alone, five new classes (C, D, F, N, and O) were added to this list.
Chapter
Full-text available
The human mitochondrial (mt) genome consists of a closed circular duplex DNA of approximately 10 × 10^6 daltons and has been the most intensely studied animal mt genetic system. The positions of the origin of replication of H strand synthesis (Crews et al. 1979), the 12S and 16S ribosomal RNA genes (Robberson et al. 1972) and 19 tRNA genes (Angerer et al. 1976) have been located on the genetic map shown in Figure 1. A number of discrete products of mitochondrial protein synthesis have been demonstrated and three of them identified as subunits 1, 2 and 3 of the cytochrome oxidase complex (Hare et al. 1980). In comparison with other mitochondrial systems, genes for up to four subunits of the ATPase complex, one of the cytochrome bc1 complex and possibly for a ribosomal protein would be expected to be present (see review by Borst 1977). Both strands are thought to be completely transcribed symmetrically from a point near the origin of the H strand synthesis (Aloni and Attardi 1971; Murphy et al. 1975). These transcripts are then processed to give the rRNAs, the tRNAs and a number of polyadenylated but not capped mRNAs (Attardi et al. 1979). Both the L and H strands have been shown to be coding, with the L strand containing the sense sequence of the rRNA genes, most of the tRNA genes and most of the stable polyadenylated mRNAs.
Chapter
Methods for managing large scale sequencing projects are available through the use of our GAP4 package, and the applications to which it can link are described. This main assembly and editing program also provides a graphical user interface to the assembly engines CAP3, FAKII, and PHRAP. Because of the diversity of working practices in the large number of laboratories where the package is used, these methods are very flexible and are readily tailored to suit local needs. For example, the Sanger Centre in the UK and the Whitehead Institute in the United States have both made major contributions to the human genome project using the package in different ways. The manual for the current (2001.0) version of the package is over 500 pages when printed, so this chapter is a brief overview of some of its most important components. We have tried to show a logical route through the methods in the package: pre-processing, assembly, contig ordering using read-pairs, contig joining using sequence comparison, assembly checking, automated experiment suggestions for extending contigs and solving problems, and ending with editing and consensus file generation. Before this overview, two important aspects of the package are outlined: the file formats used, and the displays and powerful user interface of GAP4. The package runs on UNIX and Microsoft Windows platforms, is entirely free to academic users, and can be downloaded from http://www.mrc-lmb.cam.ac.uk/pubseq.
Article
Full-text available
There is a growing demand for storage devices in the world, so there is a need to develop alternative data storage devices that can meet these growing demands. Thus, the concept of DNA data storage has emerged. In this mechanism, the given information is first converted into machine language, which is then converted into the DNA language of A, G, C and T. DNA is a highly compact molecule, and according to research at Harvard University, one gram of DNA can store up to 700 terabytes of information. DNA computing offers large storage capacity along with high accuracy of data retrieval, so there is little doubt that DNA computing can, in the near future, serve as one of the best alternatives to electronic storage devices. In this paper, we have developed programs that can be used to convert text data into the DNA language and vice versa. We have designed a number of programs using different logic and assumptions. In order to select the best program which is reliable and cost effective, we compared all the programs on the basis of compilation time and time complexity.
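The text-to-DNA conversion described above can be sketched as follows. This is an illustrative two-bits-per-base encoding assumed for the sketch, not the paper's actual programs, whose mappings may differ:

```python
# Assumed mapping: each pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "G", "10": "C", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def text_to_dna(text):
    """Encode text as a DNA string: UTF-8 bytes -> bit string -> bases."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_text(dna):
    """Decode a DNA string produced by text_to_dna back to text."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

encoded = text_to_dna("Hi")
print(encoded)               # GACAGCCG
print(dna_to_text(encoded))  # Hi
```

Since each byte maps to exactly four bases, the encoding is lossless and round-trips any text.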
Article
The soil microbiome is one of the most heterogeneous biological systems. State-of-the-art molecular approaches such as those based on single-amplified genomes (SAGs) and metagenome-assembled genomes (MAGs) are now improving our capacity for disentangling soil microbial-mediated processes. Here we analyzed publicly available datasets of soil microbial genomes and MAGs reconstructed from the Amazon's tropical soil (primary forest and pasture) and the active layer of permafrost, aiming to evaluate their genome size. Our results suggest that the Candidate Phyla Radiation (CPR)/Patescibacteria phyla have genomes with an average size 4-fold smaller than the mean identified in the RefSoil database, which lacks any representative of this phylum. Also, by analyzing the potential metabolism of 888 soil microbial genomes, we show that CPR/Patescibacteria representatives share similar functional profiles, but different from other microbial phyla, and are frequently neglected in soil microbial surveys. Finally, we argue that the use of MAGs may be a better choice over SAGs to expand soil microbial databases like RefSoil.
Thesis
Discovered more than a century ago, Salmonella has never ceased to intrigue researchers. Its ability to resist numerous antibiotics is increasingly worrying. Surveillance of this pathogen relies on rapid and discriminating typing, so that contaminated food sources can be identified as early as possible. Conventional methods are slow, cumbersome and cannot be automated. Understanding the emergence and evolution of Salmonella is the key to eradicating this pathogen, which remains one of the leading causes of foodborne bacterial diarrhoea worldwide. Over recent decades, spectacular progress has been made in microbiology with the arrival of benchtop sequencers, moving from the processing of some ten sequences to hundreds of millions. Easy access to genomic sequences, and to the tools dedicated to them, has become a necessity. The tools currently available are not discriminating enough to subtype S. enterica serotype Typhimurium (STM), the predominant serotype of Salmonella. In this work we sought to demonstrate the value of whole genome sequencing for the genomic study of Salmonella. (1) After sequencing more than 300 STM genomes, we developed an in silico subtyping tool for this serotype based on the polymorphism of CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) regions. High-throughput surveillance of salmonellosis was validated in routine use on more than 800 genomes. The study of the coevolution between the chromosome (SNPs) and the CRISPR regions allowed a nomenclature defining the different STM populations to be established. (2) Genomic analysis of 280 historical STM strains showed that plasmid-borne beta-lactamase genes conferring resistance to ampicillin were widespread in STM at the end of the 1950s, well before the use of this antibiotic.
The presence of penicillin G in agricultural settings, where these compounds were used as growth promoters, may have led to the selection of the first ampicillin-resistant strains. (3) The phylogenetic study of a genome from the corpse of a woman who died more than 800 years ago, probably of enteric fever, together with 219 historical and recent genomes of the serotypes Paratyphi C, Choleraesuis and Typhisuis, showed that their genomes have remained very similar over the last 4,000 years. Thus, the combination of genotypic and phylogenetic approaches has increased our knowledge of the evolution of this pathogen. Keywords: whole genome sequencing, epidemiological surveillance, CRISPR, SNP, antibiotic resistance, phylogeny, evolution
Article
Our previous studies have shown that spontaneously arising immunocytomas in the LOU/Ws1 strain of rats contain a t(6;7) chromosomal translocation in all seven tumors studied (F. M. Babonits, J. Spira, G. Klein, and H. Bazin, Int. J. Cancer 29:431-437, 1982). We have also shown that the c-myc is located on chromosome 7 (J. Sümegi, J. Spira, H. Bazin, J. Szpirer, G. Levan, and G. Klein, Nature (London) 306:497-499, 1983) and the immunoglobulin H cluster on chromosome 6 (W.S. Pear, G. Wahlström, J. Szpirer, G. Levan, G. Klein, and J. Sümegi, Immunogenetics 23:393-395, 1986). We now report a detailed cytogenetic and molecular analysis of nine additional rat immunocytomas. The t(6;7) chromosomal translocation is found in all tumors. Mapping of the c-myc breakpoints showed that in 10 of 14 tumors, the c-myc breakpoints are clustered in a 1.5-kilobase region upstream of exon 1. In contrast with sporadic Burkitt's lymphoma and mouse plasmacytoma, only 1 of 14 tumors contains the c-myc breakpoints in either exon 1 or intron 1. Analysis of the sequences juxtaposed to the c-myc show that immunoglobulin H switch regions are the targets in at least five tumors and that there is a strong correlation between the secreted immunoglobulin and the c-myc target. Unlike sporadic Burkitt's lymphoma and mouse plasmacytoma, at least two rat immunocytomas show recombination of the c-myc with sequences distinct from immunoglobulin switch regions.
Article
The sequence of a human beta-tubulin cDNA clone (D beta-1) is described; our data revealed 95.6% homology compared with the sequence of a human beta-tubulin processed pseudogene derived by reverse transcription of a processed mRNA (Wilde et al., Nature [London] 297:83-84, 1982). However, the amino acid sequence encoded by this cDNA showed less homology with pig and chicken beta-tubulin sequences than the latter did to each other, with major divergence within the 15 carboxy-terminal amino acids. On the other hand, an independently isolated, functionally expressed genomic human beta-tubulin sequence (5 beta) possessed a very high degree of homology with chicken and pig beta-tubulins in this region. Thus, human cells appear to contain two distinct beta-tubulin isotypes. Both the intact beta-tubulin cDNA clone and a subclone containing only the 3' untranslated region detected two mRNA species in HeLa cells; these mRNAs were 1.8 and 2.6 kilobases long and were present in about equal amounts. Two independently subcloned probes constructed from the 3' untranslated region of the 5 beta genomic sequence also detected a 2.6-kilobase beta-tubulin mRNA. However, the 3'-untranslated-region probes from the cDNA clone and the genomic sequence did not cross-hybridize. Thus, at least two human beta-tubulin genes, each specifying a distinct isotype, are expressed in HeLa cells, and the 2.6-kilobase mRNA band is a composite of at least two comigrating beta-tubulin mRNAs.
Article
The origin of introns and their role (if any) in gene expression, in the evolution of the genome, and in the generation of new expressed sequences are issues that are understood poorly, if at all. Multigene families provide a favorable opportunity for examining the evolutionary history of introns because it is possible to identify changes in intron placement and content since the divergence of family members from a common ancestral sequence. Here we report the complete sequence of the gene encoding the 68-kilodalton (kDa) neurofilament protein; the gene is a member of the intermediate filament multigene family that diverged over 600 million years ago. Five other members of this family (desmin, vimentin, glial fibrillary acidic protein, and type I and type II keratins) are encoded by genes with six or more introns at homologous positions. To our surprise, the number and placement of introns in the 68-kDa neurofilament protein gene were completely anomalous, with only three introns, none of which corresponded in position to introns in any characterized intermediate filament gene. This finding was all the more unexpected because comparative amino acid sequence data suggest a closer relationship of the 68-kDa neurofilament protein to desmin, vimentin, and glial fibrillary acidic protein than between any of these three proteins and the keratins. It appears likely that an mRNA-mediated transposition event was involved in the evolution of the 68-kDa neurofilament protein gene and that subsequent events led to the acquisition of at least two of the three introns present in the contemporary sequence.
Article
The Notch locus is essential for proper differentiation of the ectoderm in Drosophila melanogaster. Notch corresponds to a 37-kilobase transcription unit that codes for a major 10.4-kilobase polyadenylated RNA. The DNA sequence of this transcription unit is presented, except for portions of the two largest intervening sequences. DNA sequences also were obtained from three Notch cDNA clones, allowing the 5' and 3' ends of the gene to be mapped, and the structures and locations of nine RNA coding regions to be determined. The major Notch transcript encodes a protein of 2,703 amino acids. The protein is probably associated with cell surfaces and carries an extracellular domain composed of 36 cysteine-rich repeating units, each of about 38 amino acids. The gene appears to have evolved by repeated tandem duplications of the DNA coding for the 38-amino-acid-long protein segments, followed by insertion of intervening sequences. These repeating protein segments are quite homologous to portions of mammalian clotting factors IX and X and to the product of the Caenorhabditis elegans developmental gene lin-12. They are also similar to mammalian growth hormones, typified by epidermal growth factor.
Article
As shown by Southern blot analysis, the metallothionein-1 (MT-1) genes in rats comprise a multigene family. We present the sequence of the MT-1 structural gene and compare its features with other metallothionein genes. Three MT-1 pseudogenes which we sequenced apparently arose by reverse transcription of processed mRNA transcripts. Two of these, MT-1 psi a and MT-1 psi c, are retrogenes which derive from the MT-1 mRNA, having diverged from the MT-1 gene 6.9 and 2.6 million years ago, respectively. The third, MT-1 psi b, differs from the MT-1 cDNA by only three nucleotide alterations. Surprisingly, MT-1 psi b also preserves sequence homology for 142 base pairs 5' to the transcription initiation site of the parent gene; it contains a promoter sequence sufficient for specifying metal ion induction. We identified, by S1 nuclease mapping, an RNA polymerase II initiation site 432 base pairs 5' of the MT-1 transcription initiation site of the MT-1 structural gene which could explain the formation of the mRNA precursor to this pseudogene. We were unable to detect MT-1 psi b transcripts, either in liver tissue or after transfection. We conclude that the absence of detectable transcripts from this pseudogene is due to either a reduced level of transcription or the formation of unstable transcripts as a consequence of the lack of a consensus sequence normally found 3' of transcription termination in the MT-1 structural gene.
Article
The nucleotide sequence and intron-exon structure of the Drosophila melanogaster vermilion (v) gene have been determined. In addition, the sites of several mutations and the effects of these mutations on transcription have been examined. The major v mRNA is generated upon splicing six exons of lengths (5' to 3') 83, 161, 134, 607, 94, and 227 nucleotides (nt). A minor species of v mRNA is initiated at an upstream site and has a 5' exon of at least 152 nt which overlaps the region included in the 83-nt exon of the major v RNA. The three v mutations, v1, v2, and vk, which can be suppressed by mutations at suppressor of sable, su(s), are insertions of transposon 412 at the same position in exon 1, 36 nt downstream of the major transcription initiation site. Despite the 7.5-kilobase insertion in these v alleles, a reduced level of wild-type-sized mRNA accumulates in suppressed mutant strains. The structure and transcription of several unsuppressible v alleles have also been examined. The v36f mutation is a B104/roo insertion in intron 4 near the splice donor site. A mutant carrying this alteration accumulates a very low level of mRNA that is apparently polyadenylated at a site within the B104/roo transposon. The v48a mutation, which deletes approximately 200 nt of DNA, fuses portions of exons 3 and 4 without disruption of the translational reading frame. A smaller transcript accumulates at a wild-type level, and thus an altered, nonfunctional polypeptide is likely to be synthesized in strains carrying this mutation.
Article
Several P element insertion and deletion mutations near the 5' end of Drosophila melanogaster RpII215 have been examined by nucleotide sequencing. Two different sites of P element insertion, approximately 90 nucleotides apart, have been detected in this region of the gene. Therefore, including an additional site of P element insertion within the coding region, there are at least three distinct sites of P element insertion at RpII215. Both 5' sites are within a noncoding portion of transcribed sequences. The sequences of four revertants of one P element insertion mutation (D50) indicate that the P element is either precisely deleted or internally deleted to restore RpII215 activity. Partial internal deletions of the P element result in different RpII215 activity levels, which appear to depend on the specific sequences that remain after excision.
Article
Full-text available
The myc family of genes contains five functional members. We describe the cloning of a new member of the myc family from rat genomic and cDNA libraries, designated B-myc. A fragment of cloned B-myc was used to map the corresponding rat locus by Southern blotting of DNA prepared from rat X mouse somatic cell hybrids. B-myc mapped to rat chromosome 3. We have previously mapped the c-myc to rat chromosome 7 (J. Sümegi, J. Spira, H. Bazin, J. Szpirer, G. Levan, and G. Klein, Nature [London] 306:497-498, 1983) and N-myc and L-myc to rat chromosomes 6 and 5, respectively (S. Ingvarsson, C. Asker, Z. Wirschubsky, J. Szpirer, G. Levan, G. Klein, and J. Sümegi, Somat. Cell Mol. Genet. 13:335-339, 1987). A partial sequence of B-myc had extensive sequence homology to the c-myc protein-coding region, and the detection of intron homology further indicated that these two genes are closely related. The DNA regions conserved among the myc family members, designated myc boxes, were highly conserved between c-myc and B-myc. A lower degree of homology was detected in other parts of the coding region in c-myc and B-myc not present in N-myc and L-myc. A 1.3-kilobase B-myc-specific mRNA was detected in most rat tissues, with the highest expression in the brain. This resembled the expression pattern of c-myc, although at different relative levels, and was in contrast to the more tissue-specific expression of N-myc and L-myc. B-myc was expressed at uniformly high levels in all fetal tissues and during subsequent postnatal development, in contrast to the stage-specific expression of c-myc.
Article
To examine the sequence complexity and differential expression of human alpha-tubulin genes, we constructed cDNA libraries from two unrelated tissue types (epidermis and fetal brain). The complete sequence of a positively hybridizing alpha-tubulin clone from each library is described. Each is shown to represent an abundantly expressed gene from fetal brain and keratinocytes, respectively. Although the coding regions are extensively homologous (97%), the 3' untranslated regions are totally dissimilar. This property has been used to dissect the human alpha-tubulin multigene family into members bearing sequence relatedness in this region. Surprisingly, each of these noncoding regions shares very high (65 to 80%) interspecies homology with the 3' untranslated region of one of the two rat alpha-tubulin genes of known sequence. These unexpected homologies imply the existence of selective pressure on the 3' untranslated regions of some cytoskeletal genes which maintains sequence fidelity during the course of evolution, perhaps as a consequence of an as yet unidentified functional requirement.
Preprint
Full-text available
Metagenomics facilitates the study of the genetic information from uncultured microbes and complex microbial communities. Assembling complete microbial genomes (i.e., circular with no misassemblies) from metagenomics data is difficult because most samples have high organismal complexity and strain diversity. Only 63 circularized bacterial and archaeal genomes have been assembled from metagenomics data despite the thousands of datasets that are available. Circularized genomes are important for (1) building a reference collection as scaffolds for future assemblies, (2) providing complete gene content of a genome, (3) confirming little or no contamination of a genome, (4) studying the genomic context and synteny of genes, and (5) linking protein coding genes to ribosomal RNA genes to aid metabolic inference in 16S rRNA gene sequencing studies. We developed a method to achieve circularized genomes using iterative assembly, binning, and read mapping. In addition, this method exposes potential misassemblies from k-mer based assemblies. We chose species of the Candidate Phyla Radiation (CPR) to focus our initial efforts because they have small genomes and are only known to have one copy of ribosomal RNA genes. We present 34 circular CPR genomes, one circular Margulisbacteria genome, and two circular megaphage genomes from 19 public and published datasets. We demonstrate findings that would likely be difficult without circularizing genomes, including that ribosomal genes are likely not operonic in the majority of CPR, and that some CPR harbor diverged forms of RNase P RNA.
Chapter
DNA fragment assembly (DFA) is one of the most important and challenging problems in computational biology. DFA problem involves reconstruction of target DNA from several hundred (or thousands) of sequenced fragments by identifying the proper orientation and order of fragments. DFA problem is proved to be a NP-Hard combinatorial optimization problem. Metaheuristic techniques have the capability to handle large search spaces and therefore are well suited to deal with such problems. In this chapter, quantum-inspired genetic algorithm-based DNA fragment assembly (QGFA) approach has been proposed to perform the de novo assembly of DNA fragments using overlap-layout-consensus approach. To assess the efficacy of QGFA, it has been compared genetic algorithm, particle swarm optimization, and ant colony optimization-based metaheuristic approaches for solving DFA problem. Experimental results show that QGFA performs comparatively better (in terms of overlap score obtained and number of contigs produced) than other approaches considered herein.
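The overlap-layout-consensus idea underlying these approaches can be illustrated with a toy greedy assembler (a simplification for intuition only, not the QGFA algorithm): repeatedly merge the pair of fragments with the largest suffix-prefix overlap until no overlaps remain.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(fragments):
    """Toy greedy layout: merge the best-overlapping ordered pair of
    fragments until none overlap, yielding one or more contigs."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining fragments are separate contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags)
                 if k not in (best_i, best_j)] + [merged]
    return frags

print(greedy_assemble(["ACGTAC", "GTACGG", "ACGGTT"]))  # ['ACGTACGGTT']
```

Maximising the overlap score at each merge is exactly the objective the metaheuristics above optimise globally instead of greedily, which is why they can escape the local optima this greedy layout gets stuck in.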
Article
Full-text available
The lc gene of the lambdoid bacteriophage PA-2 and the nmpC gene located on a defective lambdoid prophage in the 12-min region of the Escherichia coli K12 chromosome have been sequenced. The porin proteins encoded by these two genes were almost identical, with only 4 of the 365 residues of the precursor forms of the proteins being different. The Lc and NmpC proteins were strongly homologous to the OmpC, OmpF, and PhoE proteins, with greater than 56% of the residues identical in each case. Sequencing of the region flanking the lc gene allowed precise positioning of this gene with respect to the rightward cos site of the phage and to sequences which are homologous between PA-2 and lambda. In wild-type strains of E. coli K12, the nmpC gene is not expressed and contains an IS5 insertion near the 3' end of the coding region. This insertion deletes 18 residues from the COOH terminus of NmpC protein and adds 8 residues from an open reading frame extending into IS5 sequence. Expression of this form of the gene in an expression vector plasmid demonstrated that this altered protein is still capable of being translocated to the outer membrane. Plasmid expression experiments using lc-nmpC hybrid genes show that it is the presence of the IS5 insertion which prevents expression of the porin in wild-type E. coli K12. In the nmpC mutant which expresses the protein, there has been a precise excision of the IS5 which regenerates a COOH terminus of NmpC protein which is identical to that of the Lc protein. Blot hybridization detected no mRNA transcripts from the wild-type nmpC gene, although transcripts were readily detected from the lc gene in PA-2 lysogens and from the nmpC mutant which has excised the IS5. This indicates that IS5 affects the production or stability of transcripts from the adjacent nmpC gene.
Preprint
Full-text available
Motivation: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, designing time- and memory-efficient algorithms for the construction of this graph from reference sequences is an important problem. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata; it constrains these automata's state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On our test hardware, Cuttlefish constructed the graph for 100 human genomes in under 9 hours, using ~29 GB of memory, while no other tested tool completed this task. On 11 diverse conifer genomes, the compacted graph was constructed by Cuttlefish in under 9 hours, using ~84 GB of memory, while the only other tested tool that completed this construction on our hardware took over 16 hours and ~289 GB of memory. Availability: Cuttlefish is written in C++14 and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Contact: rob@cs.umd.edu. Supplementary information: Supplementary text is available at Bioinformatics online.
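For readers unfamiliar with the data structure, the core idea of compacting a de Bruijn graph (collapsing maximal non-branching k-mer paths into unitigs) can be sketched in a few lines of Python. This is a toy, single-strand version only: it ignores reverse complements, the automaton-based state modeling, and colors, all of which Cuttlefish itself handles; the function name `compacted_unitigs` is ours, not from the tool.

```python
from collections import defaultdict

def compacted_unitigs(seqs, k):
    """Collapse maximal non-branching paths of the k-mer de Bruijn
    graph into unitigs (single strand only; no reverse complements)."""
    succ, pred, kmers = defaultdict(set), defaultdict(set), set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            km = s[i:i + k]
            kmers.add(km)
            if i + k < len(s):
                nxt = s[i + 1:i + k + 1]
                succ[km].add(nxt)
                pred[nxt].add(km)
    unitigs = []
    for km in kmers:
        # A unitig starts where the walk cannot be extended leftwards:
        # in-degree != 1, or the unique predecessor branches rightwards.
        if len(pred[km]) == 1 and len(succ[next(iter(pred[km]))]) == 1:
            continue
        path, cur = km, km
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if len(pred[nxt]) != 1 or nxt == km:  # stop at joins/cycles
                break
            path += nxt[-1]
            cur = nxt
        unitigs.append(path)
    return sorted(unitigs)
```

A linear chain of k-mers collapses to a single unitig, while a branch point splits the output, which is exactly the property that makes the compacted graph a compact sequence index.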
Chapter
DNA fragment assembly (DFA) is one of the most important and challenging problems in computational biology. The DFA problem involves reconstruction of the target DNA from several hundred (or thousands) of sequenced fragments by identifying the proper orientation and order of the fragments. The DFA problem is proven to be an NP-hard combinatorial optimization problem. Metaheuristic techniques have the capability to handle large search spaces and are therefore well suited to such problems. In this chapter, a quantum-inspired genetic algorithm-based DNA fragment assembly (QGFA) approach is proposed to perform de novo assembly of DNA fragments using the overlap-layout-consensus approach. To assess the efficacy of QGFA, it has been compared with genetic algorithm-, particle swarm optimization-, and ant colony optimization-based metaheuristic approaches for solving the DFA problem. Experimental results show that QGFA performs comparatively better (in terms of overlap score obtained and number of contigs produced) than the other approaches considered herein.
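The overlap and layout steps of the overlap-layout-consensus approach can be illustrated with a minimal greedy sketch in Python, assuming exact suffix/prefix overlaps and no sequencing errors. This is not the QGFA algorithm itself; the names `overlap` and `greedy_layout` are illustrative.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that matches a prefix
    of `b` (at least `min_len` bases, else 0)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_layout(fragments):
    """Layout step: greedily append the fragment with the best
    suffix/prefix overlap, merging the overlapping bases (a crude
    stand-in for the consensus step)."""
    frags = list(fragments)
    contig = frags.pop(0)
    while frags:
        best_i, best_ov = 0, 0
        for i, f in enumerate(frags):
            ov = overlap(contig, f)
            if ov > best_ov:
                best_i, best_ov = i, ov
        contig += frags.pop(best_i)[best_ov:]
    return contig
```

A metaheuristic such as QGFA replaces the greedy choice with a search over fragment permutations, scoring each permutation by its total overlap.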
Article
The DNA Fragment Assembly Problem (FAP) is concerned with the reconstruction of the target DNA, using the several hundreds (or thousands) of sequenced fragments, by identifying the right order and orientation of each fragment in the layout. Several algorithms have been proposed for solving the FAP. Most of these have dwelt solely on the single objective of maximizing the sum of the overlaps between adjacent fragments in order to optimize the fragment layout. This paper formulates the FAP as a bi-objective optimization problem, with the two objectives being the maximization of the overlap between adjacent fragments and the minimization of the overlap between distant fragments. Moreover, since fewer contigs are more desirable, the FAP becomes a tri-objective optimization problem in which minimization of the number of contigs is the additional objective. These problems were solved using the multi-objective genetic algorithm NSGA-II. The experimental results show that the NSGA-II-based Bi-Objective Fragment Assembly Algorithm (BOFAA) and Tri-Objective Fragment Assembly Algorithm (TOFAA) are able to produce better-quality layouts than those generated by the GA-based Single-Objective Fragment Assembly Algorithm (SOFAA). Further, the layouts produced by TOFAA are also comparatively better than those produced using BOFAA.
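The two layout objectives described in the abstract can be written down concretely. The sketch below assumes "overlap" means the longest exact suffix/prefix match; `pairwise_overlap` and `objectives` are illustrative names, not the paper's implementation.

```python
def pairwise_overlap(a: str, b: str) -> int:
    """Longest suffix of `a` equal to a prefix of `b` (O(n^2) scan)."""
    best = 0
    for k in range(1, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def objectives(layout):
    """F1: total overlap between adjacent fragments (to maximise);
    F2: total overlap between distant, non-adjacent fragments
    (to minimise), summed over ordered pairs."""
    n = len(layout)
    f1 = sum(pairwise_overlap(layout[i], layout[i + 1])
             for i in range(n - 1))
    f2 = sum(pairwise_overlap(layout[i], layout[j])
             for i in range(n) for j in range(n)
             if abs(i - j) > 1)
    return f1, f2
```

NSGA-II would then evolve permutations of the fragments, ranking them by Pareto dominance over (F1, F2), with the contig count as a third objective in the tri-objective variant.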
Chapter
The availability of the chromosome-scale axolotl genome sequences has made it possible to explore genome evolution, perform cross-species comparisons, and use additional sequencing data to analyze both genome-wide features and individual genes. Here, we will focus on the UCSC genome browser and demonstrate in a step-by-step manner how to use it to integrate different data to approach a broad question of Fgf8 locus evolution and analyze the neighborhood of a gene that was reported missing in axolotl, Pax3.
Keywords: Genome Browser, Synteny, Custom Tracks, Genome Evolution, Annotation, Gene Expression
Article
Full-text available
A collection of user-interactive computer programs is described which aid in the assembly of DNA sequences. This is achieved by searching for the positions of overlapping common nucleotide sequences within the blocks of sequence obtained as primary data. Such overlapping segments are then melded into one continuous string of nucleotides. Strategies for determining the accuracy of the sequence being analyzed and reducing the error rate resulting from the manual manipulation of sequence data are discussed. Sequences mapping from 97.3 to 100% of the Ad2 virus genome were used to demonstrate the performance of these programs.
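The core operation these programs perform, detecting an overlapping common nucleotide sequence between two blocks of primary data and melding them into one continuous string, can be sketched as follows. This is a simplified model assuming exact matches and no sequencing errors; `meld` is an illustrative name, not one of the programs described.

```python
def meld(read_a: str, read_b: str, min_overlap: int = 5):
    """Meld two sequence blocks into one continuous string if they
    share a suffix/prefix overlap of at least `min_overlap` bases,
    trying the longest overlap first and both orientations of the
    pair. Returns None if no sufficient overlap exists."""
    for k in range(min(len(read_a), len(read_b)), min_overlap - 1, -1):
        if read_a[-k:] == read_b[:k]:
            return read_a + read_b[k:]
        if read_b[-k:] == read_a[:k]:
            return read_b + read_a[k:]
    return None  # reads stay in separate contigs
```

Repeatedly melding the pair with the longest shared segment, and flagging positions where the overlapping copies disagree for manual review, mirrors the accuracy-checking strategy the abstract describes.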
Article
With modern fast sequencing techniques [1,2] and suitable computer programs it is now possible to sequence whole genomes without the need for restriction maps. This paper describes computer programs that can be used to order both sequence gel readings and clones. A method of coding for uncertainties in gel readings is described. These programs are available on request.
Article
The speed of the new DNA sequencing techniques has created a need for computer programs to handle the data produced. This paper describes simple programs designed specifically for use by people with little or no computer experience. The programs are for use on small computers and provide facilities for storage, editing and analysis of both DNA and amino acid sequences. A magnetic tape containing these programs is available on request.
Article
A previous paper [1] described programs for sequence data handling and analysis by computer. The facilities of this basic set are extended by further easily used programs.
Article
This paper describes a computer program that can find tRNA genes within long DNA sequences. The program obviates the need to map the tRNA genes.