Figure 1 - uploaded by Seung-jong Park
Source publication
Genome sequencing technology has witnessed tremendous progress in terms of throughput as well as cost per base pair, resulting in an explosion in the size of data. Consequently, typical sequence assembly tools demand a lot of processing power and memory and are unable to assemble big datasets unless run on hundreds of nodes. In this paper, we prese...
Contexts in source publication
Context 1
... to keep in tune with the decreasing cost of sequencing, assembly tools need to become more memory efficient. Figure 1 categorizes representative assemblers by memory utilization and scalability. On one hand, we have the first-generation assemblers: multithreaded applications that run on a single node but require terabytes of memory to assemble the larger genomes. ...
Context 2
... effect can be seen in Figure 10, where partition id 0 clearly stands out from the rest. This is an inherent disadvantage of using the minimum substring partitioning scheme, something that comes as a compromise for the reduced memory footprint. ...
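To make the skew concrete, here is a minimal sketch of minimum substring partitioning, assuming lexicographic ordering and an illustrative substring length `p` (the actual parameters and implementation are not shown in this excerpt): every k-mer is routed to the partition named by its smallest p-length substring, so a frequent substring such as a poly-A run inflates a single partition.

```python
# Illustrative sketch of minimum substring partitioning (MSP).
# Each k-mer goes to the partition named by its lexicographically
# smallest p-length substring; frequent substrings inflate one partition.
from collections import Counter

def min_substring(kmer, p):
    """Lexicographically smallest p-length substring: the partition key."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

def partition_sizes(kmers, p):
    """Count how many k-mers land in each partition."""
    return Counter(min_substring(kmer, p) for kmer in kmers)

# Both poly-A-heavy k-mers collapse into the "AAA" partition.
sizes = partition_sizes(["AAACG", "AAAGT", "CGTAC"], 3)
```

Because many genomic k-mers share low-complexity minimizers, one partition (here "AAA") grows disproportionately, which is exactly the imbalance visible for partition id 0.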
Context 3
... as mentioned earlier, the significant imbalance in the size of the partitions can easily overwhelm the hardware resources if the partitions are not carefully distributed among the cluster nodes. In Figure 10, we show the distribution of intermediate key-value pairs among peers after the intermediate data is shuffled. It can be observed that the larger partitions are scattered across different nodes in a fairly uniform manner, so that no single node is assigned more than one. ...
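The careful distribution described above can be approximated with a largest-first greedy heuristic. The following is a hypothetical sketch (the excerpt does not state the paper's actual scheduling policy): each partition, in decreasing order of size, is placed on the currently least-loaded node, so the largest partitions end up on different nodes.

```python
# Hypothetical sketch: largest-first greedy placement of imbalanced
# partitions onto cluster nodes, so no node receives more than one of
# the dominant partitions.
import heapq

def distribute(partition_sizes, num_nodes):
    """Map partition id -> node id, assigning largest partitions first."""
    # Min-heap of (current_load, node_id); the least-loaded node wins.
    heap = [(0, n) for n in range(num_nodes)]
    heapq.heapify(heap)
    assignment = {}
    for pid in sorted(partition_sizes, key=partition_sizes.get, reverse=True):
        load, node = heapq.heappop(heap)
        assignment[pid] = node
        heapq.heappush(heap, (load + partition_sizes[pid], node))
    return assignment

# Partition 0 dominates, mirroring the skew seen in Figure 10.
plan = distribute({0: 900, 1: 120, 2: 110, 3: 100, 4: 95}, 2)
```

In this toy run the oversized partition 0 gets node 0 to itself while the smaller partitions share node 1, keeping the per-node load roughly balanced.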
Similar publications
As software permeates more and more aspects of daily life and becomes a central component of critical systems around the world, software quality and effective methods to ensure it are paramount. There is a huge variety of both static and dynamic analyses that aim to provide such guarantees. Typically, such analyses are based on the analysed program...
The aim of this study is to examine the use of the short text message (SMS) service found in mobile phones for programming training, in the context of the personalization principle. The study was conducted with 74 students enrolled in an Assembler course and used an experimental model with a posttest control group. The students were divided into...
Although applications that use 3D models are more and more numerous, the creation of the 3D models themselves remains a tedious task. Example-based modeling makes this step more efficient by allowing the user to exploit existing models and assemble them to obtain a new model. The metho...
This paper describes the implementation of a garbage collector as an undergraduate research project. The garbage collector is a continuation of a project in which an Assembler, Virtual Machine, and Compiler were implemented as a capstone project. The project required modifying the compiler to allocate memory compatible with a mark-and-sweep algorithm,...
Citations
... Among various kinds of data, graph data are getting a lot of attention because graphs are everywhere (e.g., online social networks, brain networks, transportation networks) and, more importantly, people can get deeper insights into big data based on the explicit and implicit relationships among real-world entities. For example, in bioinformatics, scientists are building a De Bruijn graph or an overlap graph to construct a whole genome sequence based on short reads generated from a next-generation sequencing machine [9,32]. ...
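As an illustration of that bioinformatics example, a De Bruijn graph can be built from short reads in a few lines. This is a simplified sketch only; production assemblers use compact, distributed representations rather than a plain dictionary.

```python
# Simplified De Bruijn graph construction from short reads.
# Nodes are (k-1)-mers; each k-mer in a read contributes one edge
# from its prefix (k-1)-mer to its suffix (k-1)-mer.
from collections import defaultdict

def de_bruijn(reads, k):
    """Return adjacency lists: prefix (k-1)-mer -> list of suffix (k-1)-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

g = de_bruijn(["ACGT", "CGTA"], 3)
```

Overlapping reads reinforce shared edges (here "CG" -> "GT" appears twice), which is what lets an assembler walk the graph to reconstruct the underlying sequence.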
Despite having several distributed graph processing frameworks, scalable iterative processing of large graphs is a challenging problem since the graph and intermediate data need a global view of the graph topology in distributed memory. Although some systems support out-of-core iterative computations, they use a single machine and often require fast storage. In this paper, we present a new distributed iterative graph computation framework, called GraphMap, that utilizes a disk-based NoSQL database system for scalable graph processing while ensuring competitive performance. Extensive experiments on several real-world graphs show that GraphMap is more scalable and often faster than existing distributed memory-based systems for various graph processing workloads.
... Since they can generate a huge number of short reads at a significantly reduced cost and a high throughput, many reference-free error correction tools have been developed based on the fact that the k-mers resulting from an error base have a significantly lower coverage than the actual k-mers, such as Quake [4], Reptile [5], Hammer [6], RACER [7], Coral [8], Lighter [9], Musket [10], Shrec [11], DecGPU [12], Echo [13], and ParSECH [14]. In addition to the error correction tools, there are many genome assembly tools for large-scale short reads, including MPI-based tools (e.g., SWAP [15]), Hadoop-based tools (e.g., GiGA [16]), extreme-scale assemblers (e.g., HipMer [17], Lazer [18]), and GPU-accelerated tools (e.g., LaSAGNA [19]). ...
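The coverage heuristic those correctors share can be sketched as follows. This is illustrative only: `min_count` is a hypothetical threshold, and the listed tools use far more elaborate statistics and correction logic.

```python
# Sketch of the coverage heuristic behind k-mer-based error correction:
# k-mers introduced by a sequencing error occur far less often than
# true k-mers, so low-coverage k-mers are flagged as error candidates.
from collections import Counter

def weak_kmers(reads, k, min_count=2):
    """Return k-mers whose coverage falls below min_count (error candidates)."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, c in counts.items() if c < min_count}

# The third read carries an error (T instead of A at position 4), so the
# k-mers spanning that base appear only once and are flagged.
suspects = weak_kmers(["ACGTAC", "ACGTAC", "ACGTTC"], 3)
```

A real corrector would then try to replace the erroneous base so that all spanning k-mers become high-coverage again.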
... Many distributed assemblers have also emerged to expedite the assembly of high-throughput sequencing datasets. Notable among them are Ray [29] and SWAP [30], built on MPI; Lazer [31], which uses ZeroMQ; Contrail [32] and Giga [33], based on Hadoop; and HipMer [34], based on the global address space model. Among the string graph-based assemblers, SGA [35] is the only one that can process large datasets on a single node using compressed data structures. ...