GeneJax: A Prototype CAD tool in support of Genome Refactoring
MIT Department of Electrical Engineering and Computer Science
Undergraduate Advanced Project
May 18, 2006
Assistant Professor, Biological Engineering
Ph.D. Candidate, Biological Engineering
Abstract: Refactoring is a technique used by computer scientists for improving program design.
The Endy Laboratory has adapted this process to make the genomes of biological organisms
more amenable to human understanding and design goals. To assist in this endeavor, we
developed GeneJax, a prototype CAD tool that supports the dissection and
visualization stages of the genome refactoring process. This paper reviews key genome
refactoring concepts and then discusses the features, development history, user-interface, and
underlying implementation issues faced during the making of GeneJax. In addition, we provide
recommendations for future GeneJax development. This paper may be of interest to engineers
of CAD tools for synthetic biology.
Introduction: Genome Refactoring
Refactoring is a technique traditionally used by software engineers to redesign computer
software (Fowler et al., 1999). By refactoring, engineers modify the design of an existing
program without adding new features or functionality. Instead, the new design improves program
readability and maintainability:
Refactoring does not fix bugs or add new functionality. Rather it is designed to improve the
understandability of the code or change its structure and design, and remove dead code, to make it
easier for human maintenance in the future. In particular, adding new behavior to a program might
be difficult with the program's given structure, so a developer might refactor it first to make it easy,
and then add the new behavior. (Wikipedia contributors, 2006)
Inspired by this technique, the MIT Endy Lab has begun refactoring genomes. In particular, the
team seeks to precisely understand how the encoded genetic elements interact with one another
to generate an entire functioning organism. However, the sequences encoded by natural
genomes are usually hard to study for a variety of reasons. For example, in natural genomes
multiple genes and functions often overlap on the same DNA sequence making it hard to
determine each gene's contribution to the organism. One way to simplify such a genome would
be to engineer a new genome where each region of the DNA sequence encodes a single genetic
function. By isolating these functional elements from one another, each element's function
becomes easier to model and simpler to manipulate. As a result, this "refactored" genome is
easier for humans to understand and modify.
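As a minimal sketch of the overlap problem described above, the following Python snippet reports which annotated elements share bases on the same sequence. The `Element` class, the `find_overlaps` helper, and the coordinates are hypothetical and chosen only for illustration; they are not taken from the T7 annotations or from the GeneJax code.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Element:
    """A hypothetical annotated genetic element; coordinates are 1-based, inclusive."""
    name: str
    start: int
    end: int


def find_overlaps(elements):
    """Return (name_a, name_b, shared_bases) for every pair of overlapping elements."""
    overlaps = []
    for a, b in combinations(sorted(elements, key=lambda e: e.start), 2):
        # Number of bases that both elements cover; zero or negative means no overlap.
        shared = min(a.end, b.end) - max(a.start, b.start) + 1
        if shared > 0:
            overlaps.append((a.name, b.name, shared))
    return overlaps


# Made-up coordinates for illustration only; not real T7 annotations.
elements = [
    Element("geneA", 1, 120),
    Element("geneB", 100, 250),  # shares bases 100-120 with geneA
    Element("geneC", 300, 400),
]
overlaps = find_overlaps(elements)
print(overlaps)
```

In a refactored genome of the kind described here, a check like this would be expected to return an empty list, since each region encodes a single function.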
For example, the Endy Lab recently disassembled the genome of bacteriophage T7, a bacterial
virus, and constructed a new man-made “T7.1” virus sequence (Chan et al., 2005). This T7.1
virus was functionally similar to the natural T7 virus but with a more structured design that
removed overlaps between genetic elements. Removing these overlaps allowed the team to
independently manipulate each specific gene element, and by doing so, fully describe its
function. Figure 1 compares a region from the T7 virus and its refactored T7.1 variant,
illustrating the improved isolation between genetic elements in the refactored version.
Figure 1: Refactoring the T7 bacteriophage virus. The figure shows the natural T7 genome
(top) and a region on the natural T7 genome that has been magnified to show the genetic
parts (middle). The Endy Lab refactored this region by isolating these genetic parts and
reassembling them without any overlaps (bottom). Image is not to scale. Adapted from
Chan et al., 2005.
The process of genome refactoring can be broken down into five steps (visualization, dissection,
editing, synthesis, comparison) outlined in Figure 2. First, a naturally occurring genome and its
annotations are examined during the visualization phase. These annotations indicate the known
functionality of various sections in the genome DNA sequence. Using this knowledge, the genome
is dissected into parts that isolate each gene function from the natural genome. Since naturally
occurring sequences often have overlapping functional elements, the resulting parts may overlap
or contain sequences for additional, unintended functions. These naturally occurring parts must
therefore be edited to refine their isolation from other genetic elements. The refined parts are
then synthesized into a refactored genome. Additional parts may come from other refactored
genomes as shown in the figure. Finally, the naturally occurring and refactored genomes may be
compared to examine where the genome has and has not
been changed. The refactoring process described here is adapted from the genome design
algorithm presented in the supplementary materials of Chan et al. (2005).
Figure 2: The Genome Refactoring process. First, a natural genome is visualized and then
dissected into raw parts. Next, these raw parts are edited to refine their function. The
refined parts are then combined with parts from other genomes to synthesize a new
refactored genome. Finally, the refactored genome and natural genome are compared to
examine where the genome has and has not been changed. Adapted from Chan et al., 2005.
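The dissection, synthesis, and comparison steps can be illustrated with a short Python sketch. It assumes that annotations are given as 1-based, inclusive coordinates and that synthesis is a simple end-to-end concatenation. The function names, the toy sequence, and the annotations are hypothetical. The editing step is left out because it depends on biological judgment about each part.

```python
def dissect(genome, annotations):
    """Cut one raw part per annotation; coordinates are 1-based, inclusive."""
    return {name: genome[start - 1:end] for name, (start, end) in annotations.items()}


def synthesize(parts, order):
    """Join parts end to end in the requested order."""
    return "".join(parts[name] for name in order)


# Toy sequence and made-up annotations, for illustration only.
natural = "ATGAAACCCGGGTTTTAA"
annotations = {"partA": (1, 9), "partB": (7, 18)}  # the parts share bases 7-9

# Dissection: each raw part keeps its full annotated sequence,
# so the shared bases appear in both parts.
raw_parts = dissect(natural, annotations)

# Synthesis: each part now occupies its own region of the new sequence.
refactored = synthesize(raw_parts, ["partA", "partB"])

# Comparison: the refactored sequence is longer by the duplicated overlap.
extra_bases = len(refactored) - len(natural)
print(raw_parts, refactored, extra_bases)
```

Because the shared bases are copied into both parts, the refactored toy sequence is three bases longer than the natural one, which is the kind of difference the comparison step is meant to expose.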
The Endy Lab has now begun work on a “T7.2” virus and is looking for software tools to assist in
this process. Instead of rearranging existing genetic elements within the organism (as was done
with T7.1), the T7.2 design will actively remove gene elements and substitute other elements
with new genes from other genomes:
Moving beyond our design of T7.1, we will actively erase or delete elements of unknown function. In
addition, efforts will be made to made to remove unknown genetic elements... To attempt to make
our modeling of gene expression easier, we will use standard synthetic elements in place of the
natural elements that regulate transcription and translation. (T7.2, 2005)
The original T7.1 virus was designed using existing tools such as Vector NTI and custom Perl
scripts coerced into ad hoc refactoring tasks. In particular, these tools do not easily allow the
arbitrary definition of genetic parts and their reassembly into multiple refactored genome variants.
As the goals of T7.2 increase the complexity of gene edits beyond that of T7.1, the team seeks
a new tool to streamline this process of part definition and manipulation.
Software programs for almost all phases of the refactoring process already exist. For the
visualization phase there is a plethora of tools available as stand-alone programs or on the
web. For example, applications such as Vector NTI from Invitrogen and Workbench from CLC Bio
offer excellent visualization features. In addition, the National Center for Biotechnology Information, the
University of California at Santa Cruz, and the University of California at Berkeley all maintain
online genome browsers for visualizing genomes and their annotations. Online tools for the
editing and synthesis phase are already under development at the MIT Registry of Standard
Biological Parts. The Registry supports the synthesis of genes from the parts in its database
using BioBricks and more recently as simple blunt-end concatenation (Sriram Kosuri, personal
from within the web browser (Randy Rettberg, personal communication). Finally, BioViz is a web
means exhaustive, but we are unaware of any tools designed specifically for the dissection
phase of the refactoring process.
This lack of tools for the dissection stage led us to the GeneJax project. Specifically, GeneJax
was intended to let users define and extract arbitrary parts from a natural genome sequence.
This dissection functionality also required a method to visualize newly defined parts and existing
annotations within a genome. As a result, GeneJax had to implement at least a minimal set of the
visualization features already handled by the tools described previously.
The development of GeneJax used Google Maps as the starting point for these visualization
features. We were influenced by Google Maps for two reasons. First, we reasoned that at an
abstract level, the functionality of Google Maps could be applied to the visualization and
navigation of genomic sequences. Both genomes and geographic mapping programs must
provide a localized view of a very large data set at multiple resolutions. Put more plainly, the
map of the human genome is a map and Google Maps is an excellent tool for traversing maps.
These similarities are summarized in the following table.
Displays relevant subset of a larger database
  Google Maps: Maps a point of interest from a map of the entire world.
  Genomes: Maps gene sequence of interest from an entire genome.

  Google Maps: [...] movement allows users to scan adjacent maps without reloading the browser page.
  Genomes: [...] movement allows users to scan adjacent genes or gene base-pairs without reloading the browser page.

  Google Maps: Users can zoom in to street level details, or zoom out to a world map view.
  Genomes: Users can zoom in to see an individual gene or base pairs and zoom out to view the [...]

  Google Maps: Satellite views, road map views, and an overlay [...]
  Genomes: Functional, sequence, and [...]
Second, Google Maps and the accompanying AJAX movement demonstrated that application-like
interactivity can be built into a web-based tool. This would allow us to go beyond the pure
visualization capabilities of Google Maps and add application-like functionality for defining and
manipulating genetic parts. Thus, a Google-like map for genomes was to provide the
navigational foundation upon which we would build the features for defining parts from the gene
In the following sections we describe the details of the GeneJax project and the lessons
learned. We first describe the GeneJax features that have been completed and their
development from an end-user perspective. Next we don our programming hat and explore key
implementation issues faced during the project. Finally, we speculate and propose enhancements
for future versions of the project.
We now describe the current functionality of the GeneJax program, and the strengths and
weaknesses of those features from the perspective of the end-user. This section is not intended
to be a user manual for the program. Instead we critically assess the history, quality and
usefulness of the implemented features, since that information will be the most useful to
designers of future genome refactoring tools. Our discussion starts with GeneJax's layout, and
then moves on to the basic map visualization and navigation features. These features form the
foundation for the part definition functions that we explore at the end of this section.
Before discussing the strengths and weaknesses of GeneJax, it is helpful for the reader to get a
feel for the program's interface. A screen shot of GeneJax running in the Firefox browser is
shown in Figure 3. In the top left a large box forms the viewing pane that displays the gene
sequence, annotations, and parts. Beneath the viewing pane is a map of the entire genome
sequence and a slider indicating the current position of the viewing pane on the map. Beneath
the map is a search field that lets the user search for a region of interest. The lower right
contains status fields that indicate the absolute base pair position of the sequence in the viewing
pane and the current zoom setting. Finally the upper right contains tool buttons that enable the
features of the program.
Figure 3: The GeneJax User Interface. A screen shot of the GeneJax program, identifying
the key features of the program's layout. The program has a viewing pane that displays
genome data; a search box, genome map, and status fields for navigating the genome;
and tool buttons for enabling the different functions of the program.
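The relationship between the viewing pane, the zoom setting, and the slider on the genome map can be sketched as follows. This is only an illustrative model written in Python. The `Viewport` class, its field names, and the choice of bases per pixel as the zoom unit are assumptions, and do not describe how GeneJax itself stores this state.

```python
from dataclasses import dataclass


@dataclass
class Viewport:
    """Hypothetical model of the viewing pane; not GeneJax's actual data structure."""
    genome_length: int      # total bases in the genome
    first_base: int         # 1-based position of the leftmost visible base
    bases_per_pixel: float  # assumed zoom unit: larger values show more of the genome
    width_px: int           # width of the viewing pane in pixels

    def visible_range(self):
        """Return the (first, last) 1-based base positions shown in the pane."""
        span = max(1, round(self.width_px * self.bases_per_pixel))
        last = min(self.genome_length, self.first_base + span - 1)
        return self.first_base, last

    def slider_fraction(self):
        """Return where the pane starts on the whole-genome map, from 0.0 to 1.0."""
        return (self.first_base - 1) / self.genome_length


# 39,937 bp is the length of the wild-type T7 genome; the other numbers are arbitrary.
view = Viewport(genome_length=39937, first_base=10001, bases_per_pixel=2.0, width_px=600)
print(view.visible_range(), view.slider_fraction())
```

Under these assumptions, the status fields would report the result of `visible_range()`, and the slider would be drawn at `slider_fraction()` of the map's width.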
Navigation in GeneJax is similar to the basic map navigation features of Google Maps. The
viewing pane shows the base pair sequence, annotations, and parts for a region of interest on
the genome. By using the MoveTool, users can click-drag to examine nearby regions of the
genome. In the current implementation mouse movements in the vertical direction are ignored