with the European SPINE project).
Data management has been identified as a crucial issue in all large-scale
experimental projects. In this type of project, many different persons manipulate
multiple objects in different locations; thus, unless complete and accurate records
are maintained, it is extremely difficult to understand exactly what has been
done, when it was done, who did it, and what exact protocol was used. All of
this information is essential for use in publications, reusing successful protocols,
determining why a target has failed, and validating and optimizing protocols.
Although data management solutions have been in place for certain focused activities
(e.g., genome sequencing and microarray experiments), they are just emerging
for more widespread projects, such as structural genomics, metabolomics, and
systems biology as a whole. The complexity of experimental procedures, and
the diversity and high rate of development of protocols used in a single center, or
across various centers, have important consequences for the design of informa-
tion management systems. Because procedures are carried out by both machines
and hand, the system must be capable of handling data entry both from robotic
systems and by means of a user-friendly interface. The information management
system needs to be flexible so it can handle changes in existing protocols or newly
added protocols. Because no commercial information management systems have
had the needed features, most structural genomics groups have developed their
own solutions. This chapter discusses the advantages of using a laboratory
information management system (LIMS) both for day-to-day management of
structural genomics projects and for data mining. It also reviews different
solutions currently in place or under development, with emphasis on three systems
developed by the authors: Xtrack, Sesame (developed at the Center for Eukaryotic
Structural Genomics under the US Protein Structural Genomics Initiative), and
HalX (developed at the Yeast Structural Genomics Laboratory, in collaboration
The genome sequencing centers constituted the first large-scale experimental
biological projects, and they were among the first centers to attack the problem
of experimental data management. These projects quickly became automated,
Data Management in Structural
Genomics: An Overview
S. Haquin, E. Oeuillet, A. Pajon, M. Harris, A. T. Jones, H. van Tilbeurgh,
J. L. Markley, Z. Zolnai, and A. Poupon
From: Methods in Molecular Biology, Vol. 426: Structural Proteomics: High-throughput Methods
Edited by: B. Kobe, M. Guss and T. Huber © Humana Press Inc., Totowa, NJ
data dictionaries that are freely available from the Web. With these standards
for data exchange in place, it is possible for different laboratories and centers to
exchange data even though their individual LIMS use site-specific data
representations. The pairing of experimental data with protocols is a critical feature of the
LIMS used in structural genomics. In recognition of this, the PEPCdb
ties all information collected from centers in the US Protein Structure Initiative
Network to specified protocols.
Haquin et al.
and robots were soon used for all experimental steps. These projects required
little in the way of standardization because it was not necessary to compare
experimental protocols. Currently, the most difficult tasks confronting genome
sequencing centers concern the annotation of the genomes; much of this is
automated, but hand curation is still mandatory.
A second large-scale biological activity is in the field of chip microarray
experiments. Because the results of a microarray experiment cannot be con-
sidered without taking into account the experimental protocol, a list of the
minimal data to be registered for these experiments has been developed (1–3).
MIAME (Minimum Information About a Microarray Experiment) has been
largely accepted as a standard, and nearly all LIMSs developed for microarray
experiments are now “MIAME-compliant” (4–6).
Metabolomics represents a third and newer large-scale activity and another
in which experimental protocols are critical. The Metabolomics Standards
Initiative recently recommended that metabolomics studies should report
the details of study design, metadata, experimental, analytical, data process-
ing, and statistical techniques used (7). Capturing these details is imperative,
because they can play a major role in data interpretation (8,9). As a result,
informatics resources need to be built on a data model that can capture all
of the relevant information while maintaining sufficient flexibility for future
development and integration into other resources (10).
In the domain of data management, structural genomics projects share some
of the same problems as microarray and metabolomics experiments. In all three
cases, many targets are manipulated, and the results must be considered within
the context of the experimental conditions. In all such large-scale efforts, data
exchange standardization is a necessity (11,12). These issues were discussed
in the early meetings that led to the development of the US Protein Structure
Initiative, and the Protein Data Bank and BioMagResBank (now partners in the
World Wide Protein Data Bank) jointly developed data structures for capturing
experimental details of protein production and structure determination by x-ray
crystallography and NMR spectroscopy. These PDB standards were made publicly
available and were picked up in part by the European SPINE project and used
in designing its data model (13). The data model has been modified since its
publication in 2005 and the latest version is available within the PiMS project
(http://www.pims-lims.org). The data model the authors have designed allows
the registration of experimental data from any protocol already in use in the
numerous laboratories that have been consulted. It has also been designed to
easily accommodate new protocols. Similar standards were incorporated into the
TargetDB (Target Registration Database, http://targetdb.pdb.org/), which is used
by many of the worldwide structural genomics centers, and PEPCdb (Protein
Expression, Purification and Crystallization Database, http://pepcdb.pdb.org/),
which although publicly available is currently collecting information only from
the US Protein Structure Initiative Network. All of these systems are based on
ments of this kind and their results in a laboratory notebook. Information can
be entered into spreadsheets such as Excel, but because separate spreadsheets
are needed for different stages of the pipeline at different locations, the data
become fragmented. With multiple spreadsheets, it becomes difficult to trace
all of the manipulations carried out with a particular target. Moreover, people
working on a single part of the project are overwhelmed with information that
is not relevant to their task.
Chapter 4 Data Management in Structural Genomics
Projects aiming at developing a LIMS have to keep in mind that laboratory
workers will not gladly use a LIMS just because it serves the scientific
community; it is important that the LIMS simplifies their day-to-day work
and that it becomes essential to their tasks. Because traditional laboratory
notebooks cannot be used in a multiuser and multitask environment, a LIMS
is an attractive alternative. A LIMS can bridge data flow involving individu-
als and robots and can provide very useful information to the experimenter
carrying out a particular step in a complex pipeline. This chapter presents the
rationale for laboratory information systems in structural genomics, and
discusses the desirable features of such systems. It presents detailed informa-
tion about three systems developed in the authors’ laboratories: Xtrack, developed
at the University of Uppsala, Sweden; Sesame, developed by the Center for
Eukaryotic Structural Genomics in the context of the US Protein Structure
Initiative (14); and HalX (15), developed in the Yeast Structural Genomics
Project (16,17) and as part of the SPINE project (13).
2. Rationale and Design Features of a LIMS
for Structural Genomics
As stated, there are two main reasons for using a LIMS in structural genomics.
The first is to facilitate the day-to-day work of the experimenters. The second is
to allow the experimental data to be searched in different ways in the context of
information available from other databases via the Web; this facilitates the man-
agement of projects and data mining used to evaluate the efficiency of protocols
and develop hypotheses to be tested that may lead to improved protocols.
2.1. Limitations of Laboratory Notebooks and Spreadsheets
The laboratory notebook that almost all experimenters use for recording their
day-to-day work is not well adapted to structural genomics projects. The main
reasons are that persons laboring at each stage of a pipeline need to have
access to complete information about the prior history of each target, need to
add their information to this history, and need to pass this information to the
next stage in the pipeline. Each person deals with a large number of targets
over a short period of time and can benefit from the organization of the task
at hand that a LIMS can offer. Typically, a cloning “campaign” in the
Yeast Structural Genomics laboratory, or “workgroup” in Sesame nomenclature,
consists of 96 different protein constructs (corresponding to a 96-well
microtiter plate). These may be investigated as multiple constructs (coding for different
N- and C-termini), subcloned into multiple vectors, or expressed in different
hosts under multiple conditions. Although these tasks are routine from the
structural genomics point of view, they generate a large amount of information
specific to the exact protocol followed. It is highly unwieldy to capture experi-
• The system should carry out routine calculations and organize information
in useful ways; it should be capable of launching large-scale computations
carried out on computer clusters or computational grids.
• The system should be capable of launching experiments carried out on complex
instrumentation, such as crystallization robots and NMR spectrometers,
and of importing the results.
In a typical structural genomics project, many persons work on a given
target over its history. These persons often are not at the same site. Multiple
targets are operated on by multiple persons and multiple robots at multiple
sites. A given instrument can even be programmed to carry out different
tasks. For example, a particular pipetting robot might be used for cloning,
expression, and small-scale purification. This situation is not compatible with
conventional laboratory notebooks or Excel spreadsheets. Even file sharing
is problematic, because different persons cannot work with a file at the same
time. Moreover, with file sharing, it is difficult to ensure the existence of a
single true record of activity. From the preceding discussion, it is clear that
structural genomics projects require information management quite different
from that used in traditional structural biology projects. As discussed in the
following, a well-structured LIMS can provide the solution.
2.2. Design Features of a LIMS for Structural Genomics
The ideal LIMS for structural genomics should have the following properties:
• The system should reside on a commercial-quality relational database
management system to take advantage of the security, data organization,
robustness, and other features these provide.
• It should have a user-friendly Web-based interface so data can be entered and
accessed from multiple sites.
• The interface should provide flexible views of the database configured to
individual stages in the pipeline so that persons entering data see only the
parts relevant to their task and are not overwhelmed by unnecessary detail.
• The system should manage and facilitate data entry; it should associate
entered data with a particular protocol, time-stamp the entry, and record
the name of the person entering the data or launching the run of a robot or
other laboratory instrument. As indicated by the protocol, the outcome of a
given step leads to a decision or “action” that determines the next step to
be followed. The system needs to be capable of recording that action and of
reorganizing targets that follow different pathways.
• The system must make it possible to create variations on a given ORF (e.g.,
different domains, different constructs, chemically modified proteins) and to
trace the relationships. The system must be capable of dealing with structural
targets that consist of two or more different proteins (coexpressed or
combined after isolation) or proteins and ligands.
• The system should be capable of flexible configuration by a laboratory
manager (rather than software developers) so that the system can be tailored
to fit the dynamic needs of a particular laboratory.
• It should be capable of flexible data input from external databases, a variety
of laboratory robots, and laboratory personnel.
• The system should allow the user to search through and organize data from
same expression systems, and same purification protocols. Structural genom-
ics renders such studies possible. A number of comparative studies emerging
from structural genomics centers illustrate how information about the behavior
of multiple targets can lead to improved protocols and better methods for pre-
dicting whether a given sequence is likely to succeed as a structural target.
Canaves et al. (18) presented a study on 539 proteins of Thermotoga
maritima that had been successfully produced and purified in the Joint Center
• The system should manage information about project personnel, laboratory
resources (refrigerators and freezers), reagents used by the project, and prod-
ucts of the project (e.g., cloned genes, plasmids, expression hosts, proteins).
• It should create and read bar codes.
• It should have a flexible query interface that makes it possible to mine the
data in useful ways.
• The system should be capable of creating reports of the type needed by labo-
ratory personnel and project managers.
• Users should be able to attach a wide variety of files (text, images, data) at
relevant positions and to view and manipulate these (e.g., carry out simple
image processing) from the user interface.
• The LIMS should be capable of creating valid depositions for relevant databases,
such as the TargetDB, the PEPCdb, and the World Wide Protein Data Bank.
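The data-entry requirement in the list above (each entry tied to a protocol, time-stamped, attributed to a person or instrument run, and followed by an "action" that routes the target) can be sketched as a small record type. This is only an illustration under stated assumptions: the class name, field names, and sample values are hypothetical and are not taken from Xtrack, Sesame, or HalX.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class StepRecord:
    """One logged pipeline step for one target (all field names are hypothetical)."""
    target_id: str
    protocol: str   # protocol the entered data are associated with
    operator: str   # person entering the data or launching the instrument run
    outcome: str    # result of the step, e.g. "soluble"
    action: str     # decision that determines the next step
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # time stamp of the entry
    )


def group_by_action(records):
    """Reorganize targets according to the pathway selected by their action."""
    pathways = {}
    for rec in records:
        pathways.setdefault(rec.action, []).append(rec.target_id)
    return pathways


# Hypothetical sample data for three targets in one workgroup.
records = [
    StepRecord("T001", "small-scale expression", "alice", "soluble", "scale up"),
    StepRecord("T002", "small-scale expression", "robot-1", "insoluble", "redesign construct"),
    StepRecord("T003", "small-scale expression", "alice", "soluble", "scale up"),
]
pathways = group_by_action(records)
```

Making the record immutable reflects the idea of a single true record of activity: a correction would be logged as a new entry rather than by overwriting an old one.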
2.3. The LIMS as a Project Management Tool
A well-organized LIMS automatically provides information needed by persons
carrying out individual steps in a pipeline. For example, at the target selection
level, it should be capable of displaying information about potential targets
and should be able to sort these according to target selection protocols. At the
small-scale expression level, it should carry forward information about the
ORFs to be tested, the plasmids and expression hosts to be used as specified
by the protocol. At some levels, such as cell growth and protein purification,
tasks can be carried out by technicians who have minimal training. Because
all operations are logged into the LIMS, it is possible for supervisors to create
reports that trace the activities of individual technicians and compare them
with results from the project as a whole. This helps to identify problems with
training and understanding. The use of a LIMS also ensures that no target has
been forgotten at a given stage of the experimental workflow. For example, a
project manager might ask for the list of proteins that were successfully tested for
solubility but did not undergo large-scale production, and for how long
these have been waiting. A LIMS can also help to tackle bottlenecks in the experimental
pipeline. If, for example, proteins spend a long time in the freezer after
large-scale production and before purification, this might mean that additional
purification hardware is needed.
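The project manager's question described above (which proteins passed the solubility test but never went to large-scale production, and how long they have been waiting) maps naturally onto a relational query. The following is a minimal sketch using Python's built-in sqlite3 module. The table layout, stage names, and dates are assumptions made for illustration and do not correspond to the schema of any LIMS discussed in this chapter.

```python
import sqlite3

# Hypothetical one-table schema: one row per logged step of a target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE step (target TEXT, stage TEXT, outcome TEXT, done_on TEXT)")
conn.executemany(
    "INSERT INTO step VALUES (?, ?, ?, ?)",
    [
        ("T001", "solubility", "success", "2007-01-10"),
        ("T001", "large_scale", "success", "2007-01-20"),
        ("T002", "solubility", "success", "2007-01-12"),
        ("T003", "solubility", "failure", "2007-01-12"),
    ],
)


def waiting_for_scale_up(conn, today):
    """Targets that passed solubility but have no large-scale step, with days waited."""
    return conn.execute(
        """
        SELECT s.target,
               CAST(julianday(?) - julianday(s.done_on) AS INTEGER) AS days_waiting
        FROM step AS s
        WHERE s.stage = 'solubility' AND s.outcome = 'success'
          AND NOT EXISTS (
              SELECT 1 FROM step AS p
              WHERE p.target = s.target AND p.stage = 'large_scale'
          )
        ORDER BY days_waiting DESC
        """,
        (today,),
    ).fetchall()


waiting = waiting_for_scale_up(conn, "2007-02-01")
```

Sorting by waiting time puts the longest-stalled targets first, which is the view a manager looking for bottlenecks would most likely want.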
2.4. The LIMS as a Data Mining Tool
It has often been hypothesized that some sequence features, or different
sequence engineering techniques could improve protein expression, solubility,
and stability or could lead to better crystals. However, it was never before
possible to statistically validate these hypotheses. Indeed, such validation
requires having a sufficient number of proteins prepared under the same
conditions: using the same vectors, same constructions (tags in particular),
3. Examples of LIMS Developed for Structural
Genomics and Structural Biology
for Structural Genomics. For each protein, 480 crystallization conditions were
tested, and 464 proteins were crystallized, leading to 86 crystal structures.
From these results, the authors deduced that a subset of 67 crystallization
conditions led to crystals for 86% of the 464 proteins (19). A second paper (18) presented
correlations between crystallization results and different parameters, such as
isoelectric point, sequence length, average hydropathy, or low complexity
regions. The results showed that most proteins successfully crystallized could
be distinguished on the basis of combinations of these parameters. Thus, these
criteria can be used to select for sequences that are more likely to yield crystals
and consequently save time and effort.
Another study carried out at the New York Structural Genomics Center (20)
was based on access to database information about numerous proteins produced
and crystallized under the same experimental conditions. In this study, the authors
used tree-based analyses to discover correlations between properties of the
target and its behavior in the different stages of the experimental workflow.
They showed that the parameters most predictive of success were existence of
homologues in other organisms, the frequencies of charged residues, the group-
ing of hydrophobic residues, the number of interaction partners, and the length
of the protein. Using these parameters to drive target choice should improve the
success rate of the whole platform.
The Center for Eukaryotic Structural Genomics has used data collected in its
LIMS to carry out three studies of this kind. In the first study, DNA arrays were
used to analyze a cDNA library for Arabidopsis thaliana, and it was found that
detection of the presence of an ORF was a good predictor of the success of clon-
ing attempts (21). CESG now uses available gene chip information in choosing
targets. In the second study, a number of software packages were used to predict
the degree of disorder in proteins corresponding to Arabidopsis ORFs. The results
showed that PONDR predictions (22) could be used in target selection to improve
the likelihood of success (23). In the third study, 96 targets were taken in parallel
through CESG’s E. coli cell–based and wheat germ cell–free pipelines, and the
results for expression and solubility were compared. Whenever possible, each tar-
get was labeled with nitrogen-15 for analysis by 15N HSQC NMR spectroscopy.
The results showed that, in comparison with the E. coli cell–based approach, the
wheat germ cell–free approach yielded a significantly larger number of soluble
protein samples suitable for structural analysis by NMR spectroscopy (24).
These studies had an impact on the projects concerned and have contributed
to increasing the efficiency of the experimental workflow. However,
many of the results are, or could be, species specific, and their full demonstration
requires statistical analysis on broader data sets. Extending LIMS usage beyond
structural genomics to classical structural biology laboratories is the only way
to ensure that all such data can be used.
3.1. A Crystallography LIMS: Xtrack
3.1.1. Introduction and Overview
Determining the three-dimensional structure of a protein by x-ray crystal-
lography often involves the collection of a large number of data sets, at
different wavelengths, and from multiple crystals. These crystals can be
produced from stock solutions. The list of stock solutions can be edited. For
convenience, the most commonly used crystallization screens are already available.
The authors have been entering new screens on demand over the past few years.
Entry of new screens is an involved manual process, both because there is no
interface for it and because manufacturers name chemicals inconsistently.
The interface permits the user to fully customize the
content of the drops by changing the proportions of the protein and reservoir
different from one another: native protein, selenomethionine-labeled protein,
and crystals soaked in different heavy metals. Structure determination can
be a lengthy task and often spreads over a few months. Finally, because the
synchrotron where data are collected is frequently far from the laboratory,
the people who collect the data are not necessarily those who determine the
structure, and more than one person may work on the same structure. The
multiuser and multitask problem was present in crystallography before it
arose in molecular biology. This is why the Xtrack LIMS focused first on
Xtrack (25) was originally designed to manage data from crystal to
structure. More recent versions can manage crystallization data and
contain the chemical compositions of the major crystallization screens.
Xtrack can also store data concerning the expression and purification of
the samples used in crystallization. Thus, after completion of the structure,
all data needed for deposition in the PDB are present in the database. The
use of Xtrack is very simple and intuitive. It contains scripts able to read
log files from crystallography programs, such as CNS (26), SCALEPACK
(27), or REFMAC (28).
3.1.2. Parts and Pages
3.1.2.1. Protein Production and Target Description: The protein production
part of Xtrack is separated into three different pages: Cloning, Expression, and
Purification. Each page is very simple and allows only basic data entry of the
kind required for structure deposition. The data that can be entered are:
• For cloning: the cloning vector, the tag if any, the storage location, the protocol
as free text, and notes
• For expression: the organism, the expression system, the vector, the inducer,
the temperature, the duration, the OD prior to induction, the storage location,
and the protocol as free text and notes
• For purification: the purity, the storage location, the picture of the gel, the
protocol as free text, and notes
A Chemistry page provides for entry of various data concerning the protein:
mutations, assembly, ligands, molecular weight, isoelectric point (pI), number
of cysteines, peptide sequence, and nucleic acid sequence. The protein pro-
duction pages of Xtrack are available only for structure deposition needs and
cannot be used as a LIMS for protein production.
3.1.2.2. Crystallization: Xtrack’s crystallization interface handles the setup
of plates, the examination of plates, and searching through plates. The plate
setup interface offers all needed options for setting up a plate, either for a
crystallization screen or screen optimization. The user first chooses a plate
type, the number of drops in each well and names the plate. The interface
(Fig. 4.1) allows for specification of the crystallization screens or gradients
directly through the interface, either by providing a numerical code (0: clear;
1: cloudy precipitate; 2: gelatinous precipitate; 3: spherulites; 4: needles;
5: plates; 6: prisms) or by choosing from drop-down menus; the results of
these two choices are synchronized (Fig. 4.2). For each drop, a photo can
be uploaded. Observations and photos can also be entered as uploaded files.
The required file format is very simple, and the output from most observation
robots can easily be converted to it.
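The numerical observation codes listed above, and the synchronization between the numeric entry and the drop-down entry, can be illustrated with a single lookup table used in both directions. This is a sketch and not Xtrack's actual code (which is written in PHP). Only the seven code/label pairs come from the text, and the function names are hypothetical.

```python
# Observation codes as listed in the text; both entry modes share this one table.
SCORE_LABELS = {
    0: "clear",
    1: "cloudy precipitate",
    2: "gelatinous precipitate",
    3: "spherulites",
    4: "needles",
    5: "plates",
    6: "prisms",
}
# Reverse lookup, so a label picked from a drop-down menu yields the same code.
LABEL_SCORES = {label: code for code, label in SCORE_LABELS.items()}


def to_label(code):
    """Translate a numeric score into its descriptive label."""
    if code not in SCORE_LABELS:
        raise ValueError(f"unknown observation code: {code!r}")
    return SCORE_LABELS[code]


def to_code(label):
    """Translate a descriptive label (any case, padded or not) into its score."""
    key = label.strip().lower()
    if key not in LABEL_SCORES:
        raise ValueError(f"unknown observation label: {label!r}")
    return LABEL_SCORES[key]
```

Because both views derive from the same stored integer, entering `2` in the numeric view and choosing "gelatinous precipitate" in the menu view cannot disagree.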
The plates can then be examined. The user can score the observation
Fig. 4.1 Xtrack plate setup interface. While entering a plate, the user defines which protein
is used (A) and gives the concentration of the solution. The reservoir solutions can be
defined either by using a crystallization screen (E) or by designing gradients (C). The
volumes of reservoir and protein solutions in the drop(s) are defined using the fields and
drop-down menus in (D). Finally, the user can choose which data are displayed on the
page, and which actions on these data are editable, using the drop-down menus in (B).
Fig. 4.2 Xtrack observation entry interface. The observations made on a given
crystallization plate can be entered manually in two different ways: either using a number
(A), or entering the corresponding observations with a drop-down menu (B). Both
interfaces are displayed on the same page and are synchronized. For example, when
“2” is entered for drop 1, well A2, in the simple interface (A), the text “Drop1: Gelatinous ppt”
(Gelatinous precipitate) appears in well A2 on the complete interface (B).
The obtained R value can be registered.
For refinement, a list of programs is proposed, but again additional ones
can be entered. This page also allows for tracking the number of refinement
cycles done, the highest and lowest resolution, the R value, and the R free. At
this stage one can define how the noncrystallographic symmetry was taken
into account. This interface is also used to keep track of the latest coordinates
file and the most recent backup file. “Analysis” concerns the final structure.
A search interface (Fig. 4.3) allows for the listing of crystallization plates
according to various filters, including the project, the protein, the date, or the
person responsible. The list of plates can then be sorted on the basis of the best
score in each plate.
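The filtering and sorting behavior described here can be sketched in a few lines. The sketch assumes that a higher observation code means a better result and that the "Score above" threshold is inclusive, which is how the Fig. 4.3 caption describes it (entering 4 returns needles, plates, or prisms). The dictionary keys and sample plates are hypothetical.

```python
def best_score(plate):
    """Highest observation score recorded for any drop on the plate (0 if none)."""
    return max(plate["drop_scores"], default=0)


def search_plates(plates, score_above=0, project=None, responsible=None):
    """Filter plates by optional criteria, then sort by best score, best first."""
    hits = [
        p for p in plates
        if best_score(p) >= score_above
        and (project is None or p["project"] == project)
        and (responsible is None or p["responsible"] == responsible)
    ]
    return sorted(hits, key=best_score, reverse=True)


# Hypothetical plates; scores use the 0-6 observation codes.
plates = [
    {"name": "P1", "project": "YSG", "responsible": "ap", "drop_scores": [0, 1, 4]},
    {"name": "P2", "project": "YSG", "responsible": "sh", "drop_scores": [0, 2, 2]},
    {"name": "P3", "project": "YSG", "responsible": "ap", "drop_scores": [6, 0, 3]},
]
found = [p["name"] for p in search_plates(plates, score_above=4)]
```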
3.1.2.3. Data Collection and Structure Determination: The data collection
and structure resolution tracking system is divided into seven different pages:
Crystallization, Collection, Data Reduction, Structure Solution, Refinement,
Analysis, and Deposition.
The Crystallization page allows the user to define which crystal is going to
be used for data collection. All data concerning the crystal can be extracted
from the crystallization plate if it is part of the database. If not, the pH, temperature
of crystallization, other conditions, and crystal type and size can be entered
manually. In either case, additional data can be entered: heavy atom or ligand
introduction and cryo conditions.
The Collection page concerns the data collection itself. The data relative to
collection are numerous; however, many of these can be extracted from image
files. Moreover, many fields are linked to one another. For example, upon
choosing one of the listed beam lines, the “collection site” and “source” fill
Similarly, the data to be entered on the Data Reduction page can be extracted
from software, for example, from Denzo/Scalepack files.
The Structure Solution page registers the method used. A drop-down menu
offers the choice of rigid body refinement, molecular replacement, MIR, or
MAD. If another method is used, SAD for example, this can be entered in a
text box. Similarly, a drop-down menu proposes a list of major structure solu-
tion programs; if a nonlisted program is used, this can be entered in a text box.
Fig. 4.3 Xtrack plate searching interface. Different criteria can be used to search
plates in Xtrack’s database: the date (plates done between starting and ending dates),
keywords, project (all plates in a given project), group, responsible, protein, or “Score
above.” For the latter, if “4” is entered, for example, only plates containing needles,
plates, or prisms will be returned.
3.2.1.1. Overall Goals: The overall aims of the Sesame project have been
to develop a LIMS that: (1) provides a flexible resource for storing, recover-
ing and disseminating data that fits well in a research environment (multiple
views of data and parameters); (2) allows the worldwide community of sci-
entists to utilize the system; (3) allows remote collaborations at all stages; (4)
provides full user data access security and data storage security; (5) permits
data mining exercises and analysis of laboratory operations to identify best
Various details concerning the crystal structure can be entered: the number of
atoms of the protein, the error estimates, and the RMSDs for bond lengths and angles.
All of these parameters can be extracted from the last coordinates file.
The final page is filled in after deposition in the wwPDB (http://www.
Xtrack’s user interface is programmed in PHP (a server-side HTML-embedded
relational database. The model contains 10 main tables, corresponding to
the different pages the user has access to: Structure, Chemistry, Expression,
Crystallization, Collection, Data Reduction, Structure Solution, Refinement,
Analysis, and Deposition. The first rows of each table contain the necessary
data for the interface (e.g., field names, field types, field size, default values,
help text). Because most of the data are in the database and not hard coded, the
PHP code can be much more generic. Thus, a field name or a field size can be
changed very easily in the interface.
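The design choice described above, in which field names, types, sizes, defaults, and help text live in the database rather than in the code, can be illustrated with a generic form renderer. Xtrack itself is written in PHP. The Python sketch below only demonstrates the principle, and the metadata keys and sample fields are assumptions rather than Xtrack's actual table layout.

```python
from html import escape

# Hypothetical interface metadata, as it might be read from the first rows of a table.
FIELD_METADATA = [
    {"name": "vector", "type": "text", "size": 40, "default": "",
     "help": "Cloning vector"},
    {"name": "temperature", "type": "number", "size": 6, "default": "37",
     "help": "Expression temperature"},
]


def render_field(meta):
    """Build one HTML input from a metadata row; nothing here is field specific."""
    return (
        f'<label title="{escape(meta["help"])}">{escape(meta["name"])} '
        f'<input type="{escape(meta["type"])}" name="{escape(meta["name"])}" '
        f'size="{int(meta["size"])}" value="{escape(meta["default"])}"></label>'
    )


def render_form(fields):
    """Render every field described by the metadata, one per line."""
    return "\n".join(render_field(m) for m in fields)


form_html = render_form(FIELD_METADATA)
```

Renaming a field or changing its size then means editing one metadata row and leaves the rendering code untouched.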
Xtrack is distributed freely (http://xray.bmc.uu.se/xtrack/). It runs on most
platforms and is easy to install. A demo version and a help file can also be
found on the web site.
3.2. A Generalized LIMS: Sesame
3.2.1. Introduction and Overview
Sesame (14) is a set of software tools for information and hardware man-
agement. It was originally designed for the management of NMR experi-
ments at the National Magnetic Resonance Facility in Madison, WI, and
has been in use there since 1999. With the launch of structural genomics
projects, Sesame was extended to include protein production and structure
determination by x-ray crystallography and NMR spectroscopy. Sesame has
evolved to be capable of handling and tracking back to the levels of gene(s)
and compound(s): full-length genes, domains with varying termini, protein:
protein complexes, and protein:ligand complexes. These developments have
been supported since 2001 as part of the Center for Eukaryotic Structural
Genomics (CESG). More recently, a Sesame module for
metabolomics was developed with support from an NIH Roadmap Initiative
Grant. Information on Sesame is available from its web page (http://www.
Although much of Sesame’s development has been in the context of a
structural genomics project, the software also is suitable for use by individual
investigators pursuing research in structural biology. Six stand-alone instances
of Sesame have been deployed around the world; however, several projects
make use of the Sesame instance in Madison. Sesame is freely available to
academic users and will be open source under a GPL license.
Lab Resources: The lab resources are: the status (e.g., active, inactive,
request), which can be attached to individual items (e.g., workgroup,
protein, sample, plasmid); lab protocols that correspond to different items
and include the protocol name, type, contact, the whole protocol in text form,
actions associated with it, and a protocol tree that contains tag-value nodes;
record type (e.g., pipeline, research and development, medically relevant,
outside user request, inactive); plates described by name, number of rows and
Chapter 4 Data Management in Structural Genomics
practices; and (6) simplifies the dissemination, installation, maintenance,
and interoperability of software.
Sesame Operation. Sesame was designed to support multiple, overlapping
projects (multiple projects involving individual scientists or many
collaborators located at the same or separate sites who may be working on
more than one project). Individuals working in a Sesame environment may
participate in a variety of associations (as ordinary users, members of labs, or
members of collaborative groups).
Users. Each user registers during his or her first access to the system,
choosing a user name and password and supplying full name, e-mail, and
organization. Every user is by default assigned to the “World” collaborative
group and “World” lab, the parts of Sesame that are visible to all registered
Labs. A virtual lab may represent a center, facility, research laboratory,
or research project, and may consist of one or many members. Any user
can create multiple labs or be a member of labs created by others. In a typical
scenario, a principal investigator (PI) initiates a Sesame lab by designating a
member of his or her laboratory to be the “Lab Master,” who then creates the
virtual lab by giving it a name and asking members of the PI’s physical lab
to register as Sesame users. The Lab Master then invites these persons (who
can be located anywhere in the world) to join the Sesame lab. Individual users
may accept or reject the invitation. Sesame users may leave a virtual lab or
be removed by the Lab Master at any time. Thus, a Sesame lab can reflect
the personnel in a real physical lab or project. A given PI may have multiple
virtual labs with separate or overlapping membership. The Lab Master handles
the configuration of the Sesame lab, that is, defines and maintains the member-
ship and lab resources.
In the case of a lab that accepts user service requests (a facility), PIs can
register and request access or be invited to use the facility. The PI maintains
a list of the group members that are permitted to use the facility’s services.
If a Sesame instance supports multiple facilities, a PI may be associated with
more than one facility and can differentially control the access of members to
Collaborative Groups. These are sets of associated users who intend
to share their data. Any user can initiate a group by defining a name for the vir-
tual group. The initiating user becomes the “Group Master” and invites other
users to join. Members of a collaborative group control visibility and/or access
to their data records by defining privilege levels. Three levels are supported:
user, group, and world. A user sees his or her own personal records; if a
member of a collaborative group, the records flagged as visible to members of
that group; and all records carrying the world flag.
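The three-level visibility rule can be illustrated with a minimal sketch. The class and function names below (Privilege, Record, visible_records) are hypothetical and are not taken from Sesame; the sketch only assumes that each record carries an owner, one privilege level, and optionally the name of the collaborative group it is shared with.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Privilege(Enum):
    USER = "user"    # visible only to the owner
    GROUP = "group"  # visible to members of the named collaborative group
    WORLD = "world"  # visible to every registered user


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, for illustration only.
    owner: str
    privilege: Privilege
    group: Optional[str] = None  # collaborative group the record is shared with


def visible_records(user, user_groups, records):
    """Return the records that `user` may see under the three-level rule."""
    visible = []
    for rec in records:
        if rec.owner == user:
            visible.append(rec)  # personal records are always visible
        elif rec.privilege is Privilege.WORLD:
            visible.append(rec)  # world-flagged records
        elif rec.privilege is Privilege.GROUP and rec.group in user_groups:
            visible.append(rec)  # records shared with one of the user's groups
    return visible
```

A user who belongs to no collaborative group would therefore see only personal records and world-flagged records.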
around 7 gigabytes. Monitoring of the second and third Tier servers indicates
that the existing Madison installation has never approached full capacity.
Haquin et al.
columns, row and column labeling type and maker's number; source organism
(genus, species, strain, NCBI taxonomy id); barcode and label definitions for
different items; barcode printers; locations (room, freezer, rack, shelf); 5′ and
3′ primer prefixes; plasmid components (plasmid source, host restriction, tag
and cleavage); lock solvents (NMR); internal reference compounds (NMR);
mass spectrum type; mass spectrometers; stock solutions for crystallization
(precipitant, salt, buffer, additive, pH adjuster, pre-made solution, buffered
precipitant, cryoprotectant) and associated data (reagent name, pH, concentra-
tion, concentration unit, type, molecular weight, form, volume, flush mode,
dispense speed, dispense delay, pKa, number of mixing cycles, and batch
number); robot rack type; robot racks; crystal scores. The lab resources are
only modifiable by the Lab Master; however, by using the “Can Do” resource,
the Lab Master can delegate the maintenance of individual lab resources to
specific lab members. This imparts extraordinary flexibility to the system and
enables the virtual lab to adapt to changes in the operation of the associated
real-world lab. Each Lab Master can export or import lab resources; this ena-
bles the dissemination of protocols and simplifies the setup of a new lab.
Objects and Records. Objects in a Sesame lab can be thought of as
those available to all lab members (e.g., reagents on shelves, instruments on
benches, shared laboratory protocols). Objects attributed to a given lab are
visible to all lab members, but only to them. Lab data are not visible to the
world or to other labs. Each record (e.g., sample description) belongs to the
person who created it. Other lab members can copy it and then modify the
newly created record for their own use. Only the owner can update key fields
or delete a record. Records usually contain fields that can be modified by other
lab members: these include “actions” (reports on steps carried out), locations
and quantities of intermediates (e.g., plasmids, cell pastes, purified protein,
crystals, and solutions), and images (gel scans, spectra). Files can be attached
to records by any lab member. Lab Masters are able to control the accessibility
of records within the lab.
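The ownership rules described here can be sketched as follows. This is an illustration under stated assumptions, not Sesame's implementation: the class LabRecord, its methods, and the field identifiers in SHARED_FIELDS are hypothetical, and the additional control exercised by Lab Masters is not modeled.

```python
from dataclasses import dataclass, field

# Fields that any lab member may modify, following the description above.
# The identifiers themselves are hypothetical.
SHARED_FIELDS = {"actions", "location", "quantity", "images", "files"}


@dataclass
class LabRecord:
    owner: str
    lab_members: set
    data: dict = field(default_factory=dict)

    def can_update(self, user, field_name):
        """The owner may update any field; other lab members only shared fields."""
        if user == self.owner:
            return True
        return user in self.lab_members and field_name in SHARED_FIELDS

    def can_delete(self, user):
        """Only the owner may delete a record."""
        return user == self.owner

    def copy_for(self, user):
        """A lab member copies the record and becomes the owner of the copy."""
        if user not in self.lab_members:
            raise PermissionError(f"{user} is not a member of this lab")
        return LabRecord(owner=user, lab_members=set(self.lab_members),
                         data=dict(self.data))
```

Copying rather than editing in place leaves the original record under its owner's control while still letting other lab members reuse its content.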
Data and System Access Security. Passwords are encoded in the
client using the MD5 message-digest algorithm. The unencoded form of
the password never leaves the client, nor is it stored there; only the encoded
password is stored in the user’s profile in the database. Security is maintained
over all tiers throughout Sesame. Twice weekly, backups of the UW-Madison
Sesame server are performed to a Network Attached Storage unit (RAID)
located in a separate building.
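Client-side encoding of this kind can be sketched with a standard MD5 routine. The text does not specify the character encoding or the form in which the digest is transmitted, so the use of UTF-8, the hexadecimal representation, and the function and key names below are assumptions made for illustration.

```python
import hashlib


def encode_password(password: str) -> str:
    """Return the hexadecimal MD5 digest of a password (UTF-8 assumed)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def login_payload(user_name: str, password: str) -> dict:
    """Build what a client might send: the digest, never the plain password.

    The payload layout is hypothetical.
    """
    return {"user": user_name, "password_md5": encode_password(password)}
```

Only the 32-character digest appears in the payload, which matches the statement that the unencoded password does not leave the client.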
Sesame Usage. In addition to the UW-Madison Sesame instance
described here, six external (stand-alone) Sesame instances have been
installed. The UW-Madison Sesame instance currently has over 1,000 reg-
istered users who are members of more than 60 labs. The current content of
the relational database is around 400 megabytes, and the file server contains
Sesame Reliability. The Madison Sesame installation has proved
exceptionally robust and stable. The Sesame system has been fully operational
except during software upgrades and other routine maintenance. Over the past
4 years, the system has logged fewer than 7 lost days of operation (99.5%
ticular record, when and who modified it last, the status and type, external ID,
user label, user- and time-stamped info entries, all linked items, and all attached
files and images. This flexible combination of bioinformatics information,
actions, images, files, and links to other records creates a comprehensive compu-
terized record of all phases of a project accessible to all members of the lab.
Sesame users can retrieve data from various Views as CSV, FASTA, XML,
or formatted text files and at the Module level as XML files. These files are
reliability); most of these were the result of network related problems. Thus far,
the second Tier of Sesame has never crashed or locked up.
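The reliability figure of 99.5% follows directly from fewer than 7 lost days over 4 years. The check below assumes a 365.25-day year and uses the upper bound of 7 lost days, so the result is a lower bound on availability; the function name is illustrative.

```python
def availability(lost_days: float, years: float, days_per_year: float = 365.25) -> float:
    """Fraction of the period during which the system was operational."""
    total_days = years * days_per_year
    return 1.0 - lost_days / total_days


# 7 lost days out of 1,461 days
print(f"{availability(7, 4):.1%}")  # prints 99.5%
```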
3.2.2. Sesame Modules and Views
To simplify and customize the system, users access Sesame through
“Modules” that are tailored to specific kinds of experimentation. A Module
consists of a collection of “Views” that handle data particular to specific stages
of experimentation. The Modules and Views provide simplified graphical
user interfaces for data entry and retrieval (suitable for use with notebook or
handheld computers); however, all data are stored in a single relational data-
base. Current Sesame modules include: Sheherazade (structural genomics),
Jar (outside user requests to a structural genomics facility), Well (crystalliza-
tion trials), Camel (NMR spectroscopy), Lamp (metabolomics), Rukh (yeast
two-hybrid molecular interactions), and Sundial (shared resource scheduling
and management). These modules are described briefly in the following. All
Sesame modules are publicly available from the Sesame web site <http://
www.sesame.wisc.edu> ready for immediate use by all registered users. The
features of Sesame are documented with Web-based help pages.
Modules share many common features. Users can attach any number of
files and images (e.g., gel scans, images of spectra, text files) to most record
types. Images can be attached to every individual data record and viewed,
manipulated, and printed. For most Modules, the Lab Master maintains a
set of lab resources that are used to construct Sesame records. Records in all
Modules can be searched and retrieved on the basis of a variety of criteria
(e.g., owner, type, status, components, location, different IDs, date created,
date last modified).
Sesame supports data input from various devices, including barcode read-
ers, plate readers and NMR spectrometers. Data can be exported in various
forms, including barcoded labels, robot work lists, and text or XML reports of
Almost all Sesame Views contain “lab protocol” and “actions” cells that allow
lab-specific fields to be defined and capture information about work carried
out. To document lab activities, the Lab Master is able to define controlled
vocabulary terms called actions. Actions are fully definable to meet the
objectives of the project and can describe different steps for different protocols,
such as a PCR, protein expression, protein purification, or the transfer of a
sample to a collaborator, or can note that an instrument is down and a repair is
in progress. The record contains the name of each person adding an action and a
time stamp indicating when it was entered. This information can be used to
analyze productivity, determine bottlenecks in procedures, identify who should
be coauthors on publications, or see everything that has been done with a given
ORF, protein, sample, or other Sesame item.
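The kinds of analysis mentioned here can be illustrated with a small sketch over a list of user- and time-stamped actions. The Action structure and the three helper functions are hypothetical; they assume only that each action stores the item it refers to, a controlled-vocabulary term, the user, and a time stamp.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Action:
    # Hypothetical layout of one logged action.
    item_id: str         # ORF, protein, sample, or other item
    term: str            # controlled-vocabulary term defined by the Lab Master
    user: str            # person who added the action
    timestamp: datetime  # when the action was entered


def history(actions, item_id):
    """Everything recorded for one item, oldest first."""
    return sorted((a for a in actions if a.item_id == item_id),
                  key=lambda a: a.timestamp)


def contributors(actions, item_id):
    """Number of actions each user recorded for one item."""
    return Counter(a.user for a in actions if a.item_id == item_id)


def slowest_step(actions, item_id):
    """Return (previous term, next term, elapsed time) for the longest gap, or None."""
    steps = history(actions, item_id)
    gaps = [(a.term, b.term, b.timestamp - a.timestamp)
            for a, b in zip(steps, steps[1:])]
    return max(gaps, key=lambda g: g[2]) if gaps else None
```

The longest gap between consecutive actions is only a rough indicator of a bottleneck, since it reflects when actions were entered rather than when the work was done.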
Every Sesame View contains fields that show when and who created the par-
The “Workgroup” View handles a set of ORFs that have been selected for
experimentation. (The authors plan to generalize the concept of workgroups
to handle other sets of Sesame items of the same type, e.g., protein, crystal.)
The number of targets in the set is variable, but typically is 96 (or a multiple
thereof) for compatibility with plates used by robots and high-throughput
instrumentation. Once a workgroup of ORFs is formed, the Workgroup View
can be used to construct primers for PCR, to create orders for primers, calculate
readable by many spreadsheet software packages and are amenable to most
general-purpose computational exercises.
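Exports of this kind (the chapter lists CSV, FASTA, XML, and formatted text) are simple to produce and to consume. The sketch below shows one plausible way to serialize records to CSV and FASTA; the record keys ("id", "sequence", "owner") and the 60-character line width are assumptions, not a description of the files Sesame writes.

```python
import csv
import io


def to_csv(records, fields):
    """Serialize a list of dict records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_fasta(records, id_field="id", seq_field="sequence", width=60):
    """Serialize records to FASTA: a '>' header line, then wrapped sequence lines."""
    lines = []
    for rec in records:
        lines.append(f">{rec[id_field]}")
        seq = rec[seq_field]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

The CSV output can be opened directly in a spreadsheet, and the FASTA output can be passed to common sequence analysis tools.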
Sesame can generate specialized reports for depositing data into public
databases (e.g., PEPCdb and TargetDB) and tracking progress on a lab/center
level. Sesame currently is capable of generating the XML file required for
depositing information into the first incarnation of PEPCdb. This file includes
the protocols used to produce proteins for structure analysis, a dated status history
for each protein target and a failure remark when appropriate, information
on each trial used to produce a protein, references to target-related database
accession codes, and information on the DNA and protein sequence. To be
compliant with future PEPCdb requirements, an enhanced version is planned
that will output quantitative information on protein cloning, purification, and
crystallization trials from data captured in existing Sesame Views.
Sesame can print barcodes and supports queries based on scanned
barcodes. Printed barcodes can be attached to physical items, such
as mass spectrometry samples, vials containing plasmids, NMR samples, and
multiwell plates, connecting them to data records held inside the Sesame
database. The Lab Master can customize the barcode and label definition,
adding the contents of selected record fields (e.g., user name, date, linked record
database id) to the unique database id.
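A label definition of this kind amounts to concatenating the unique database id with the contents of the selected fields. The sketch below is purely illustrative: the separator, the field names, and both function names are hypothetical, and the actual label layout used by Sesame is not described in the text.

```python
def make_label(db_id, record, extra_fields=(), sep="|"):
    """Compose label text: the unique database id followed by selected fields."""
    parts = [str(db_id)]
    parts.extend(str(record[name]) for name in extra_fields)
    return sep.join(parts)


def id_from_label(label, sep="|"):
    """Recover the database id from scanned label text for use in a query."""
    return label.split(sep, 1)[0]
```

Keeping the database id as the first component means a scanned label can always be resolved to its record, whatever additional fields the Lab Master chooses to print.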
Sheherazade is the client Module that contains Views relevant to structural
and functional genomics/biology. Sheherazade supports progress tracking and
report generation. Views under Sheherazade include: “ORF,” “ORF Details,”
“Workgroup,” “Primer,” “Plasmid,” “Protein,” “Sample,” “Compound,” “Mass
Sample,” “Screen,” “Crystal,” “NMR Experiment,” “NMR Assignment,”
“Structure,” “NMR Deposition,” “X-Ray Deposition,” “Software,” “Hardware,”
“Vendor,” “Recipe,” “Amino Acid Calculator,” “Citation,” “DBNO List,” “Job
Request,” and “Target Request.” The different Views provide partially overlap-
ping information. This makes it possible to traverse all parts of a project, both
horizontally (following all the links and branches between different steps) and
vertically (viewing and editing all the available records in a particular step of
The “ORF” and “ORF Details” Views handle ORF records that contain
available sequence information and annotation captured from genomics and
bioinformatics web sites, for example, the target score, source organism, gene
locus and gene locator, different IDs (e.g., main, lab, swissprot), name, cat-
egory, predicted structure class, structure known, nucleotide code and length,
amino acid code and length, number of introns, molecular weight, calculated
absorbance coefficient (ε280), pI, amino acid composition/count, signal peptide
length, closest homologue, and actions applied in a lab. All these fields can
be updated as the coverage and annotation improve. The ORF records can be
searched, sorted, and grouped by various criteria, including name, number of
introns, and sequence length.
of the solution, which is used to determine the pipetting speed).
“Crystal” View is for keeping track of crystals obtained from crystallization
screens. The crystal-specific fields are: screen-, well- and droplet-number,
pin and vial barcode, dimensions, morphology, effective and best resolution,
quality, space group, unit cell, and parameter reliability.
“NMR Experiment” View is for managing NMR experiments. This tool
provides a solution to the problem of storing, recovering, and disseminating
digestion patterns, and attach images (e.g., gel scans). ORFs within a work-
group can also be linked directly to records from other Views, allowing data
derived from experiments and calculations to be captured (see the following).
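Because workgroups typically hold 96 targets or a multiple thereof, each ORF in a workgroup can be mapped to a plate and well. The sketch below assumes the common 8 x 12 layout with rows lettered A to H and wells filled row by row; the function names and the filling order are assumptions rather than Sesame's documented behavior.

```python
import string


def well_name(index, rows=8, cols=12):
    """Map a 0-based position within a plate to a well name such as 'A1' (row-major)."""
    if not 0 <= index < rows * cols:
        raise ValueError("index outside plate")
    row, col = divmod(index, cols)
    return f"{string.ascii_uppercase[row]}{col + 1}"


def assign_wells(orf_ids, rows=8, cols=12):
    """Assign each ORF to (plate number, well), filling plates in order."""
    per_plate = rows * cols
    return {orf: (i // per_plate + 1, well_name(i % per_plate, rows, cols))
            for i, orf in enumerate(orf_ids)}
```

With this convention the 97th ORF of a workgroup lands in well A1 of a second plate.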
“Primer” View is designed to handle primer records. The primer-specific
fields are: primer name, purpose, restriction enzyme, location (room, freezer,
tower box, and position in box), reading direction, genetic code, length, melting
temperature, stock concentration and PCR concentration, and date of
“Plasmid” View handles plasmid records. Some of the plasmid-specific
fields are: protein name, clone type, clone utility, clone tissue source, insert
source, stock type, variant name, host, host requirement, selectable marker,
detailed construct description, amino acid sequence and data, and description
of N-terminus modification during cloning.
“Protein” View manages information about the expression and purification of
proteins. Protein records contain data about protein name, cleaved, uncleaved,
and quality-controlled amino acid sequences (including automated calculation
of a variety of physical parameters for each sequence), isotope labeling, storage
location, and external ids.
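As an example of the physical parameters that can be calculated automatically from a sequence, the sketch below counts residues and estimates the absorbance coefficient at 280 nm. The text does not state which method Sesame uses; the widely used approximation of 5500 per tryptophan, 1490 per tyrosine, and 125 per cystine (in M^-1 cm^-1) is assumed here, and the function names are illustrative.

```python
from collections import Counter


def composition(sequence):
    """Count each residue in a one-letter amino acid sequence."""
    return Counter(sequence.upper())


def extinction_280(sequence, cystines=0):
    """Estimate the molar absorbance coefficient at 280 nm (M^-1 cm^-1).

    Assumed approximation: 5500 * nTrp + 1490 * nTyr + 125 * nCystine.
    The number of disulfide-bonded cystines must be supplied by the caller.
    """
    counts = composition(sequence)
    return 5500 * counts["W"] + 1490 * counts["Y"] + 125 * cystines
```

Because the estimate depends only on the sequence, it can be recomputed separately for the cleaved and uncleaved forms of a protein.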
“Compound” View is designed to track compounds. The compound-specific
fields are: name, formula, molecular weight, isotope labeling, details, URL, stor-
age temperature, storage location, synonyms, external ids, and amount tracking.
“Sample” View is designed to handle physical laboratory samples; describe
where they are kept; specify the components (constituent type, name, concen-
tration, and isotope labeling), pH, and ionic strength; and track changes in the
location and quantities of a stored sample.
“Mass Sample” View manages information about samples used for mass
spectrometry (amino acid sequence if the sample is a protein; expected molec-
ular mass, concentrations of protein, buffer, and salt; list of impurities) and
the results of the analysis (instrument; spectrum type, experimental mass[es],
images of the obtained spectra, and raw data files).
“Screen” View handles information about x-ray crystallization and NMR
sample solubility screens. It allows users to design screens and keeps a
detailed description of the conditions of the trial (e.g., sample[s] used, volume,
pH, date, temperature, droplet additives, screen type, all the screen component
details and volumes, and location of the bar-coded plate), along with informa-
tion on the observed results and progress toward obtaining diffraction-quality
crystals or NMR samples suitable for solution structure determinations. The
Screen View enables the user to store multiple images of the droplets, exam-
ine them using the included image processing tools, and finally score them.
The Screen View can generate worklists for the Tecan Genesis and Gilson/
Cyberlab C-200G robots. In addition to the screening parameters, the Screen
View and the associated robot-related Lab Resources maintain all the param-
eters required to operate the robot (racks and the solutions in them, viscosity