
Model-Based Global Analysis of Heterogeneous Experimental Data Using gfit

Mikhail K. Levin, Manju M. Hingorani, Raquell M. Holmes, Smita S. Patel, and John H. Carson

Summary

Regression analysis is indispensable for quantitative understanding of biological systems and for

developing accurate computational models. By applying regression analysis, one can validate models

and quantify components of the system, including ones that cannot be observed directly. Global

(simultaneous) analysis of all experimental data available for the system produces the most

informative results. To quantify components of a complex system, the dataset needs to contain

experiments of different types performed under a broad range of conditions. However, heterogeneity

of such datasets complicates implementation of the global analysis. Computational models

continuously evolve to include new knowledge and to account for novel experimental data, creating

the demand for flexible and efficient analysis procedures. To address these problems, we have

developed gfit software to globally analyze many types of experiments, to validate computational

models, and to extract maximum information from the available experimental data.

Keywords

Regression analysis; Computational model; Curve fitting; MATLAB; Computer simulation; Least-squares

1. Introduction

Computational models play increasingly important roles in biology. Constructing a model that

accurately represents the mechanism of a system, reliably simulates its behavior, and has well-

defined parameter values is the ultimate goal of many research projects. Models are used for

interpreting experimental observations, testing hypotheses, integrating knowledge,

discovering components responsible for certain behavior, designing more informative

experiments, and making quantitative predictions (1). Remarkably, computational models act

both as tools for studying biology and as representations of the resulting knowledge. Indeed,

quantitative mechanistic information incorporated into a model allows it to make predictions

outside the domain of existing observations.

The focus of this chapter is on understanding experimental data and extracting useful

information from it. The role of a model in this process is to postulate a relationship between

conditions of experiments and the observed results. Using regression analysis, different models

can be tested for their ability to explain the experimental observations, and their parameters

can be estimated. Thus, regression analysis ties together models and data, validating the former

and extracting information from the latter (2,3).

Unfortunately, practical application of this procedure to biological systems can be complicated.

As will be shown in this chapter, even relatively simple models may contain too many
parameters to estimate based on a single experiment of any type. Therefore, to test whether the
model is consistent with the data and to determine its parameters, data from multiple
experiments need to be analyzed globally, while applying all known constraints to the values
of parameters (4).

Published in final edited form as: Methods Mol Biol. 2009; 500: 335–359. doi:10.1007/978-1-59745-525-1_12. Author manuscript; available in PMC 2010 April 7.

In this chapter, we discuss the challenges associated with practical application of regression

analysis to biological systems. The problems we describe are exacerbated in complex models

and experimental designs, and thus are especially frustrating for quantitative biologists. We
describe our software, gfit, which helps to overcome these problems, and illustrate its utility
with three biological systems of increasing complexity.

1.1. Regression Analysis

Regression analysis includes a range of methods for establishing a model that accurately

represents a system and makes accurate predictions of its behavior. The specific tasks include

searching for optimal parameter values, testing whether the model agrees with experimental

data, estimating parameter confidence intervals, testing whether more experimental data are

needed, detecting outlier points, and selecting the preferred model from two possible ones. In

regression analysis, model F is defined as a quantitative relationship between experimental
measurements (dependent variables) Y and experiment conditions (independent variables) C,

$$Y = F(C, x) + \varepsilon \tag{1}$$

where x is a vector of model parameters (variables affecting behavior of the system that cannot
be controlled or directly observed during experiment), and ε is a set of measurement errors
(see Note 1) (5).

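Goodness of fit, the closeness of model simulations to the measurements, is quantified by
an objective function S(x). The most commonly used objective function is the sum of squared
residuals (see Note 2),

$$S(x) = \sum_i \left(Y_i - F(C_i, x)\right)^2 \tag{2}$$

or, in the case of nonuniformly distributed ε, a weighted sum of squared residuals,

$$S(x) = \sum_i w_i \left(Y_i - F(C_i, x)\right)^2 \tag{3}$$

where the sum runs over all measurements i and the weight $w_i$ reflects the reliability of
measurement i (commonly the reciprocal of its error variance).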

Note 1: In the regression analysis literature, dependent variables may be referred to as response or observed variables; independent variables may be referred to as predictor or explanatory variables.

Note 2: A residual is the difference between the experimental measurement and the simulation produced by the model.

Curve fitting is the problem of finding the parameters x that produce the best fit, that is,
minimize the objective function:

$$\hat{x} = \arg\min_{x} S(x) \tag{4}$$

Curve fitting is an optimization problem, performed by optimization engines. Many tasks of

regression analysis are based on curve fitting.
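As a minimal sketch of Eqs. 2 and 4 (not the gfit implementation), the following MATLAB script fits a two-parameter exponential decay by minimizing the sum of squared residuals with fminsearch, which is part of base MATLAB. The model F, the conditions C, and the synthetic noise-free measurements Y are assumptions made only for illustration.

```matlab
% Hypothetical model: single-exponential decay with rate constant x(1)
% and amplitude x(2). C holds the experiment conditions (time points).
F = @(C, x) x(2) .* exp(-x(1) .* C);

% Synthetic, noise-free "measurements" generated with known parameters.
C = (0:0.5:5)';
xTrue = [0.8; 2.0];
Y = F(C, xTrue);

% Objective function S(x): sum of squared residuals (Eq. 2).
S = @(x) sum((Y - F(C, x)).^2);

% Curve fitting (Eq. 4): search for the x that minimizes S(x).
x0 = [0.3; 1.0];  % starting parameter values
opts = optimset('TolX', 1e-10, 'TolFun', 1e-12, ...
                'MaxFunEvals', 5000, 'MaxIter', 5000);
[xBest, Smin] = fminsearch(S, x0, opts);
```

With real data, Y would contain measurement errors, so the minimum of S(x) would be greater than zero and the recovered parameters would only approximate the true values.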

1.2. Applying Regression Analysis to Experimental Data

One common obstacle to broader application of regression analysis to biological problems is

failure of many models to directly simulate the experimentally observed variable. For example,

a typical system model may simulate concentrations of reacting species, values that are rarely

observed in an experiment directly. One way of addressing this discrepancy is to convert

measured values into the type simulated by the model. However, such conversions often

introduce statistical errors and are not always possible. The better solution is to simulate exactly

the same value type as measured in the experiment. To achieve that, separate experiment

models may be required. Experiment models use the system model to simulate the system's

response to manipulations and the experimentally measured signal (see Fig. 1). The approach

of separating system models and experiment models is used in Virtual Cell software (6).

A curve fitting procedure for a heterogeneous dataset can be quite complex and require

extensive communication between its entities, i.e., model, optimization engine, experiment

conditions, measurements, parameters, and constraints (see Fig. 2A). Before a search for

optimal parameter values can begin, the data for each experiment have to be examined:

– To determine which variables need to be simulated and their sizes

– To check that the data required for the simulation have been provided

– To check against constraints on variable dimensions and values imposed by the model

– To determine what parameters can be estimated and to choose their starting values

Once the data have been examined, the optimization procedure can be initiated by passing a
vector of starting parameter values to the optimization engine. Depending on the engine type,
parameter constraints can also be provided. The engine conducts optimization by repeatedly
changing parameters and recalculating the objective function on the basis of experimental
measurements and simulations. To simulate each experiment, the input data for the model have
to be assembled from applicable optimization parameters and experiment conditions. The input
data also have to be checked against the constraints, since not all of them can be enforced by
optimization engines. After simulating all experiments, the appropriate objective function can
be computed and used by the optimization engine to determine the direction of the search.
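The data flow described above can be sketched in a few lines of MATLAB. This is an illustration under stated assumptions, not gfit code: the two experiment types, the struct fields C, Y, and sim, and the synthetic data are all hypothetical. Each experiment carries its own simulation function, all experiments share one parameter vector, and the global objective is the sum of squared residuals over every experiment.

```matlab
% Two hypothetical experiment types sharing one system parameter, Kd.
% Experiment 1 reports the bound fraction directly; experiment 2 reports
% a signal that is scaled by its own amplitude parameter.
bound = @(L, Kd) L ./ (Kd + L);
expts(1).C = [0.1; 0.5; 1; 5; 10];
expts(1).sim = @(C, x) bound(C, x(1));
expts(2).C = [0.2; 2; 20];
expts(2).sim = @(C, x) x(2) .* bound(C, x(1));

% Synthetic measurements generated from known parameters [Kd; amplitude].
xTrue = [1.5; 40];
for k = 1:numel(expts)
    expts(k).Y = expts(k).sim(expts(k).C, xTrue);
end

% Global objective: sum of squared residuals over all experiments.
globalS = @(x) sum(arrayfun(@(e) sum((e.Y - e.sim(e.C, x)).^2), expts));

% fminsearch is unconstrained, so positivity of both parameters is imposed
% here by optimizing their logarithms (x = exp(p)).
opts = optimset('TolX', 1e-10, 'TolFun', 1e-14, ...
                'MaxFunEvals', 10000, 'MaxIter', 10000);
pBest = fminsearch(@(p) globalS(exp(p)), log([1; 10]), opts);
xBest = exp(pBest);
```

Optimizing log-parameters is only one simple way to handle a constraint that the engine itself cannot enforce; an engine that accepts bounds could take the constraint directly.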

The curve fitting procedure follows complicated rules that depend on the computational model,

experimental data, and optimization engine. In addition, parameter constraints need to reflect

various considerations related to the research project. These factors make the analysis

procedure not only complex, but also highly variable, making design and maintenance of

project-specific software prohibitively expensive. Fortunately, the patterns of data flow during

regression analysis are largely independent of the system under investigation. This fact allowed

us to design software that solves the analysis problem generally and for any model type.

1.3. Design of gfit

The purpose of gfit is connecting models with various types of experimental data. First, it

simplifies the model's task of directly simulating experimentally observable variables. Second,

during regression analysis, gfit maintains communications between the analysis components,

acting as a mediator (see Fig. 2B). Third, by defining standard application interfaces for models,


optimization engines, objective functions, and other entities, it facilitates customization of the

analysis procedure.

Of all components, application interfaces of models represent the biggest problem. Almost

every step of regression analysis procedure depends on what information is required and

produced by the model. Yet, every model has different inputs and outputs. To be able to perform

regression analysis with any kind of computational model, gfit uses a metadata approach. Any

model used by gfit is expected to have an attached Model Description (see Note 3) defining its

inputs and outputs as sets of variables (see Note 4). More information about Model Descriptions

is provided later in this chapter. Once the rules for performing simulations with the model are

known, the analysis process becomes more straightforward and independent of the model type

(Fig. 3).
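To make the metadata idea concrete, the sketch below builds a small description of model inputs as a MATLAB struct and uses it to validate candidate values before a simulation. The field names (name, unit, numDims, range) and the helper isValid are hypothetical and chosen for illustration; they do not reproduce the actual Model Description format used by gfit.

```matlab
% Hypothetical metadata record; field names are illustrative only and do
% not reproduce gfit's actual Model Description format.
desc.name = 'eq_binding';
desc.version = '1.0';
desc.inputs(1) = struct('name', 'Et', 'unit', 'M', 'numDims', 0, 'range', [0 Inf]);
desc.inputs(2) = struct('name', 'Lt', 'unit', 'M', 'numDims', 1, 'range', [0 Inf]);
desc.inputs(3) = struct('name', 'Kd', 'unit', 'M', 'numDims', 0, 'range', [0 Inf]);

% A value is acceptable if every element lies within the declared range
% and, for scalar variables (numDims == 0), it has exactly one element.
inRange = @(v, d) all(v(:) >= d.range(1)) && all(v(:) <= d.range(2));
sizeOk  = @(v, d) d.numDims > 0 || numel(v) == 1;
isValid = @(v, d) inRange(v, d) && sizeOk(v, d);

okLt  = isValid([1e-6 2e-6 5e-6], desc.inputs(2));  % vector of ligand totals
badKd = isValid(-1e-6, desc.inputs(3));             % negative Kd is out of range
```

Because the checks read only the metadata, the same validation code can serve any model that supplies such a description.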

Regression analysis is a complicated process with many pitfalls. gfit strives to provide

information that can help researchers avoid mistakes related to the analysis. In the protocols

that follow, the reader will build simple models and use the existing models and experimental

data for parameter estimation.

2. Materials

2.1. Software Requirements

1. Version 6.5 or later of MATLAB (Mathworks, Natick, MA), a common science and engineering computing software, is required for running simulations.

2. MATLAB Optimization Toolbox (Mathworks) is required for regression analysis.

2.2. Installation of gfit

1. Download the latest version of gfit from http://gfit.sourceforge.net. The zip archive contains the gfit.jar library and other files required for interaction of gfit with MATLAB.

2. Unzip the file to a convenient location on your hard disk. For this chapter we will assume location C:/. Folder C:/Mgfit will be created.

3. Start MATLAB.

4. Change MATLAB's current directory to C:/Mgfit.

5. To start installation, type mgfit in MATLAB's command line and press Enter.

6. Respond Yes to the query about adding C:/Mgfit to MATLAB's path.

7. Restart MATLAB if requested.

8. After installation, the same command, mgfit, will bring up the gfit user interface window.

Note 3: A Model Description is metadata attached to gfit models that defines their correct usage. It contains the model name, version, general human-readable comments about the purpose of the model and its algorithm, and, most importantly, machine-readable descriptions of the model's input and output variables. For each variable, it specifies name, type, physical unit, dimensions, and a range of acceptable values. Variable dimensions are defined either as constants or in relationship to another variable dimension or index variable. Variables may change their size depending on experimental data and user input. Dimensions of each variable usually change in concert with dimensions of other variables.

Note 4: A variable (in the gfit context) is an array of elements (numbers) defined in the Model Description. A variable may contain a single element (scalar variable, 0D), a vector of elements (1D), a matrix (2D), etc. Variables are used for storing information about an experiment, for passing data to the model, and for receiving data simulated by the model. Depending on the Model Description, each variable dimension may be fixed, or vary individually or in concert with other variable dimensions. This property of variables increases the flexibility of gfit models.


2.3. Installation of rsys Library

The rsys library is used for solving ODEs for mass-action reaction systems, as described in
Subheading 3.3.

1. Download the latest version of rsys for your operating system from http://gfit.sourceforge.net.

2. Unzip the file and put the library file anywhere on the MATLAB path, for example, in the C:/Mgfit folder.

2.4. Data and Models

A zip archive containing all model and data files mentioned in this chapter can be downloaded

from http://gfit.sourceforge.net. The data files included are in tab/newline-delimited format.

These files can be opened in a text editor, but it is more convenient to view them in a

spreadsheet. Please check the readme.txt file for the most current information.

3. Using gfit: Examples

3.1. Simple Model Example: Equilibrium Binding

In this section, we will create a model for equilibrium binding of a protein, E, to a ligand, L,

(Eq. 5) and use it for analysis of experimental data. This analysis is quite simple and can be

accomplished with many existing programs (including commonly used spreadsheet

applications). We will use it to illustrate the principles of data analysis and model validation

used by gfit, and later apply them to more interesting examples.

$$E + L \rightleftharpoons EL, \qquad K_d = \frac{[E][L]}{[EL]} \tag{5}$$

If only the total concentrations of E and L are known, $E_T = [E] + [EL]$ and $L_T = [L] + [EL]$,
the equilibrium concentration of the complex can be written as

$$[EL] = \frac{K_d + E_T + L_T - \sqrt{(K_d + E_T + L_T)^2 - 4 E_T L_T}}{2} \tag{6}$$
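For readers who want the intermediate steps, Eq. 6 can be derived as follows, assuming the standard definition of the dissociation constant, $K_d = [E][L]/[EL]$. Substituting the mass-balance relations for the free concentrations gives a quadratic equation in $[EL]$:

```latex
\begin{align*}
K_d &= \frac{(E_T - [EL])\,(L_T - [EL])}{[EL]} \\
0 &= [EL]^2 - (K_d + E_T + L_T)\,[EL] + E_T L_T \\
[EL] &= \frac{(K_d + E_T + L_T) \pm \sqrt{(K_d + E_T + L_T)^2 - 4 E_T L_T}}{2}
\end{align*}
```

Only the root with the minus sign is physically meaningful. For $K_d > 0$ the root with the plus sign exceeds $(E_T + L_T)/2$, and therefore exceeds the smaller of the two total concentrations, which the complex concentration cannot do.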

3.1.1. Create Standalone Model

1. Open the MATLAB editor by typing edit in the command line. The Editor window will appear.

2. Add the code shown in Listing 1 and save the file as eq_binding.m in the C:/Mgfit/Models folder.

Listing 1

MATLAB function simulating binding equilibrium

function signal = eq_binding(Et, Lt, Kd)
% simulate equilibrium binding E + L <=> EL
EL = (Kd + Et + Lt - sqrt((Kd + Et + Lt).^2 - 4 * Et .* Lt)) / 2;
