Differential expression analysis for sequence
count data
Wolfgang Huber
EMBL Heidelberg
wolfgang.huber@embl.de
High-throughput DNA sequencing is a powerful and versatile new technology for ob-
taining comprehensive and quantitative data about RNA expression (RNA-Seq), protein-
DNA binding (ChIP-Seq), and genetic variations between individuals. It addresses es-
sentially all of the use cases that microarrays were applied to in the past, but produces
more detailed and more comprehensive results.
One of the basic statistical tasks is inference (testing, regression) on discrete count
values (e.g., representing the number of times a certain type of mRNA was sampled by
the sequencing machine). Challenges are posed by a large dynamic range, heteroskedas-
ticity and small numbers of replicates. Hence, model-based approaches are needed to
achieve statistical power.
I will present an error model that uses the negative binomial distribution, with vari-
ance and mean linked by local regression, to model the null distribution of the count
data. The method controls type-I error and provides good detection power. I will also
discuss how to use the GLM framework to detect alternative transcript isoform usage. A
free open-source R software package, DESeq, is available from the Bioconductor project.
* joint work with Simon Anders
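Illustration (not part of the abstract): the following base-R sketch mimics the core modeling idea in a heavily simplified form, fitting a mean-variance link by lowess (standing in for the local regression mentioned above) and testing one count against the implied negative binomial null. DESeq's actual procedure (size-factor normalization, a conditioned test on the sum of counts) is more involved; all data and the helper null.p are illustrative.

# Simulate two replicate counts for 1000 genes and estimate a
# mean-variance link by local regression.
set.seed(1)
means <- rexp(1000, 1/100)
counts <- sapply(means, function(m) rnbinom(2, mu = m, size = 5))
m.hat <- colMeans(counts)              # per-gene mean of the replicates
v.hat <- apply(counts, 2, var)         # per-gene variance
link  <- lowess(m.hat, v.hat)          # variance as a function of the mean
var.of.mean <- approxfun(link$x, link$y, rule = 2)

# Two-sided p-value of an observed count under the NB null whose
# variance is taken from the fitted link (method-of-moments size).
null.p <- function(k, mu) {
  v <- max(var.of.mean(mu), mu * 1.001)   # NB requires variance > mean
  size <- mu^2 / (v - mu)
  2 * min(pnbinom(k, mu = mu, size = size),
          1 - pnbinom(k - 1, mu = mu, size = size))
}
null.p(250, mu = 100)   # a count far above its expected value: small p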
Min P test: a resampling based gene
region-level testing procedure for genetic
case-control studies implemented in R
Stefanie Hieke, Harald Binder, Alexandra Nieters
and Martin Schumacher
Institute of Medical Biometry and Medical Informatics,
Center of Chronic Immunodeficiency,
University Medical Center Freiburg,
Freiburg Center for Data Analysis and Modeling,
University Freiburg
hieke@imbi.uni-freiburg.de
Introduction Current technologies generate a huge number of single nucleotide poly-
morphism (SNP) genotype measurements in case-control studies. The resulting multiple
testing problem can be ameliorated by considering candidate gene regions. The minPtest
R package provides the first widely accessible implementation of a gene region-level sum-
mary for each candidate gene using the min P test.
Method The gene region-level summary, the min P test, assesses the statistical significance of the smallest trend-test p-value within each gene region and therefore considers a reduced number of tests. The min P test is a permutation-based method that can be based on several univariate tests per SNP. In permutation resampling, the observed variable (case/control status) is randomly re-assigned without replacement to a "pseudo case/control status". A test statistic is then recomputed using the pseudo data and compared to the marginal test statistic in the original data set. This procedure is repeated B times. The inference is based on the permutation distribution of the minimum of the ordered p-values from the marginal test of each SNP. The gene region-level summary is compatible with most univariate statistical tests per SNP conducted separately over multiple loci.
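As an illustration of this permutation scheme (not the minPtest interface itself), the following base-R sketch computes a region-level permutation p-value from per-SNP Cochran-Armitage trend tests via stats::prop.trend.test; the data and the region size are simulated placeholders.

set.seed(1)
n <- 200
status <- rbinom(n, 1, 0.5)                 # case/control labels
snps <- replicate(5, sample(0:2, n, TRUE))  # genotypes of one gene region

minP <- function(status, snps) {
  min(apply(snps, 2, function(g) {
    tab <- table(factor(g, levels = 0:2), status)
    prop.trend.test(tab[, "1"], rowSums(tab))$p.value  # p-trend per SNP
  }))
}

obs <- minP(status, snps)                         # observed min P
B <- 1000
perm <- replicate(B, minP(sample(status), snps))  # permute labels, recompute
mean(perm <= obs)                                 # region-level p-value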
Results Combining the p-values from tests in a permutation-based approach prevents an increase in the false-positive rate, as correlations between SNPs are automatically taken
into account. We developed an R package that brings together three different kinds
of tests that are scattered over several R packages, and automatically selects the most
appropriate one for the design at hand. The implementation in the minPtest package
integrates two different parallel computing packages, thus optimally leveraging available
resources for speedy results. The package comprises a function to simulate SNP data
with known structure, allowing the user to explore different scenarios and settings.
Conclusion The minPtest package provides a useful and feasible implementation of a gene region-level summary, using the min P test, that controls the false-positive rate while retaining high power. In addition, minPtest provides acceleration by parallel computing.
References
Chen, B.E. et al. (2006). Resampling-based multiple hypothesis testing procedures for genetic case-control association studies. Genetic Epidemiology, 30, 495-507.
R Development Core Team (2010). R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0. URL: http://www.R-project.org.
Westfall, P.H. et al. (2002). Multiple tests for genetic effects in association studies. Methods Mol Biol, 184, 143-168.
Westfall, P.H. and Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley, New York.
Survival models with preclustered gene
groups as covariates
K. Kammers, M. Lang and J. Rahnenführer
Departments of Statistics,
TU Dortmund University
lang@statistik.tu-dortmund.de
An important application of high dimensional gene expression measurements is the
prediction of survival times and the interpretation of the variables in the resulting regres-
sion models. When the response variables are censored survival times, an appropriate
hazard framework is required. The largest problem in this context is the typically large
number of genes compared to the number of observations (individuals). We thus apply
feature selection procedures to construct predictive models for future patients. This ap-
proach aims at identifying models with high prediction accuracy and at the same time
low model complexity. However, interpretability of the resulting models is still limited
due to little knowledge on many of the remaining selected genes. In order to improve
the interpretability of the estimated models, we summarize genes as gene groups defined
by the hierarchically structured Gene Ontology (GO) and include these gene groups
as covariates in the hazard regression models. However, the expression profiles within GO groups are often heterogeneous, with several distinct expression profiles present in one group. Preclustering genes within GO groups according to the correlation of their gene expression measurements leads to homogeneous subclasses. This allows the aggregation of each subclass into a single covariate with predictive importance as well as, as a result of the GO annotations, additional interpretability. Besides the genomic data, we include clinical information to reveal the real benefit of the preclustered genomic
models. To evaluate the prediction performance of the models, we examine both Brier
scores and p-values derived from the prognostic index in a nested cross-validation setup.
Survival models with preclustered gene groups as covariates have similar prediction ac-
curacy to models built only with single genes. Using only gene groups as covariates can
lead to decreased prediction accuracy since many genes are not yet annotated to any
corresponding function. However, integrating the preclustering information improves
the interpretability of the models while prediction performance remains stable.
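A minimal base-R sketch of the preclustering step described above (the gene names, the cutoff and the GO group membership are hypothetical): genes within one GO group are clustered by correlation distance and each subclass is aggregated into a single covariate.

set.seed(2)
expr <- matrix(rnorm(50 * 20), nrow = 50,
               dimnames = list(paste0("gene", 1:50), NULL))
go.group <- paste0("gene", 1:12)     # genes annotated to one GO term

d  <- as.dist(1 - cor(t(expr[go.group, ])))   # correlation distance
hc <- hclust(d, method = "average")
subclass <- cutree(hc, h = 0.5)               # homogeneous subclasses

# One covariate per subclass: the mean expression profile of its genes
covariates <- t(sapply(split(go.group, subclass),
                       function(g) colMeans(expr[g, , drop = FALSE])))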
Evaluation and validation of gene
expression signatures for prognostic use in
node negative breast cancer patients
Aslihan Gerhold-Ay, Anja Victor and Marcus Schmidt
Institute of Medical Biostatistics, Epidemiology and Informatics,
University Medical Center of the Johannes Gutenberg University Mainz,
Merck KGaA, Darmstadt,
Department of Obstetrics and Gynaecology,
University Medical Center of the Johannes Gutenberg University Mainz
aslihan.gerhold-ay@unimedizin-mainz.de
Introduction The most widely used treatment guidelines for breast cancer are based
on classical risk factors like the St. Gallen classification. The guidelines recommend
adjuvant systemic therapy for almost all breast cancer patients because this therapy
has greatly improved survival in early breast cancer. However, adjuvant therapy also has considerable negative effects on quality of life. For this reason there is a need to specify an individual risk profile for each patient, to avoid over- as well as undertreatment. To obtain useful risk profiles, different predictors based on patients' gene expression have been developed for breast cancer (1; 2; 3; 4; 5). Furthermore, two gene expression predictors are currently being tested in prospective clinical trials (6; 7). The aim of our project is the evaluation and validation of these well-known signatures on the Mainz cohort.
Methods The Mainz cohort consists of 199 node-negative breast cancer patients treated between 1989 and 1998 at the Department of Obstetrics and Gynaecology, Medical Center of the Johannes Gutenberg University Mainz. All patients were treated with surgery and did not receive any systemic therapy. The data collected comprise the classical risk factors and, in addition, gene expression data from the Affymetrix HG-U133A chip (8). To analyse the effect of a signature on survival, we apply the log-rank test and uni- and multivariate Cox regression. ROC curves, with distant metastasis within 5 years as the defined endpoint, were used to describe the quality of the signatures' classification into low- and high-risk groups. Cluster analyses were performed to identify
the intrinsic subtypes of breast cancer. Simulations are currently being run to analyse the stability of the intrinsic subtype signature (3; 4; 5), which is based on previously reported molecular subtypes of breast cancer. Furthermore, approaches were identified to develop a new tumor grade signature based on gene expression data. About half of all breast cancers are assigned histological grade 1 or 3. The other breast tumors are classified as histological grade 2, which is not informative for clinical decision making because of the intermediate risk of recurrence. To increase the prognostic value of tumor grade 2, new methods are necessary to classify these tumors as tumor grade 1 or tumor grade 3.
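The survival analyses mentioned above could be carried out along the following lines with the standard survival package; the data frame and variable names below are made-up stand-ins for the Mainz cohort data.

library(survival)
set.seed(7)
d <- data.frame(time   = rexp(199, 0.05),
                status = rbinom(199, 1, 0.4),
                risk.group = factor(sample(c("low", "high"), 199, TRUE)),
                age    = rnorm(199, 58, 10),
                grade  = factor(sample(1:3, 199, TRUE)))

survdiff(Surv(time, status) ~ risk.group, data = d)   # log-rank test
coxph(Surv(time, status) ~ risk.group, data = d)      # univariate Cox
coxph(Surv(time, status) ~ risk.group + age + grade,  # multivariate Cox
      data = d)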
Results The Mainz cohort is similar to the populations used for developing the gene signatures with respect to classical risk factors. Not all of the published prognostic values of the gene signatures could be validated on the Mainz cohort.
Discussion Gene signatures can provide a powerful tool for the identification of patients with a high risk of recurrence. Many potential sources of bias (dye bias, sampling bias, time lag bias and publication bias (9; 10)) can make the translation of the methods into practice difficult. Based on our results we recommend prospective studies to test the validity of the signatures.
References
[1] Y Wang, J G M Klijn, Y Zhang, A M Sieuwerts, M P Look, F Yang, D Talantov, M
Timmermans, M E Meijer-van Gelder, J Yu, T Jatkoe, E M J J Berns, D Atkins, J A
Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative
primary breast cancer. Lancet 2005; 365:671–79.
[2] C Sotiriou, P Wirapati, S Loi, A Harris, S Fox, J Smeds, H Nordgren, P Farmer, V Praz, B Haibe-Kains, C Desmedt, D Larsimont, F Cardoso, H Peterse, D Nuyten, M Buyse, M J Van de Vijver, J Bergh, M Piccart, M Delorenzi. Gene-expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J Natl Cancer Inst. 2006; 98:262-72.
[3] C M Perou, T Sørlie, M B Eisen, M van de Rijn, S Jeffrey, C A Rees, J R Pollack, D T Ross, H Johnsen, L A Akslen, O Fluge, A Pergamenschikov, C Williams, S X Zhu, P E Lønning, A L Børresen-Dale, P O Brown, D Botstein. Molecular portraits of human breast tumours. Nature 2000; 406(6797):747-52.
[4] Z Hu, C Fan, D S Oh, J S Marron, X He, B F Qaqish, C Livasy, L A Carey, E Reynolds, L Dressler, A Nobel, J Parker, M G Ewend, L R Sawyer, J Wu, Y Liu, R Nanda, M Tretiakova, A Ruiz Orrico, D Dreher, J P Palazzo, L Perreard, E Nelson, M Mone, H Hansen, M Mullins, J F Quackenbush, M J Ellis, O I Olopade, P S Bernard, C M Perou. The Molecular Portraits of Breast Tumors Are Conserved Across Microarray Platforms. BMC Genomics 2006; 7:96.
[5] M Smid, Y Wang, Y Zhang, A M Sieuwerts, J Yu, J G M Klijn, J A Foekens, J W M Martens. Subtypes of Breast Cancer Show Preferential Site of Relapse. Cancer Res. 2008; 68(9):3108-14.
[6] S Paik, S Shak, G Tang, C Kim, J Baker, M Cronin, F L Baehner, M G Walker, D Watson, T Park, W Hiller, E R Fisher, D Wickerham, J Bryant, N Wolmark. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004; 351(27):2817-2826.
[7] M J van de Vijver, Y D He, L J van't Veer, H Dai, A A M Hart, D W Voskuil, G J Schreiber, J L Peterse, C Roberts, M J Marton, M Parrish, D Atsma, A Witteveen, A Glas, L Delahaye, T van der Velde, H Bartelink, S Rodenhuis, E T Rutgers, S H Friend, R Bernards. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002; 347(25):1999-2009.
[8] M Schmidt, D Böhm, C von Törne, E Steiner, A Puhl, H Pilch, H Lehr, J G Hengstler, H Kölbl, M Gehrmann. The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer. Cancer Research 2008; 68:5405-5413.
[9] K K Dobbin, E S Kawasaki, D W Petersen, R M Simon. Characterizing dye bias in microarray experiments. Bioinformatics 2005; 21(10):2430-7.
[10] J P Ioannidis, E E Ntzani, T A Trikalinos, D G Contopoulos-Ioannidis. Replication
validity of genetic association studies. Nat Genet. 2001; 29(3):306-9.
Differential Expression Analysis and Cluster
Method for Time Course Microarray Data
Khalid A. Abnaof and Holger Fröhlich
Bonn-Aachen International Center for Information Technology (B-IT),
Bonn University
Institute of Molecular Biotechnology
RWTH Aachen University
abnaof@bit.uni-bonn.de
Understanding the mechanisms by which transcription factors (TFs) dynamically regulate genes and other transcription factors in multicellular organisms is an important and interesting task in molecular biology. However, this task is not easy to tackle, as the dynamic process underlying this regulatory system is very complex. Here, we were particularly interested in TF-target gene networks (transcriptional programs) in multipotent progenitor (MPP) and common dendritic progenitor (CDP) cells in mice, which are dependent on TGF-β stimulation.
The data used in this study consisted of time series microarray data at six time points.
We applied a Bayesian approach to determine differentially expressed time series between stimulated and unstimulated cells and between cell types (1). A logistic regression model was used to perform an analysis of differential expression at the pathway level (2). Afterwards, the time courses for each cell type were grouped into clusters using an EM-based approach that describes the mean curves within a cluster via smoothing spline models (3). This approach, unlike conventional clustering methods, considers the dynamics of expression changes. Further analysis of enriched transcription factor binding sites (TFBS) within clusters allowed a partial reconstruction of gene regulatory modules. Hypotheses on the dependencies between these gene regulatory modules may be derived in the future via Dynamic Bayesian Networks (DBNs). A meta-analysis of TFBS enriched in MPPs, but not in CDPs, points towards gene regulatory mechanisms that drive stem cell development in dendritic progenitor cells.
References
Martin J. Aryee, José A. Gutiérrez-Pabello, Igor Kramnik, Tapabrata Maiti and John Quackenbush (2009): An improved empirical Bayes approach to estimating differential gene expression in microarray time-course data. BMC Bioinformatics, 10:409.
Montaner D, Dopazo J (2010): Multidimensional Gene Set Analysis of Genomic Data. PLoS ONE 5(4): e10348. doi:10.1371/journal.pone.0010348.
Ping Ma, Cristian I. Castillo-Davis, Wenxuan Zhong and Jun S. Liu (2006): A data-driven clustering method for time course gene expression data. Nucleic Acids Research, 34(4):1261-1269.
Phenotype Microarray Data: organisation
and analysis of respiration curves
Lea A. I. Vaas, Johannes Sikorski and Markus Göker
DSMZ German Collection of Microorganisms and Cell Cultures GmbH,
Braunschweig
Lea.Vaas@dsmz.de
Recently, the set of techniques generating so-called omics data was augmented by yet another one: Phenotype Microarrays (PM). In contrast to the existing major technologies, i.e. DNA microarrays, 2D proteomics and chromatographic applications, PM monitors cell respiration over time. Through a redox reaction that alters the colour of a tetrazolium dye in the presence of respiration, kinetic response curves are generated. This provides a high-throughput means to characterize microbial metabolism. The system comprises about 2000 assays for monitoring the cells' respiration in the presence of macro- and micronutrients, or their reactions to osmotic stress factors and ion or pH effects. The application of a number of chemicals, such as antibiotics, antimetabolites, membrane-active agents, respiratory inhibitors and toxic metals, to investigate the cells' sensitivity is also possible. Besides applications in identification and drug screening
scenarios, where mainly presence-absence calls from each assay are of interest, many research projects call for more sophisticated comparisons of the phenotypes of different strains, isolates, mutants, etc. The in-depth evaluation of redox kinetics should yield knowledge about the metabolic differences and provide indications of the genetic features of the investigated organisms on which these differences are based. The main steps in such analyses are (1) data organisation and graphical presentation of the curves, (2) application of methods for the reliable estimation of the growth parameters of each curve, and (3) extraction and comparison of such growth curve characteristics. Based on a summary of the available statistical tools covering large parts of the required analyses, an application strategy for these ready-to-use software tools for the analysis of PM data in R will be proposed. In addition to the statistical challenges this new type of high-dimensional data brings along, this talk will give an outlook on features to be provided for a more convenient data analysis pipeline, comprising the organisation of meta-data, concepts for handling the raw data, data analysis, and presentation of the results.
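As an example of step (2), a respiration curve's parameters can be estimated by fitting a parametric growth model with base R's nls(); the logistic form and the simulated curve below are illustrative choices, not the tools proposed in the talk.

set.seed(3)
hours  <- seq(0, 48, by = 0.5)
signal <- 200 / (1 + exp(-0.3 * (hours - 20))) +
          rnorm(length(hours), sd = 5)       # simulated dye signal

fit <- nls(signal ~ A / (1 + exp(-mu * (hours - lambda))),
           start = list(A = max(signal), mu = 0.1, lambda = 15))
coef(fit)  # A: maximum signal, mu: steepness, lambda: curve midpoint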
Analysing structured data: symbolic and
probabilistic approaches
Luc De Raedt
Katholieke Universiteit Leuven,
luc.deraedt@cs.kuleuven.be
Structured data in the form of graphs, networks or relational databases are om-
nipresent across numerous application-areas such as biology, chemistry, the internet,
social or bibliographic networks, robotics, vision, etc. The machine learning and data
mining literature has devoted a lot of attention to coping with such data giving rise
to a class of techniques that is known under the name of graph and network mining
or relational learning. In this talk, an introduction will be given to this class of tech-
niques, which often extend or upgrade more traditional techniques for dealing with ”flat”
data (that is, data in feature vector format) towards graph-based and relational data.
This talk will introduce and motivate the problems and techniques of relational and
graph-based learning and look into both symbolic and probabilistic methods. Symbolic
methods attempt to identify patterns in the form of subgraphs in graph data (e.g., in molecular datasets), while probabilistic methods extend graphical models (such as Bayesian and Markov networks) to deal with relational data. The approaches will be illustrated and motivated by several real-life examples.
Association of complex human pain
phenotypes with complex pain genotypes
using a self-organizing maps approach
Jörn Lötsch and Alfred Ultsch
pharmazentrum frankfurt/ZAFES, Institute of Clinical Pharmacology,
Johann Wolfgang Goethe University Hospital,
Data Bionics Research Group,
University of Marburg
ultsch@mathematik.uni-marburg.de
BACKGROUND Pain is a complex trait. While clinical pain syndromes can already
be diagnosed by a set of neurological parameters, the complexity of experimental pain
is only incompletely accounted for, which often impedes associations of pain data with
clinical or genetic parameters.
METHODS Pain phenotype markers (n = 8) and genotype markers (n = 30) were
available from previous assessments in 125 healthy volunteers. A U-Matrix on an Emergent Self-Organizing Map (ESOM) was used for visualization of the distance structures
in the data. Subsequently, the prediction of the clusters by the genetic markers was
assessed using a classification and regression tree (CART) approach.
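The CART step could look as follows with the rpart package; all data here are simulated stand-ins for the 125 volunteers, the phenotype-derived clusters and the 30 genotype markers.

library(rpart)
set.seed(5)
n <- 125
geno    <- as.data.frame(replicate(30, sample(0:2, n, TRUE)))
cluster <- factor(sample(paste0("C", 1:8), n, TRUE))   # ESOM clusters

tree <- rpart(cluster ~ ., data = cbind(cluster, geno), method = "class")
pred <- predict(tree, type = "class")
mean(pred == cluster)   # resubstitution accuracy (optimistic estimate)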
RESULTS On the U-Matrix of the pain phenotypes, eight clusters were identified. This clustering showed advantages over a Ward clustering of the same data. Rules could be derived to describe the cluster contents, which corresponded to three basic types of pain thresholds: low, mean and high sensitivity. Within the mean- and low-sensitivity (stoical) phenotypes, subgroups could be identified. For a cluster consisting of persons with a high overall pain threshold but selectively low resistance to heat, the predictive accuracy of the classifiers was 84.56%. Among the genetic variants used for the CART decision in that cluster were polymorphisms in TRPV1, a gene coding for a heat sensor.
CONCLUSIONS ESOM-based clustering of pain data provides biologically meaningful results that do justice to the complexity of pain. The clusters thus obtained seem to facilitate the otherwise only insufficiently successful genotype-phenotype association in common pain.
Spectral graph features for the classification
of graphs and graph sequences
Miriam Schmidt, Günther Palm and Friedhelm Schwenker
Institute of Neural Information Processing,
University of Ulm
{miriam.k.schmidt,friedhelm.schwenker}@uni-ulm.de
Spectral graph theory is an important branch in the area of graph classification. Matrices associated with graphs, such as the adjacency matrix or the Laplacian matrix, contain essential information about graph connectivity (Cvetković et al., 1998). In this study, the power of the principal eigenvalues of adjacency matrices for classification tasks is investigated.
In order to illustrate the proposed method, a toy problem to classify 2D-objects has
been defined. The goal was to discriminate between two classes: circles and squares.
An object is represented by a set of 2D points, describing the object’s outer shape.
These points are considered as the graph’s nodes, and the graph’s adjacency matrix is
defined through the pairwise Euclidean distances of the points. From this adjacency
matrix principal eigenvalues are computed as the object’s characteristic features. This
method is then evaluated on a problem of optical character recognition. For this, the
capital letters data set from the Bern repository of graph data sets (see Riesen and Bunke, 2008) has been selected. Additionally, this method has been applied to the problem of human
activity recognition based on sequences of camera images. In this task, hidden Markov models capture the sequential structure of the data. In the first step, the locations of
the person’s body parts (hand, head, etc.) and objects (table, cup, etc.), relevant for
the human activity, have to be estimated in each camera image. Subsequently distances
between all pairs of detected objects and body parts are computed and the eigenvalues
of this Euclidean distance matrix are calculated. These eigenvalues serve as inputs to
Gaussian mixture models estimating the emission probabilities of the hidden Markov
models.
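A minimal base-R sketch of the proposed feature extraction: the adjacency matrix is built from the pairwise Euclidean distances of the shape points, and its leading eigenvalues serve as features (the point sets below are toy shapes).

spectral.features <- function(points, k = 4) {
  A  <- as.matrix(dist(points))   # distance-based adjacency matrix
  ev <- eigen(A, symmetric = TRUE, only.values = TRUE)$values
  ev[seq_len(k)]                  # k principal eigenvalues as features
}

theta  <- seq(0, 2 * pi, length.out = 21)[-21]
circle <- cbind(cos(theta), sin(theta))        # 20 points on a circle
square <- cbind(c(0, 1, 1, 0), c(0, 0, 1, 1))  # 4 corner points
spectral.features(circle)
spectral.features(square)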
References
Cvetković, D.M., Doob, M., Sachs, H.: Spectra of Graphs: Theory and Applications. VCH Verlagsgesellschaft (1998)
Riesen, K., Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recog-
nition and Machine Learning. Structural, Syntactic, and Statistical Pattern Recognition
(LNCS 5342), 287–297 (2008)
Tuning distance measures in k-nearest
neighbor classification via evolution
strategies
Alexander Melkozerov, Ludwig Lausser and Hans A. Kestler
Institute of Neural Information Processing,
University of Ulm
{alexander.melkozerov,ludwig.lausser,hans.kestler}@uni-ulm.de
Molecular high-throughput technologies usually generate data of high dimensionality and low cardinality. In the context of classification this poses a serious problem, as many classical (model-based) methods turn out to be too complex for this task. This has stimulated the development of new classifiers of lower complexity, which attain higher generalization performance through additional regularization terms and more rigid model assumptions.
Some classification methods completely omit the usage of model assumptions. For example, transductive k-nearest neighbor (k-NN) classifiers directly predict the class label of a data point x according to the k training examples closest to x. The performance of this technique is always coupled to the chosen distance measure, which is often, due to the lack of a better choice, the Euclidean one. Other, non-standard distance measures may be more suitable for this task. Here, we investigate the usability of optimized distance measures of the type
$d_{\vec{w}}(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$
for k-NN classification. The weights $\vec{w}$ are optimized to minimize the empirical risk of the current classifier. Two types of evolution strategies (ES) were utilized: the standard self-adaptive ES with intermediate recombination (the so-called $(\mu/\mu_I, \lambda)$-$\sigma$SA-ES) and the covariance matrix adaptation ES (the $(\mu_W, \lambda)$-CMA-ES). While the former is a simple ES which provides baseline performance and serves as a reference in our comparison, the $(\mu_W, \lambda)$-CMA-ES is a state-of-the-art algorithm for continuous optimization that has shown very good performance in recent experimental benchmarks. Observing that the fitness function under consideration is multi-modal, the following restart strategy was used for the $(\mu/\mu_I, \lambda)$-$\sigma$SA-ES: if no improvement of the best fitness function value occurs for 300 generations or the mutation strength becomes too small, the ES restarts the search from a random point.
The advanced $(\mu_W, \lambda)$-CMA-ES uses the following restart criteria in addition to the check for improvement of the best fitness function value:
- the standard deviation $\sigma^{(g)}$ of the normal distribution used to sample new points or the evolution path is smaller than a given value;
- numerical precision problems: the mean $\langle y \rangle^{(g)}$ of newly sampled points does not change when adding to $\langle y \rangle^{(g)}$ a $0.1\sigma^{(g)}$-vector in a principal axis direction of the covariance matrix, or $0.2\sigma^{(g)}$ to each coordinate of $\langle y \rangle^{(g)}$;
- the condition number of the covariance matrix is too large.
After each restart, the $(\mu_W, \lambda)$-CMA-ES runs from a random point with the population size increased by a factor of 2.
The performance of these modified k-NN techniques is investigated in a comparative study on different microarray datasets. The new classifiers are compared to other well-known classification techniques with respect to their generalization ability, robustness and sparsity.
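For illustration, the weighted distance $d_{\vec{w}}$ and the empirical risk minimized by the ES can be sketched in a few lines of base R (the ES loop itself is omitted; the function names are ours):

d.w <- function(y, x, w) sqrt(sum(w * (x - y)^2))   # weighted distance

knn.weighted <- function(train, labels, test, w, k = 3) {
  apply(test, 1, function(x) {
    d <- apply(train, 1, d.w, x = x, w = w)
    names(which.max(table(labels[order(d)[1:k]])))  # majority vote
  })
}

empirical.risk <- function(w, train, labels, k = 3) {
  # leave-one-out error of the weighted k-NN on the training set
  pred <- sapply(seq_len(nrow(train)), function(i)
    knn.weighted(train[-i, , drop = FALSE], labels[-i],
                 train[i, , drop = FALSE], w, k))
  mean(pred != labels)
}

The weight vector would then be the search variable of the chosen evolution strategy, with empirical.risk as the fitness function.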
optile: Optimizing k-dimensional graphical
classification analysis via category
reordering
Alexander Pilhöfer and Alexander Gribov
Department of Computer Oriented Statistics and Data Analysis,
Institute of Mathematics,
University of Augsburg
alexander.pilhoefer@math.uni-augsburg.de
In cluster analysis it is good practice to regard several different clustering methods
instead of focusing on only one type. The interest lies in the agreements as well as
the differences between the clusterings. A good way to interpret the results is to use
visualizations such as fluctuation diagrams or categorical parallel coordinate plots (Pilhoefer and Unwin, 2011). Clustering classifications usually have a nominal categorical structure, and therefore the category orders can be changed to improve the displays and make their interpretation easier. Different seriation methods (see Chen et al., 2008)
have been proposed to improve the displays for 2-dimensional problems, mostly using
one-mode optimizations like the Anti-Robinson-Criterion in distance matrices which do
not directly account for the associations between the variables. The talk will present a
family of criteria and related optimization algorithms which can be used to choose the
category orders for 2- and k-dimensional categorical classification data with respect to
their multidimensional associations using the concept of agreement in a pseudo-diagonal
form. An effective optimization algorithm will be presented and the applicability to
both table-like plots such as fluctuation diagrams and line-based plots such as parallel
coordinates plots will be discussed. A special modification of the algorithm which takes
account of hierarchical classification structures will be presented using implementations
in the software package Seurat (Gribov, 2010). The talk will use real data clustering
results from the US Current Population Survey (CPS) for illustration.
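As a toy illustration of the underlying idea (a greedy stand-in, not the talk's criteria family or algorithm): reorder the categories of one clustering so that its cross-tabulation with a second clustering concentrates agreement on the diagonal.

set.seed(12)
c1  <- sample(letters[1:4], 300, TRUE)   # two 4-class clusterings
c2  <- sample(LETTERS[1:4], 300, TRUE)
tab <- table(c1, c2)

diag.mass <- function(tab) sum(diag(tab))  # agreement on the diagonal

best <- tab; improved <- TRUE
while (improved) {                         # greedy pairwise row swaps
  improved <- FALSE
  for (i in 1:(nrow(best) - 1)) for (j in (i + 1):nrow(best)) {
    perm <- replace(seq_len(nrow(best)), c(i, j), c(j, i))
    cand <- best[perm, ]
    if (diag.mass(cand) > diag.mass(best)) { best <- cand; improved <- TRUE }
  }
}
best   # category order with locally maximal diagonal agreement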
References
Chen, C.-h., W. Haerdle, A. Unwin, H.-M. Wu, S. Tzeng, and C.-h. Chen (2008). Matrix visualization. In Handbook of Data Visualization, Springer Handbooks of Computational Statistics, pp. 681-708. Springer, Berlin Heidelberg.
Gribov, A. (2010, June). Seurat - visual analytics for the integrated analysis of microarray data. http://seurat.r-forge.r-project.org/.
Hofmann, H. (2000). Exploring categorical data: Interactive mosaic plots. Metrika 51(1), 11-26.
Pilhoefer, A. and A. Unwin (2011). Multiple barcharts for relative frequencies and parallel coordinates plots for categorical data - package extracat. Journal of Statistical Software, submitted.
Order constrained clustering for music
structure analysis
Sebastian Krey, Uwe Ligges and Friedrich Leisch
Fakultät Statistik,
Technische Universität Dortmund,
Universität für Bodenkultur Wien
krey@statistik.tu-dortmund.de
In music structure analysis, unsupervised machine learning methods are desirable to get a first overview and to segment a piece into parts that may be relevant for further analyses. Traditional unconstrained clustering methods may yield unstable and uninterpretable results, particularly when used on sound features derived from recordings of real music. Therefore, it is helpful to constrain the possible solutions in a way that suppresses frequently alternating cluster assignments.
One intuitive constraint is the temporal order of the recording which implies that a
sensible cluster consists of connected time segments. Steinley and Hubert [1] describe a
method to introduce an order constraint in clustering. Using an R implementation [2,3]
of this idea, we get promising results. Due to the exponential runtime in the number
of clusters, we propose a recursive tree-based approach of order constrained clustering
with small cluster numbers (e.g. 2).
This way, on a piece of popular music, it is possible to obtain clusters which represent musical parts like intro, verse, refrain and bridge. This is promising in the sense that it is typically more beneficial to train a music recognition system with a characteristic part of a song than with artificially chosen segments.
In addition, it is possible to segment separate tones into attack, sustain, decay and silence, without splitting any of these tone phases into several clusters.
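One recursive step of such an approach can be sketched as an order-constrained two-way split: choose the change point that minimizes the total within-segment sum of squares, so that both clusters remain connected time segments. This base-R toy is our simplification for illustration, not the Steinley-Hubert algorithm itself.

best.split <- function(x) {
  # x: T x p matrix of frame-wise sound features in temporal order
  wss  <- function(m) sum(scale(m, scale = FALSE)^2)  # within-segment SS
  cost <- sapply(2:(nrow(x) - 1), function(t)
    wss(x[1:t, , drop = FALSE]) + wss(x[(t + 1):nrow(x), , drop = FALSE]))
  which.min(cost) + 1   # last frame of the first segment
}

set.seed(2)
x <- rbind(matrix(rnorm(50 * 3, mean = 0), ncol = 3),
           matrix(rnorm(70 * 3, mean = 4), ncol = 3))
best.split(x)   # detects the change point near frame 50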
References
[1] Douglas Steinley, Lawrence Hubert (2008). "Order-constrained solutions in k-means clustering: Even better than being globally optimal", Psychometrika, Vol. 73, No. 5, pp. 647-664.
[2] Sebastian Hoffmeister (2009). "Partitionierende Clusterverfahren unter Ordnungs-Nebenbedingungen" [Partitioning clustering methods under order constraints], Diploma thesis, Institut für Statistik, Ludwig-Maximilians-Universität München.
[3] Friedrich Leisch (2006). "A Toolbox for K-Centroids Cluster Analysis", Computational Statistics and Data Analysis, Vol. 51, No. 2, pp. 526-544.
Inference of Boolean networks by fuzzy sets
Martin Hopfensitz, Markus Maucher and Hans A. Kestler
Internal Medicine I,
University Hospital Ulm,
Institute of Neural Information Processing,
Ulm University
{martin.hopfensitz, markus.maucher, hans.kestler}@uni-ulm.de
Molecular systems biology usually refers to integrated experimental and computational
approaches for studying biomolecular networks, such as signal transduction, gene regula-
tion or metabolic systems. At the core of systems biology research lies the identification
of gene-regulatory networks from experimental data via reverse-engineering methods.
Network inference algorithms can assist life scientists in unraveling gene-regulatory sys-
tems on a molecular level. In this context, Boolean networks (Kauffman, 1969) provide a well founded framework for the reverse-engineering and analysis of gene-regulatory networks (Hickman et al., 2009). In a Boolean network, a gene is modeled as a Boolean variable that can attain two alternative levels: expressed (1) or not expressed (0). In spite of this restriction, the behaviour of real genetic networks can be described well by this "coarse-grained" model (Bornholdt, 2005). To infer a Boolean network solely from quantitative time series data, the continuous data have to be binarized. This binarization is often unreliable, since noise in gene expression data and the low number of temporal measurement points frequently lead to uncertain binarized values. We developed a novel reverse-engineering method based on Boolean networks that incorporates this uncertainty in the binarized data into the inference process.
First, we binarize the data with the fuzzy 2-means algorithm in order to obtain a binarization and a membership coefficient $p_{ij}$ for each Boolean value, indicating the reliability of the binarization. Based on the fuzzy model, multiple binarized time series are sampled via randomized rounding, using the coefficients as probabilities of membership.
For each of these binarizations and each of the genes in the network, we infer possible
dependencies by scoring all combinations of input genes. The scoring of an input gene
combination is based on the error of the best Boolean function and on the number of
involved genes. An accumulated score for each input gene combination over all binariza-
tions is calculated, and the dependencies are modeled by the best-ranked combinations.
By incorporating uncertainty into the reverse-engineering process, we improve the accuracy in terms of state transitions and network wiring. For validation, our new approach was applied to artificial data and to yeast expression time series data.
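The sampling step can be sketched in base R as follows; the membership matrix below is made up, while in the actual method the coefficients would come from fuzzy 2-means.

set.seed(4)
p <- matrix(runif(5 * 10), nrow = 5)   # genes x time points memberships

sample.binarization <- function(p)     # randomized rounding
  matrix(rbinom(length(p), 1, p), nrow = nrow(p))

binarizations <- replicate(50, sample.binarization(p))
# binarizations[,,b] is the b-th sampled binary time series ensemble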
References
KAUFFMAN, S. A. (1969): Metabolic Stability and Epigenesis in Randomly Constructed Genetic Nets. Journal of Theoretical Biology, 22(3):437-467.
HICKMAN, G. J. and HODGMAN, T. C. (2009): Inference of gene regulatory networks using Boolean-network inference methods. Journal of Bioinformatics and Computational Biology, 7(6):1013-29.
BORNHOLDT, S. (2005): Systems Biology: Less is More in Modeling Large Genetic Networks. Science, 310(5747):449-451.
Inferring Boolean network structure via
correlation
M. Maucher, B. Kracher, M. Kühl, H. A. Kestler
Institute of Neural Information Processing,
Institute for Biochemistry and Molecular Biology,
University of Ulm
{markus.maucher,hans.kestler}@uni-ulm.de
The dynamic behavior of genetic regulatory networks can be described and analyzed
using Boolean network models. The reconstruction of such a Boolean network from time
series data requires the identification of dependencies within the network. To identify the dependency structure of such a network, one can take advantage of the fact that in a gene regulatory network a specific transcription factor will often consistently either activate or inhibit a specific target gene. In this case, the observed regulatory behavior can be modeled by the use of monotone functions.
We show that Pearson correlation can identify the dependencies in a Boolean network from time series data if that network consists of monotone Boolean functions. This approach enables fast inference of Boolean networks based on an intuitive correlation measure. In experiments, we could reconstruct large fractions of a published E. coli transcriptional regulatory network and a metabolic network from simulated data, and of a yeast cell cycle network from microarray data.
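A minimal base-R sketch of the scoring rule: a candidate edge from gene j to gene i is scored by the Pearson correlation between x_j(t) and x_i(t+1); a strong absolute correlation suggests a monotone (activating or inhibiting) dependency. The two-gene time series is simulated.

set.seed(6)
len <- 100
x1  <- rbinom(len, 1, 0.5)
x2  <- c(0, x1[-len])        # x2(t+1) copies x1(t): a true edge 1 -> 2
X   <- rbind(x1, x2)

edge.scores <- function(X) {
  # entry [j, i]: correlation of regulator j at t with target i at t+1
  cor(t(X[, -ncol(X), drop = FALSE]), t(X[, -1, drop = FALSE]))
}
round(edge.scores(X), 2)     # entry [1, 2] is close to 1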
Meta-analysis Methods for Gene Expression
Profiles
Berthold Lausen
Department of Mathematical Sciences
University of Essex
blausen@essex.ac.uk
A fast increasing amount of publicly available gene expression data sets allows the use of meta-analysis techniques to validate and to identify molecular signatures. I review several recent approaches. An important condition for preprocessing methods is that the data analysis workflow should not be influenced by properties of other data sets included in the meta-analysis. For example, the preprocessing of one Affymetrix CEL file should be invariant under different sets of CEL files included in the meta-analysis. I illustrate the talk with gene expression data sets of colorectal cancer.
References
BUFFA, F.M., HARRIS, A.L., WEST, C.M., MILLER, C.J. (2010): Large Meta-analysis of Multiple Cancers Reveals a Common, Compact and Highly Prognostic Hypoxia Metagene. British Journal of Cancer, 102, 428-35.
CRONER, R., FÖRTSCH, T., BRÜCKL, W., RÖDEL, F., et al. (2008): Molecular Signature for Lymphatic Metastasis in Colorectal Carcinomas. Annals of Surgery 247, 803-810.
GORLOV, I.P., SIRCAR, K., ZHAO, H., et al. (2010): Prioritizing genes associated
with prostate cancer development. BMC Cancer 10:599.
MCCALL, M.N., BOLSTAD, B.M., IRIZARRY, R.A. (2010): Frozen robust multi-array
analysis (fRMA). Biostatistics 11, 2, 242–253.
MPINDI, J.P., SARA, H., HAAPA-PAANANEN, S. et al. (2011): GTI: A Novel Algorithm for Identifying Outlier Gene Expression Profiles from Integrated Microarray Datasets. PLoS ONE 6, 2, e17259.
SHI, F., ABRAHAM, G., LECKIE, C., HAVIV, I., KOWALCZYK, A. (2011): Meta-analysis of gene expression microarrays with missing replicates. BMC Bioinformatics 12, 84.
Connecting miRNA and mRNA expression
profiles for medical classification problems
Klaus Jung, Tim Beißbarth and Mathias Fuchs
Department of Medical Statistics,
University Medical Center Göttingen,
Department of Bioinformatics,
University Medical Center Göttingen
kjung1@uni-goettingen.de
In biomedical research, it is by now very common that different types of high-dimensional
molecular data are studied in parallel. In the past, many studies concentrated for ex-
ample only on gene expression, protein expression or genetic data, due to the high cost
of each technique or because techniques were just established in many research groups.
Meanwhile, techniques have become cheaper and more common, so that they can be
applied in parallel to study the same biological sample, e.g. a tumour biopsy or a cell
line. A typical question for studying molecular data is to find differences in the samples
from different biological groups, e.g. individuals with different phenotypes or different
response to a therapy. In particular, many studies aim at finding molecular signatures
that can be used for diagnosis or prediction. In this context, we currently study methods
for connecting the information from mRNA and miRNA expression data for classifica-
tion problems in medicine. Our particular questions are as follows. Should we first
merge mRNA and miRNA data and search then for a common signature? Or should
we first search individual signatures and combine then the correspond-ing classification
rules (Kittler et al., 1998)? Is a merged classifier better than an individual one? Is there
a benefit of having both data sources available? We evaluate the different approaches
within a simulation study and on several publicly avail-able data (e.g. Peng et al., 2009).
More precisely, we compare the prediction accuracies obtained with each approach. In
addition, we discuss several technical difficulties of each approach, for example a common
normalization of mRNA and miRNA data.
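The two strategies can be contrasted in a compact simulated example (a toy posterior model stands in for a real classifier; the mean combination rule follows the spirit of Kittler et al., 1998):

set.seed(8)
n <- 60
y     <- factor(rep(c("A", "B"), each = n / 2))
mrna  <- matrix(rnorm(n * 100, as.numeric(y)), nrow = n)  # mRNA features
mirna <- matrix(rnorm(n * 20,  as.numeric(y)), nrow = n)  # miRNA features

post <- function(x, y) {          # toy posterior: logistic fit on 1st PC
  pc <- prcomp(x)$x[, 1]
  predict(glm(y ~ pc, family = binomial), type = "response")
}

p.merged   <- post(cbind(mrna, mirna), y)           # merge data first
p.combined <- (post(mrna, y) + post(mirna, y)) / 2  # combine rules (mean)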
References
Kittler, J., Hatef, M., Duin, R.P.W. and Matas, J. (1998) On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 226-239.
Peng, X., Li, Y., Walters, K.A., Rosenzweig, E.R., Lederer, S.L., Aicher, L.D., Proll, S. and Katze, M.G. (2009) Computational identification of hepatitis C virus associated microRNA-mRNA regulatory modules in human livers. BMC Genomics, 10:373.
Integration of copy number variation and
gene expression data in Bayesian models for
prediction
Manuela Zucknick, Stefan Pfister and Axel Benner
Division of Biostatistics (C060),
Division Molecular Genetics (B060),
German Cancer Research Center, Heidelberg,
Department of Pediatric Oncology, Hematology and Immunology,
University Hospital Heidelberg
m.zucknick@dkfz-heidelberg.de
Bayesian variable selection models are an alternative to well-known regularisation
methods like lasso regression and boosting for prognostic modelling based on high-
dimensional input spaces. A typical application is prediction of clinical endpoints such
as therapy response using microarray gene expression data.
High-throughput microarray technologies are available for many other types of ge-
nomic data in addition to gene expression, and in recent years clinical researchers have
begun to systematically collect genome-wide data from various sources on the DNA-
and RNA-level as well as epigenetic data. If data from several sources are available for
the same set of biological samples, the data can be analysed together in an integrative
manner, with the aim of providing a more comprehensive picture of the disease biology
as well as improving the performance of clinical prediction models.
For example, the integration of copy number variation data into classical gene ex-
pression based prognostic models promises to improve both prognostic value and in-
terpretability of the model, because genomic deletions and amplifications are known to
affect expression levels of genes located in the corresponding genomic regions. In fact,
the deletion of chromosomal regions harbouring important tumour suppressor genes is
a well-known cause of certain cancers.
In contrast to methods like lasso and boosting, Bayesian variable selection models are
very flexible in their setup and are naturally well-suited to extensions allowing for the
integration of additional data sources.
We will propose a hierarchical Bayesian variable selection model, which combines
whole-genome information on copy number variation and gene expression in a manner
that is intuitive from a biological point of view. The model setup will be demonstrated,
as well as aspects of the MCMC sampling algorithm and posterior inference. The model
will be further illustrated in an application to pediatric brain tumour data.
On the utility of partially labeled data for
classification in high dimensional settings
Ludwig Lausser, Florian Schmid and Hans A. Kestler
Institute of Neural Information Processing,
University of Ulm,
Department of Internal Medicine I,
University Hospital Ulm
{ludwig.lausser,hans.kestler}@uni-ulm.de
Initial results gained by high throughput technologies such as microarrays or deep
sequencing are common starting points for investigations within molecular medicine or
biology. They allow the tracking of several thousand signals within a single experiment.
The data produced by such technologies is of usually high dimensionality but also of a
low cardinality. Many inferences regarding these data can be formulated as clustering
or classification tasks. In a clustering scenario the task is to find groups in a sample of
data points. It is an example for unsupervised learning. The sample does not contain
explicit information on the involved classes or groups; especially it does not contain class
labels. In a classification scenario the involved categories are known a priori. The task
is to predict the correct category of a unseen data point. Classification is an example
of a supervised learning task. Here the training set contains examples labeled according
the these categories.
Supervised and unsupervised learning are widely used paradigms in the analysis of microarray data. Other methodologies that bridge these paradigms are often neglected. These methods are based on partially labeled datasets and incorporate information gained from both labeled and unlabeled data. Examples of such concepts are transductive learning and semi-supervised learning. In this work we investigate the benefit of these learning schemes for the classification of microarray datasets.
We compare several supervised algorithms to their transductive (or semi-supervised) counterparts in real and artificial settings. The aim of the study is to investigate the influence of the high dimensionality on the generalization ability and the robustness of the algorithms.
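As one concrete example of such a scheme, a generic self-training wrapper (our illustration, not necessarily among the algorithms compared in the talk) turns a supervised k-NN into a semi-supervised procedure by iteratively adopting its confident predictions as labels:

library(class)   # for knn()
set.seed(9)
x   <- matrix(rnorm(100 * 2, rep(c(0, 3), each = 50)), ncol = 2)
y   <- rep(c("A", "B"), each = 50)
lab <- sample(100, 10)            # indices of the few labeled points

self.train <- function(x, y.lab, lab, k = 3, rounds = 5) {
  labeled <- lab; labels <- y.lab
  for (r in seq_len(rounds)) {
    unl <- setdiff(seq_len(nrow(x)), labeled)
    if (!length(unl)) break
    pred <- knn(x[labeled, , drop = FALSE], x[unl, , drop = FALSE],
                labels, k = k, prob = TRUE)
    conf <- attr(pred, "prob") >= 0.9   # adopt confident predictions only
    if (!any(conf)) break
    labeled <- c(labeled, unl[conf])
    labels  <- c(labels, as.character(pred[conf]))
  }
  list(index = labeled, label = labels)
}

res <- self.train(x, y[lab], lab)
mean(res$label == y[res$index])   # agreement with the hidden true labels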
Multi-objective parameter selection for
classifiers
C. M¨ussel, L. Lausser, M. Maucher and H. A. Kestler
Institute of Neural Information Processing,
University of Ulm
{christoph.muessel,ludwig.lausser,markus.maucher,hans.kestler}@uni-ulm.de
The choice of appropriate values for parameters is an essential step in classifier training
and can have a major influence on the classification performance. Often, such parameters
are set according to rules of thumb. Parameter tuning is an automated way of adapting
parameters. Most frequently, parameters are tuned according to a single criterion, such
as the cross-validation error, which can be a good estimate of the generalization per-
formance. However, it is sometimes desirable to obtain parameter values that optimize
several concurrent criteria at the same time. For example, sensitivity and specificity are
important but usually conflicting characteristics of a classifier. Dominance-based
selection procedures allow for a simultaneous optimization of multiple objectives. They
leave the ultimate decision on the desired trade-off of objectives to the human expert.
We devised the R package TunePareto for multi-objective selection of parameters for
classifiers. The software chooses candidate parameter configurations according to so-
phisticated sampling strategies and search heuristics, such as quasi-random sequences
and evolutionary algorithms. It then determines the optimal configurations using Pareto
dominance. The package provides flexible interfaces for classifiers and objective func-
tions. The decision making process is supported by various visualizations as well as the
formal definition of desired and undesired objective values.
We present a tutorial on the functionality and usage of the TunePareto package.
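The core selection rule can be illustrated in a few lines of base R (TunePareto's own interface differs; the error matrix here is invented, with both objectives to be minimized):

dominates <- function(a, b) all(a <= b) && any(a < b)

pareto.front <- function(obj) {       # obj: configurations x objectives
  keep <- sapply(seq_len(nrow(obj)), function(i)
    !any(sapply(seq_len(nrow(obj))[-i], function(j)
      dominates(obj[j, ], obj[i, ]))))
  obj[keep, , drop = FALSE]
}

obj <- cbind(fnr = c(0.10, 0.20, 0.05, 0.15),
             fpr = c(0.30, 0.10, 0.40, 0.35))
pareto.front(obj)   # configuration 4 is dominated by configuration 1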
The Daim package: Diagnostic accuracy
of classification models
Sergej Potapov, Berthold Lausen and Werner Adler
Department of Biometry and Epidemiology
University of Erlangen-Nuremberg
Department of Mathematical Sciences
University of Essex
{sergej.potapov, werner.adler}@imbe.med.uni-erlangen.de
The Daim package contains several functions for evaluating the accuracy of classification models by ROC analysis (Fawcett, 2006). It provides the following performance measures: "cv", "bcv", "0.632" and "0.632+" estimation of the misclassification rate, sensitivity, specificity and AUC (Efron & Tibshirani, 1997; Adler & Lausen, 2009).
The package provides a flexible interface to classifier functions and facilitates intuitive
evaluation of predictive models. If an application is computationally intensive, parallel
execution can be used in a simple manner to reduce the time taken.
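For instance, the AUC itself can be obtained from the rank (Wilcoxon) statistic of the classifier scores, a standard identity underlying ROC analysis in general (this snippet is our illustration, not Daim code):

auc <- function(score, truth) {   # truth: logical, TRUE = positive class
  r  <- rank(score)
  n1 <- sum(truth); n0 <- sum(!truth)
  (sum(r[truth]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(10)
truth <- rep(c(TRUE, FALSE), each = 50)
score <- rnorm(100, mean = ifelse(truth, 1, 0))
auc(score, truth)   # about 0.76 for unit-variance normals one SD apart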
References
EFRON, B. and TIBSHIRANI, R. (1997): Improvements on Cross-Validation: The .632+ Bootstrap Method. JASA, 92(438), 548-560.
FAWCETT, T. (2006): An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
ADLER, W. and LAUSEN, B. (2009): Bootstrap estimated true and false positive rates and ROC curve. Comput. Stat. Data Anal., 53(3), 718-729.
Correcting the optimally selected
resampling-based error rate: A smooth
analytical alternative to nested
cross-validation
Christoph Bernau, Thomas Augustin and Anne-Laure Boulesteix
Department of Medical Informatics, Biometry and Epidemiology (IBE),
University of Munich,
Department of Statistics,
University of Munich
bernau@ibe.med.uni-muenchen.de
Many statistical problems in bioinformatics are high-dimensional binary classification
tasks, e.g. the classification of microarray samples into normal and cancer tissues. In this
context, statistical learning methods usually incorporate a tuning parameter adjusting
their complexity to the specific examined data set. By simply reporting the performance
of the best tuning parameter value, overly optimistic prediction errors have been pub-
lished in the past (Varma and Simon, 2006). A straightforward approach to avoid this
tuning bias is nested cross-validation (CV).
In this talk we are addressing two objectives. Firstly, we develop a new method
correcting for this tuning bias by embedding the tuning problem into a decision theoretic
framework. The method is based on the decomposition of the unconditional error rate
involving the tuning procedure. Our corrected error estimator can be reformulated
as a weighted mean of resampling errors obtained using the different tuning parameter
values. In this sense, it can be interpreted as a smooth version of nested CV. The smooth
weighting additionally guarantees intuitive bounds for the corrected error. Secondly, we
suggest to also use bias correction methods to address the bias resulting from the optimal
choice of the learning method. The latter bias is particularly relevant to prediction
problems based on high-dimensional "omics" data. In the absence of standards, it is indeed common practice to apply several methods successively. This can lead to an optimistic bias similar to the tuning bias if one reports only the performance of the optimal method.
We demonstrate the performance of our new method in addressing both types of bias on four microarray cancer data sets and compare it to existing methods. Our main result is that our approach yields intuitively bounded estimates similar to nested CV, at a dramatically lower computational price.
References
S. Varma and R. Simon (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7:91.
Bias-Variance Analysis of Local
Classification Methods
Julia Schiffner and Claus Weihs
Department of Statistics,
TU Dortmund
{schiffner, weihs}@statistik.tu-dortmund.de
Nowadays a plethora of classification methods is available and new ones or modi-
fications of established methods are regularly published. They can be grouped using
different properties as e.g. parametric or nonparametric, distance-based or not, pre-
dictive or generative etc. Another distinction can be made between global and local
methods. In recent years the number of publications on local classification methods has been increasing. Localized versions of nearly all standard classification techniques, like linear
discriminant analysis [1] and Fisher discriminant analysis [6], logistic regression [3, 7],
support vector machines [5] or boosting [8] are available.
The term local is only vaguely defined and used in a rather intuitive way by most
authors, referring to the position in some space, to a part of a whole or to something
that is not general or widespread. Often it relates to the neighborhood of the point where a prediction is required, with the k-nearest-neighbors method [2] as probably the
best-known example. But also other concepts of locality can be found in the relevant
literature. For example Hand and Vinciotti [3] use the term local to refer to points
close to the decision boundary. Most localization techniques can be applied in a generic
manner to many different classification methods which results in a rather broad field of
methods. We will give an overview of existing approaches and their properties.
A question of interest is how localization affects the performance of classification
methods. The bias-variance decomposition of prediction error is conducive to gaining
deeper insight into the behavior of learning algorithms. It was originally introduced for
quadratic loss functions, but since in classification the misclassification rate is usually of
interest, generalizations to zero-one loss have been developed in the last 15 years, e.g. [4].
This was particularly motivated by research on multi-classifier systems, where variance reduction was found to be one explanation for their often good performance.
In order to gain deeper insight into how local methods work, we analyze local classification methods in terms of bias and variance of the error rate. Our intuition, which is supported by our recent experiments, is that local methods in general reduce the bias in comparison with their global counterparts. We will show some toy examples to illustrate the decomposition and present some results for selected classification methods and localization types on simulated and real-world data sets.
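A small simulation sketch of such a decomposition for zero-one loss (in the spirit of [4]: bias = the main prediction differs from the true class, variance = disagreement with the main prediction; the two-class setting below is a toy of our own):

library(class)   # for knn()
set.seed(11)
x.test <- matrix(c(0.2, 0.2), nrow = 1)   # test point in class A territory
true.class <- "A"

preds <- replicate(200, {                 # resample the training set
  x <- matrix(rnorm(40 * 2, rep(c(0, 1), each = 20)), ncol = 2)
  y <- rep(c("A", "B"), each = 20)
  as.character(knn(x, x.test, y, k = 1))
})

main     <- names(which.max(table(preds)))   # main (modal) prediction
bias     <- as.numeric(main != true.class)
variance <- mean(preds != main)
c(bias = bias, variance = variance)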
References
[1] I. Czogiel, K. Luebke, M. Zentgraf, and C. Weihs. Localized linear discriminant
analysis. In R. Decker and H.-J. Lenz, editors, Advances in Data Analysis, volume 33
of Studies in Classification, Data Analysis, and Knowledge Organization, pages 133–
140, Berlin Heidelberg, 2007. Springer.
[2] E. Fix and J. L. Hodges. Discriminatory analysis: Nonparametric discrimination: Consistency properties. Report 4, U.S. Air Force School of Aviation Medicine, Randolph Field, Texas, 1951.
[3] D. J. Hand and V. Vinciotti. Local versus global models for classification problems:
Fitting models where it matters. The American Statistician, 57(2):124–131, May
2003.
[4] G. M. James. Variance and bias for general loss functions. Machine Learning, 51(2):
115–135, May 2003.
[5] N. Segata and E. Blanzieri. Fast and scalable local kernel machines. Journal of Machine Learning Research, 11:1883-1926, June 2010.
[6] M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher
discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, May
2007.
[7] G. Tutz and H. Binder. Localized classification. Statistics and Computing, 15:155–
166, 2005.
[8] C.-X. Zhang and J.-S. Zhang. A local boosting algorithm for solving classification
problems. Computational Statistics & Data Analysis, 52:1928–1941, 2008.
... We performed the differential expression analysis using the DESeq2 R package 1.20.0 (Anders and Huber, 2010). We adjusted the p-values using the Benjamini and Hochberg's approach for controlling the false discovery rate (Green and Diggle, 2007). ...
... The first was between a minimum coverage of six (below which heterozygous sites would be frequently missed) and a maximum coverage of 20 (above which the distribution inflected away from the mean, suggesting repetitive sequences). The second was a CV between 0.7 and 1.3, given that the expected value is approximately 1 for Poisson-distributed counts (Anders and Huber 2010), and strong divergence from this expectation suggests copynumber variation among samples. Approximately 102,000 tags were retained after coverage filtering. ...
Article
Full-text available
The Arizona Toad (Anaxyrus microscaphus) is restricted to riverine corridors and adjacent uplands in the arid southwestern United States. As with numerous amphibians worldwide, populations are declining and face various known or suspected threats, from disease to habitat modification resulting from climate change. The Arizona Toad has been petitioned to be listed under the U.S. Endangered Species Act and was considered “warranted but precluded” citing the need for additional information – particularly regarding natural history (e.g., connectivity and dispersal ability). The objectives of this study were to characterize population structure and genetic diversity across the species’ range. We used reduced-representation genomic sequencing to genotype 3,601 single nucleotide polymorphisms in 99 Arizona Toads from ten drainages across its range. Multiple analytical methods revealed two distinct genetic groups bisected by the Colorado River; one in the northwestern portion of the range in southwestern Utah and eastern Nevada and the other in the southeastern portion of the range in central and eastern Arizona and New Mexico. We also found subtle substructure within both groups, particularly in central Arizona where toads at lower elevations were less connected than those at higher elevations. The northern and southern parts of the Arizona Toad range are not well connected genetically and could be managed as separate units. Further, these data could be used to identify source populations for assisted migration or translocations to support small or potentially declining populations.
... After transcript annotation, differential gene expression (DEG) analysis between treated and control samples was performed using DESeq2 v1.18.1 program. The differential expressed transcripts were identified with p-value cut-off 0.05 (Anders and Huber 2010;Love et al. 2014). The log2FC (Fold change) value of ≥ 2/ ≤ − 2 was set as the threshold to identify transcript abundance between treated and control sample.. edgeR-3.20.9 was used for constructing MA plot and volcano plot. ...
Article
Full-text available
Spodoptera litura is a destructive lepidopteran generalist pest widespread in tropical and subtropical regions and causes huge yield loss by gregarious feeding on crop plants. During co-evolution, Zea mays (Var. African tall) has attained a well-crafted defence mechanism and can demote the performance of its invaders. When an insect feeds on a host/non-host plant, its digestive system needs to upregulate the first line of defence against a broad spectrum of antifeedants and toxins of host origin. To understand the molecular mechanisms underlying insect response to plant resistance factors, a comparative midgut transcriptome of Spodoptera litura fed on maize and control plants was investigated, which identified a total of 712 differentially expressed genes (DEGs), including 232 up-regulating and 480 down-regulating genes. Gene ontology, gene enrichment and pathway analysis revealed that upregulated genes are involved in carbohydrate metabolism, detoxification, defence, lipid metabolism, digestion, and signal transduction. In contrast, down-regulated genes were primarily linked to cytoskeleton, transport, signalling, carbohydrate and lipid metabolism, growth and developmental processes. The above results indicate an antinutritional stress on S. litura, which leads to a compensatory mechanism in the insect by enhanced digestibility and detoxification at the cost of growth and development. This study provides an overall understanding of the transcriptomic response of S. litura upon feeding on a suboptimal host. Nevertheless, our study forms the basis for future molecular studies on S. litura adaptation and may widen the scope for their management.
... We performed the differential expression analysis using the DESeq2 R package 1.20.0 (Anders and Huber, 2010). We adjusted the p-values using the Benjamini and Hochberg's approach for controlling the false discovery rate (Green and Diggle, 2007). ...
Article
Metal pollution caused by deep-sea mining activities has potential detrimental effects on deep-sea ecosystems. However, our knowledge of how deep-sea organisms respond to this pollution is limited, given the challenges of remoteness and technology. To address this, we conducted a toxicity experiment by using deep-sea mussel Gigantidas platifrons as model animals and exposing them to different copper (Cu) concentrations (50 and 500 μg/L) for 7 days. Transcriptomics and LC-MS-based metabolomics methods were employed to characterize the profiles of transcription and metabolism in deep-sea mussels exposed to Cu. Transcriptomic results suggested that Cu toxicity significantly affected the immune response, apoptosis, and signaling processes in G. platifrons. Metabolomic results demonstrated that Cu exposure disrupted its carbohydrate metabolism, anaerobic metabolism and amino acid metabolism. By integrating both sets of results, transcriptomic and metabolomic, we find that Cu exposure significantly disrupts the metabolic pathway of protein digestion and absorption in G. platifrons. Furthermore, several key genes (e.g., heat shock protein 70 and baculoviral IAP repeat-containing protein 2/3) and metabolites (e.g., alanine and succinate) were identified as potential molecular biomarkers for deep-sea mussel’s responses to Cu toxicity. This study contributes novel insight for assessing the potential effects of deep-sea mining activities on deep-sea organisms.
Article
Full-text available
The human microbiome, comprising microorganisms residing within and on the human body, plays a crucial role in various physiological processes and has been linked to numerous diseases. To analyze microbiome data, it is essential to account for inherent heterogeneity and variability across samples. Normalization methods have been proposed to mitigate these variations and enhance comparability. However, the performance of these methods in predicting binary phenotypes remains understudied. This study systematically evaluates different normalization methods in microbiome data analysis and their impact on disease prediction. Our findings highlight the strengths and limitations of scaling, compositional data analysis, transformation, and batch correction methods. Scaling methods like TMM show consistent performance, while compositional data analysis methods exhibit mixed results. Transformation methods, such as Blom and NPN, demonstrate promise in capturing complex associations. Batch correction methods, including BMC and Limma, consistently outperform other approaches. However, the influence of normalization methods is constrained by population effects, disease effects, and batch effects. These results provide insights for selecting appropriate normalization approaches in microbiome research, improving predictive models, and advancing personalized medicine. Future research should explore larger and more diverse datasets and develop tailored normalization strategies for microbiome data analysis.
Article
Full-text available
The main problem with localized discriminant techniques is the curse of dimensionality, which seems to restrict their use to the case of few variables. However, if localization is combined with a reduction of dimension the initial number of variables is less restricted. In particular it is shown that localization yields powerful classifiers even in higher dimensions if localization is combined with locally adaptive selection of predictors. A robust localized logistic regression (LLR) method is developed for which all tuning parameters are chosen data-adaptively. In an extended simulation study we evaluate the potential of the proposed procedure for various types of data and compare it to other classification procedures. In addition we demonstrate that automatic choice of localization, predictor selection and penalty parameters based on cross validation is working well. Finally the method is applied to real data sets and its real world performance is compared to alternative procedures.
Article
Full-text available
A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach guarantees high generalization ability by dividing the separation function in local optimization problems that can be handled very efficiently. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with datasets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional datasets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability.
Article
Full-text available
Microarray gene expression time-course experiments provide the opportunity to observe the evolution of transcriptional programs that cells use to respond to internal and external stimuli. Most commonly used methods for identifying differentially expressed genes treat each time point as independent and ignore important correlations, including those within samples and between sampling times. Therefore they do not make full use of the information intrinsic to the data, leading to a loss of power. We present a flexible random-effects model that takes such correlations into account, improving our ability to detect genes that have sustained differential expression over more than one time point. By modeling the joint distribution of the samples that have been profiled across all time points, we gain sensitivity compared to a marginal analysis that examines each time point in isolation. We assign each gene a probability of differential expression using an empirical Bayes approach that reduces the effective number of parameters to be estimated. Based on results from theory, simulated data, and application to the genomic data presented here, we show that BETR has increased power to detect subtle differential expression in time-series data. The open-source R package betr is available through Bioconductor. BETR has also been incorporated in the freely-available, open-source MeV software tool available from http://www.tm4.org/mev.html.
Article
: The discrimination problem (two population case) may be defined as follows: e random variable Z, of observed value z, is distributed over some space (say, p-dimensional) either according to distribution F, or according to distribution G. The problem is to decide, on the basis of z, which of the two distributions Z has.
Article
A methodological and computational framework for centroid-based partitioning cluster analysis using arbitrary distance or similarity measures is presented. The power of high-level statistical computing environments like R enables data analysts to easily try out various distance measures with only minimal programming effort. A new variant of centroid neighborhood graphs is introduced which gives insight into the relationships between adjacent clusters. Artificial examples and a case study from marketing research are used to demonstrate the influence of distances measures on partitions and usage of neighborhood graphs.
Article
Based on the boosting-by-resampling version of Adaboost, a local boosting algorithm for dealing with classification tasks is proposed in this paper. Its main idea is that in each iteration, a local error is calculated for every training instance and a function of this local error is utilized to update the probability that the instance is selected to be part of next classifier's training set. When classifying a novel instance, the similarity information between it and each training instance is taken into account. Meanwhile, a parameter is introduced into the process of updating the probabilities assigned to training instances so that the algorithm can be more accurate than Adaboost. The experimental results on synthetic and several benchmark real-world data sets available from the UCI repository show that the proposed method improves the prediction accuracy and the robustness to classification noise of Adaboost. Furthermore, the diversity-accuracy patterns of the ensemble classifiers are investigated by kappa-error diagrams.
Article
When using squared error loss, bias and variance and their decomposition of prediction error are well understood and widely used concepts. However, there is no universally accepted definition for other loss functions. Numerous attempts have been made to extend these concepts beyond squared error loss. Most approaches have focused solely on 0-1 loss functions and have produced significantly different defini- tions. These differences stem from disagreement as to the essential characteristics that variance and bias should display. This paper suggests an explicit list of rules that we feel any "reasonable" set of definitions should satisfy. Using this framework, bias and variance definitions are produced which generalize to any symmetric loss function. We illustrate these statistics on several loss functions with particular emphasis on 0-1 loss. We conclude with a discussion of the various definitions that have been proposed in the past as well as a method for estimating these quantities on real data sets.
Article
Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick.
Article
Robust multiarray analysis (RMA) is the most widely used preprocessing algorithm for Affymetrix and Nimblegen gene expression microarrays. RMA performs background correction, normalization, and summarization in a modular way. The last 2 steps require multiple arrays to be analyzed simultaneously. The ability to borrow information across samples provides RMA various advantages. For example, the summarization step fits a parametric model that accounts for probe effects, assumed to be fixed across arrays, and improves outlier detection. Residuals, obtained from the fitted model, permit the creation of useful quality metrics. However, the dependence on multiple arrays has 2 drawbacks: (1) RMA cannot be used in clinical settings where samples must be processed individually or in small batches and (2) data sets preprocessed separately are not comparable. We propose a preprocessing algorithm, frozen RMA (fRMA), which allows one to analyze microarrays individually or in small batches and then combine the data for analysis. This is accomplished by utilizing information from the large publicly available microarray databases. In particular, estimates of probe-specific effects and variances are precomputed and frozen. Then, with new data sets, these are used in concert with information from the new arrays to normalize and summarize the data. We find that fRMA is comparable to RMA when the data are analyzed as a single batch and outperforms RMA when analyzing multiple batches. The methods described here are implemented in the R package fRMA and are currently available for download from the software section of http://rafalab.jhsph.edu.
Article
The modeling of genetic networks especially from microarray and related data has become an important aspect of the biosciences. This review takes a fresh look at a specific family of models used for constructing genetic networks, the so-called Boolean networks. The review outlines the various different types of Boolean network developed to date, from the original Random Boolean Network to the current Probabilistic Boolean Network. In addition, some of the different inference methods available to infer these genetic networks are also examined. Where possible, particular attention is paid to input requirements as well as the efficiency, advantages and drawbacks of each method. Though the Boolean network model is one of many models available for network inference today, it is well established and remains a topic of considerable interest in the field of genetic network inference. Hybrids of Boolean networks with other approaches may well be the way forward in inferring the most informative networks.