Differential expression analysis for sequence
count data
Wolfgang Huber
EMBL Heidelberg
wolfgang.huber@embl.de
High-throughput DNA sequencing is a powerful and versatile new technology for ob-
taining comprehensive and quantitative data about RNA expression (RNA-Seq), protein-
DNA binding (ChIP-Seq), and genetic variations between individuals. It addresses es-
sentially all of the use cases that microarrays were applied to in the past, but produces
more detailed and more comprehensive results.
One of the basic statistical tasks is inference (testing, regression) on discrete count
values (e.g., representing the number of times a certain type of mRNA was sampled by
the sequencing machine). Challenges are posed by a large dynamic range, heteroskedas-
ticity and small numbers of replicates. Hence, model-based approaches are needed to
achieve statistical power.
I will present an error model that uses the negative binomial distribution, with vari-
ance and mean linked by local regression, to model the null distribution of the count
data. The method controls type-I error and provides good detection power. I will also
discuss how to use the GLM framework to detect alternative transcript isoform usage. A
free open-source R software package, DESeq, is available from the Bioconductor project.
* joint work with Simon Anders
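Illustration (not part of the abstract): the following base-R sketch mimics the core modeling idea in a heavily simplified form, fitting a mean-variance link by lowess (standing in for the local regression mentioned above) and testing one count against the implied negative binomial null. DESeq's actual procedure (size-factor normalization, a conditioned test on the sum of counts) is more involved; all data and the helper null.p are illustrative.

# Simulate two replicate counts for 1000 genes and estimate a
# mean-variance link by local regression.
set.seed(1)
means <- rexp(1000, 1/100)
counts <- sapply(means, function(m) rnbinom(2, mu = m, size = 5))
m.hat <- colMeans(counts)              # per-gene mean of the replicates
v.hat <- apply(counts, 2, var)         # per-gene variance
link  <- lowess(m.hat, v.hat)          # variance as a function of the mean
var.of.mean <- approxfun(link$x, link$y, rule = 2)

# Two-sided p-value of an observed count under the NB null whose
# variance is taken from the fitted link (method-of-moments size).
null.p <- function(k, mu) {
  v <- max(var.of.mean(mu), mu * 1.001)   # NB requires variance > mean
  size <- mu^2 / (v - mu)
  2 * min(pnbinom(k, mu = mu, size = size),
          1 - pnbinom(k - 1, mu = mu, size = size))
}
null.p(250, mu = 100)   # a count far above its expected value: small p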
Min P test: a resampling based gene
region-level testing procedure for genetic
case-control studies implemented in R
Stefanie Hieke, Harald Binder, Alexandra Nieters
and Martin Schumacher
Institute of Medical Biometry and Medical Informatics,
Center of Chronic Immunodeficiency,
University Medical Center Freiburg,
Freiburg Center for Data Analysis and Modeling,
University Freiburg
hieke@imbi.uni-freiburg.de
Introduction Current technologies generate a huge number of single nucleotide poly-
morphism (SNP) genotype measurements in case-control studies. The resulting multiple
testing problem can be ameliorated by considering candidate gene regions. The minPtest
R package provides the first widely accessible implementation of a gene region-level sum-
mary for each candidate gene using the min P test.
Method The gene region-level summary, the min P test, assesses the statistical significance of the smallest trend-test p-value within each gene region and therefore considers a reduced number of tests. The min P test is a permutation-based method that can be based on several univariate tests per SNP. In permutation resampling, the observed variable (case/control status) is randomly re-assigned without replacement to a "pseudo case/control status". A test statistic is then recomputed using the pseudo data and compared to the marginal test statistic in the original data set. This procedure is repeated B times. The inference is based on the permutation distribution of the minimum of the ordered p-values from the marginal test of each SNP. The gene region-level summary is compatible with most univariate statistical tests per SNP conducted separately over multiple loci.
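As an illustration of this permutation scheme (not the minPtest interface itself), the following base-R sketch computes a region-level permutation p-value from per-SNP Cochran-Armitage trend tests via stats::prop.trend.test; the data and the region size are simulated placeholders.

set.seed(1)
n <- 200
status <- rbinom(n, 1, 0.5)                 # case/control labels
snps <- replicate(5, sample(0:2, n, TRUE))  # genotypes of one gene region

minP <- function(status, snps) {
  min(apply(snps, 2, function(g) {
    tab <- table(factor(g, levels = 0:2), status)
    prop.trend.test(tab[, "1"], rowSums(tab))$p.value  # p-trend per SNP
  }))
}

obs <- minP(status, snps)                         # observed min P
B <- 1000
perm <- replicate(B, minP(sample(status), snps))  # permute labels, recompute
mean(perm <= obs)                                 # region-level p-value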
Results Combining the p-values from tests in a permutation-based approach prevents an increase in the false-positive rate, as correlations between SNPs are automatically taken
into account. We developed an R package that brings together three different kinds
of tests that are scattered over several R packages, and automatically selects the most
appropriate one for the design at hand. The implementation in the minPtest package
integrates two different parallel computing packages, thus optimally leveraging available
resources for speedy results. The package comprises a function to simulate SNP data
with known structure, allowing the user to explore different scenarios and settings.
Conclusion The minPtest package provides a useful and feasible implementation of a gene region-level summary, using the min P test, that controls the false-positive rate while retaining high power. In addition, minPtest provides acceleration by parallel computing.
References
Chen, B.E. et al. (2006). Resampling-based multiple hypothesis testing procedures for genetic case-control association studies. Genetic Epidemiology, 30, 495-507.
R Development Core Team (2010). R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0. URL: http://www.R-project.org.
Westfall, P.H. et al. (2002). Multiple tests for genetic effects in association studies. Methods Mol Biol, 184, 143-168.
Westfall, P.H. and Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley, New York.
Survival models with preclustered gene
groups as covariates
K. Kammers, M. Lang and J. Rahnenführer
Departments of Statistics,
TU Dortmund University
lang@statistik.tu-dortmund.de
An important application of high dimensional gene expression measurements is the
prediction of survival times and the interpretation of the variables in the resulting regres-
sion models. When the response variables are censored survival times, an appropriate
hazard framework is required. The largest problem in this context is the typically large
number of genes compared to the number of observations (individuals). We thus apply
feature selection procedures to construct predictive models for future patients. This ap-
proach aims at identifying models with high prediction accuracy and at the same time
low model complexity. However, interpretability of the resulting models is still limited
due to little knowledge on many of the remaining selected genes. In order to improve
the interpretability of the estimated models, we summarize genes as gene groups defined
by the hierarchically structured Gene Ontology (GO) and include these gene groups
as covariates in the hazard regression models. However, the expression profiles within GO groups are often heterogeneous, with several distinct expression profiles present in one group. Preclustering genes within GO groups according to the correlation of their gene expression measurements leads to homogeneous subclasses. This allows the aggregation of each subclass into a single covariate with predictive importance as well as, as a result of the GO annotations, additional interpretability. Besides the genomic data, we include clinical information to reveal the real benefit of the preclustered genomic
models. To evaluate the prediction performance of the models, we examine both Brier
scores and p-values derived from the prognostic index in a nested cross-validation setup.
Survival models with preclustered gene groups as covariates have similar prediction ac-
curacy to models built only with single genes. Using only gene groups as covariates can
lead to decreased prediction accuracy since many genes are not yet annotated to any
corresponding function. However, integrating the preclustering information improves
the interpretability of the models while prediction performance remains stable.
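A minimal base-R sketch of the preclustering step described above (the gene names, the cutoff and the GO group membership are hypothetical): genes within one GO group are clustered by correlation distance and each subclass is aggregated into a single covariate.

set.seed(2)
expr <- matrix(rnorm(50 * 20), nrow = 50,
               dimnames = list(paste0("gene", 1:50), NULL))
go.group <- paste0("gene", 1:12)     # genes annotated to one GO term

d  <- as.dist(1 - cor(t(expr[go.group, ])))   # correlation distance
hc <- hclust(d, method = "average")
subclass <- cutree(hc, h = 0.5)               # homogeneous subclasses

# One covariate per subclass: the mean expression profile of its genes
covariates <- t(sapply(split(go.group, subclass),
                       function(g) colMeans(expr[g, , drop = FALSE])))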
Evaluation and validation of gene
expression signatures for prognostic use in
node negative breast cancer patients
Aslihan Gerhold-Ay, Anja Victor and Marcus Schmidt
Institute of Medical Biostatistics, Epidemiology and Informatics,
University Medical Center of the Johannes Gutenberg University Mainz,
Merck KGaA, Darmstadt,
Department of Obstetrics and Gynaecology,
University Medical Center of the Johannes Gutenberg University Mainz
aslihan.gerhold-ay@unimedizin-mainz.de
Introduction The most widely used treatment guidelines for breast cancer are based
on classical risk factors like the St. Gallen classification. The guidelines recommend
adjuvant systemic therapy for almost all breast cancer patients because this therapy
has greatly improved survival in early breast cancer. However, adjuvant therapy also has considerable negative effects on quality of life. For this reason there is a need to specify an individual risk profile for each patient, to avoid over- as well as undertreatment. To obtain useful risk profiles, different predictors based on patients' gene expression have been developed for breast cancer (1; 2; 3; 4; 5). Furthermore, two gene expression predictors are currently being tested in prospective clinical trials (6; 7). The aim of our project is the evaluation and validation of these well-known signatures on the Mainz cohort.
Methods The Mainz cohort consists of 199 node-negative breast cancer patients treated between 1989 and 1998 at the Department of Obstetrics and Gynaecology, Medical Center of the Johannes Gutenberg University Mainz. All patients were treated with surgery and did not receive any systemic therapy. The data collected comprise the classical risk factors and, in addition, gene expression data from the Affymetrix HG-U133A chip (8). To analyse the effect of a signature on survival, we apply the log-rank test and uni- and multivariate Cox regression. ROC curves, with distant metastasis within 5 years as the defined endpoint, were used to describe the quality of the signatures' classification into low- and high-risk groups. Cluster analyses were performed to identify
the intrinsic subtypes of breast cancer. Simulations are currently being run to analyse the stability of the intrinsic subtype signature (3; 4; 5), which is based on previously reported molecular subtypes of breast cancer. Furthermore, approaches were identified to develop a new tumor grade signature based on gene expression data. About half of all breast cancers are assigned histological grade 1 or 3. The other breast tumors are classified as histological grade 2, which is not informative for clinical decision making because of the intermediate risk of recurrence. To increase the prognostic value of tumor grade 2, new methods are necessary to classify these tumors as tumor grade 1 or tumor grade 3.
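The survival analyses mentioned above could be carried out along the following lines with the standard survival package; the data frame and variable names below are made-up stand-ins for the Mainz cohort data.

library(survival)
set.seed(7)
d <- data.frame(time   = rexp(199, 0.05),
                status = rbinom(199, 1, 0.4),
                risk.group = factor(sample(c("low", "high"), 199, TRUE)),
                age    = rnorm(199, 58, 10),
                grade  = factor(sample(1:3, 199, TRUE)))

survdiff(Surv(time, status) ~ risk.group, data = d)   # log-rank test
coxph(Surv(time, status) ~ risk.group, data = d)      # univariate Cox
coxph(Surv(time, status) ~ risk.group + age + grade,  # multivariate Cox
      data = d)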
Results The Mainz cohort is similar to the populations used for developing the gene signatures with respect to classical risk factors. Not all of the published prognostic values of the gene signatures could be validated on the Mainz cohort.
Discussion Gene signatures can provide a powerful tool for the identification of patients with a high risk of recurrence. Many potential sources of bias (dye bias, sampling bias, time lag bias and publication bias (9; 10)) can make the translation of the methods into practice difficult. Based on our results we recommend prospective studies to test the validity of the signatures.
References
[1] Y Wang, J G M Klijn, Y Zhang, A M Sieuwerts, M P Look, F Yang, D Talantov, M
Timmermans, M E Meijer-van Gelder, J Yu, T Jatkoe, E M J J Berns, D Atkins, J A
Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative
primary breast cancer. Lancet 2005; 365:671–79.
[2] C Sotiriou, P Wirapati, S Loi, A Harris, S Fox, J Smeds, H Nordgren, P Farmer, V Praz, B Haibe-Kains, C Desmedt, D Larsimont, F Cardoso, H Peterse, D Nuyten, M Buyse, M J Van de Vijver, J Bergh, M Piccart, M Delorenzi. Gene-expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J Natl Cancer Inst. 2006; 98:262-72.
[3] C M Perou, T Sørlie, M B Eisen, M van de Rijn, S Jeffrey, C A Rees, J R Pollack, D T Ross, H Johnsen, L A Akslen, O Fluge, A Pergamenschikov, C Williams, S X Zhu, P E Lønning, A L Børresen-Dale, P O Brown, D Botstein. Molecular portraits of human breast tumours. Nature 2000; 406(6797):747-52.
[4] Z Hu, C Fan, D S Oh, J S Marron, X He, B F Qaqish, C Livasy, L A Carey, E Reynolds, L Dressler, A Nobel, J Parker, M G Ewend, L R Sawyer, J Wu, Y Liu, R Nanda, M Tretiakova, A Ruiz Orrico, D Dreher, J P Palazzo, L Perreard, E Nelson, M Mone, H Hansen, M Mullins, J F Quackenbush, M J Ellis, O I Olopade, P S Bernard, C M Perou. The Molecular Portraits of Breast Tumors Are Conserved Across Microarray Platforms. BMC Genomics 2006; 7:96.
[5] M Smid, Y Wang, Y Zhang, A M Sieuwerts, J Yu, J G M Klijn, J A Foekens, J W M Martens. Subtypes of Breast Cancer Show Preferential Site of Relapse. Cancer Res. 2008; 68(9):3108-14.
[6] S Paik, S Shak, G Tang, C Kim, J Baker, M Cronin, F L Baehner, M G Walker, D Watson, T Park, W Hiller, E R Fisher, D Wickerham, J Bryant, N Wolmark. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004; 351(27):2817-2826.
[7] M J van de Vijver, Y D He, L J van't Veer, H Dai, A A M Hart, D W Voskuil, G J Schreiber, J L Peterse, C Roberts, M J Marton, M Parrish, D Atsma, A Witteveen, A Glas, L Delahaye, T van der Velde, H Bartelink, S Rodenhuis, E T Rutgers, S H Friend, R Bernards. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002; 347(25):1999-2009.
[8] M Schmidt, D Böhm, C von Törne, E Steiner, A Puhl, H Pilch, H Lehr, J G Hengstler, H Kölbl, M Gehrmann. The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer. Cancer Research 2008; 68:5405-5413.
[9] K K Dobbin, E S Kawasaki, D W Petersen, R M Simon. Characterizing dye bias in microarray experiments. Bioinformatics 2005; 21(10):2430-7.
[10] J P Ioannidis, E E Ntzani, T A Trikalinos, D G Contopoulos-Ioannidis. Replication
validity of genetic association studies. Nat Genet. 2001; 29(3):306-9.
Differential Expression Analysis and Cluster
Method for Time Course Microarray Data
Khalid A. Abnaof and Holger Fröhlich
Bonn-Aachen International Center for Information Technology (B-IT),
Bonn University
Institute of Molecular Biotechnology
RWTH Aachen University
abnaof@bit.uni-bonn.de
Understanding the mechanisms by which transcription factors (TFs) dynamically regulate genes and other transcription factors in multicellular organisms is an important and interesting task in molecular biology. However, this task is not easy to tackle, as the dynamic process underlying this regulatory system is very complex. Here, we were particularly interested in TF-target gene networks (transcriptional programs) in multipotent progenitor (MPP) and common dendritic progenitor (CDP) cells in mice, which are dependent on TGF-β stimulation.
The data used in this study consisted of time series microarray data at six time points.
We applied a Bayesian approach to determine differentially expressed time series between stimulated and unstimulated cells and between cell types (1). A logistic regression model was used to perform an analysis of differential expression at the pathway level (2). Afterwards, the time courses for each cell type were grouped into clusters using an EM-based approach that describes the mean curves within a cluster via smoothing spline models (3). This approach, unlike conventional clustering methods, considers the dynamics of expression changes. Further analysis of enriched transcription factor binding sites (TFBS) within clusters allowed a partial reconstruction of gene regulatory modules. Hypotheses on the dependencies between these gene regulatory modules may be derived in the future via Dynamic Bayesian Networks (DBNs). A meta-analysis of TFBS enriched in MPPs, but not in CDPs, points towards gene regulatory mechanisms that drive stem cell development in dendritic progenitor cells.
References
Martin J. Aryee, José A. Gutiérrez-Pabello, Igor Kramnik, Tapabrata Maiti and John Quackenbush (2009): An improved empirical Bayes approach to estimating differential gene expression in microarray time-course data. BMC Bioinformatics, 10:409.
Montaner D, Dopazo J (2010): Multidimensional Gene Set Analysis of Genomic Data. PLoS ONE 5(4): e10348. doi:10.1371/journal.pone.0010348.
Ping Ma, Cristian I. Castillo-Davis, Wenxuan Zhong and Jun S. Liu (2006): A data-driven clustering method for time course gene expression data. Nucleic Acids Research, 34(4):1261-1269.
Phenotype Microarray Data: organisation
and analysis of respiration curves
Lea A. I. Vaas, Johannes Sikorski and Markus Göker
DSMZ German Collection of Microorganisms and Cell Cultures GmbH,
Braunschweig
Lea.Vaas@dsmz.de
Recently, the set of techniques generating so-called omics data was augmented by yet another one: Phenotype Microarrays (PM). In contrast to the existing major technologies, i.e. DNA microarrays, 2D proteomics and chromatographic applications, PM monitors cell respiration over time. Through a redox reaction that alters the colour of a tetrazolium dye in the presence of respiration, kinetic response curves are generated. This provides a high-throughput means to characterize microbial metabolism. The system comprises about 2000 assays for monitoring the cells' respiration in the presence of macro- and micronutrients, or their reactions to osmotic stress factors and ion or pH effects. The application of a number of chemicals, such as antibiotics, antimetabolites, membrane-active agents, respiratory inhibitors and toxic metals, to investigate the cells' sensitivity is also possible. Besides applications in identification and drug screening
scenarios, where mainly presence-absence calls from each assay are of interest, many research projects call for more sophisticated comparisons of the phenotypes of different strains, isolates, mutants, etc. The in-depth evaluation of redox kinetics should yield knowledge about the metabolic differences and provide indications of the genetic features of the investigated organisms on which these differences are based. The main steps in such analyses are (1) data organisation and graphical presentation of the curves, (2) application of methods for the reliable estimation of the growth parameters of each curve, and (3) extraction and comparison of such growth curve characteristics. Based on a summary of the available statistical tools covering large parts of the required analyses, an application strategy for these ready-to-use software tools for the analysis of PM data in R will be proposed. In addition to the statistical challenges this new type of high-dimensional data brings along, this talk will give an outlook on features to be provided for a more convenient data analysis pipeline, comprising the organisation of meta-data, concepts for handling the raw data, data analysis, and presentation of the results.
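As an example of step (2), a respiration curve's parameters can be estimated by fitting a parametric growth model with base R's nls(); the logistic form and the simulated curve below are illustrative choices, not the tools proposed in the talk.

set.seed(3)
hours  <- seq(0, 48, by = 0.5)
signal <- 200 / (1 + exp(-0.3 * (hours - 20))) +
          rnorm(length(hours), sd = 5)       # simulated dye signal

fit <- nls(signal ~ A / (1 + exp(-mu * (hours - lambda))),
           start = list(A = max(signal), mu = 0.1, lambda = 15))
coef(fit)  # A: maximum signal, mu: steepness, lambda: curve midpoint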
Analysing structured data: symbolic and
probabilistic approaches
Luc De Raedt
Katholieke Universiteit Leuven,
luc.deraedt@cs.kuleuven.be
Structured data in the form of graphs, networks or relational databases are om-
nipresent across numerous application-areas such as biology, chemistry, the internet,
social or bibliographic networks, robotics, vision, etc. The machine learning and data
mining literature has devoted a lot of attention to coping with such data giving rise
to a class of techniques that is known under the name of graph and network mining
or relational learning. In this talk, an introduction will be given to this class of tech-
niques, which often extend or upgrade more traditional techniques for dealing with ”flat”
data (that is, data in feature vector format) towards graph-based and relational data.
This talk will introduce and motivate the problems and techniques of relational and
graph-based learning and look into both symbolic and probabilistic methods. Symbolic
methods attempt to identify patterns in the form of subgraphs in graph data (e.g., in molecular datasets), while probabilistic methods extend graphical models (such as Bayesian and Markov networks) to deal with relational data. The approaches will be illustrated and motivated by several real-life examples.
Association of complex human pain
phenotypes with complex pain genotypes
using a self-organizing maps approach
Jörn Lötsch and Alfred Ultsch
pharmazentrum frankfurt/ZAFES, Institute of Clinical Pharmacology,
Johann Wolfgang Goethe University Hospital,
Data Bionics Research Group,
University of Marburg
ultsch@mathematik.uni-marburg.de
BACKGROUND Pain is a complex trait. While clinical pain syndromes can already
be diagnosed by a set of neurological parameters, the complexity of experimental pain
is only incompletely accounted for, which often impedes associations of pain data with
clinical or genetic parameters.
METHODS Pain phenotype markers (n = 8) and genotype markers (n = 30) were
available from previous assessments in 125 healthy volunteers. A U-Matrix on an Emergent Self-Organizing Map (ESOM) was used for visualization of the distance structures
in the data. Subsequently, the prediction of the clusters by the genetic markers was
assessed using a classification and regression tree (CART) approach.
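The CART step could look as follows with the rpart package; all data here are simulated stand-ins for the 125 volunteers, the phenotype-derived clusters and the 30 genotype markers.

library(rpart)
set.seed(5)
n <- 125
geno    <- as.data.frame(replicate(30, sample(0:2, n, TRUE)))
cluster <- factor(sample(paste0("C", 1:8), n, TRUE))   # ESOM clusters

tree <- rpart(cluster ~ ., data = cbind(cluster, geno), method = "class")
pred <- predict(tree, type = "class")
mean(pred == cluster)   # resubstitution accuracy (optimistic estimate)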
RESULTS On the U-Matrix of the pain phenotypes, eight clusters were identified. This clustering showed advantages over a Ward clustering of the same data. Rules could be derived to describe the cluster contents, which corresponded to three basic types of pain thresholds: low, mean and high sensitivity. Within the mean- and low-sensitivity (stoical) phenotypes, subgroups could be identified. For a cluster consisting of persons with a high overall pain threshold but selectively low resistance to heat, the predictive accuracy of the classifiers was 84.56%. Among the genetic variants used for the CART decision in that cluster were polymorphisms in TRPV1, a gene coding for a heat sensor.
CONCLUSIONS ESOM-based clustering of pain data provides biologically meaningful results that do justice to the complexity of pain. The clusters thus obtained seem to facilitate the otherwise only insufficiently successful genotype-phenotype association in common pain.
Spectral graph features for the classification
of graphs and graph sequences
Miriam Schmidt, Günther Palm and Friedhelm Schwenker
Institute of Neural Information Processing,
University of Ulm
{miriam.k.schmidt,friedhelm.schwenker}@uni-ulm.de
Spectral graph theory is an important branch in the area of graph classification. Matrices associated with graphs, such as the adjacency matrix or the Laplacian matrix, contain essential information about graph connectivity (Cvetković et al., 1998). In this study, the power of the principal eigenvalues of adjacency matrices for classification tasks is investigated.
In order to illustrate the proposed method, a toy problem to classify 2D-objects has
been defined. The goal was to discriminate between two classes: circles and squares.
An object is represented by a set of 2D points, describing the object’s outer shape.
These points are considered as the graph’s nodes, and the graph’s adjacency matrix is
defined through the pairwise Euclidean distances of the points. From this adjacency
matrix principal eigenvalues are computed as the object’s characteristic features. This
method is then evaluated on a problem of optical character recognition. For this, the
capital letters data set from the Bern repository of graph data sets (see Riesen and Bunke, 2008) has been selected. Additionally, this method has been applied to the problem of human
activity recognition based on sequences of camera images. In this task, hidden Markov models capture the sequential structure of the data. In the first step, the locations of
the person’s body parts (hand, head, etc.) and objects (table, cup, etc.), relevant for
the human activity, have to be estimated in each camera image. Subsequently distances
between all pairs of detected objects and body parts are computed and the eigenvalues
of this Euclidean distance matrix are calculated. These eigenvalues serve as inputs to
Gaussian mixture models estimating the emission probabilities of the hidden Markov
models.
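A minimal base-R sketch of the proposed feature extraction: the adjacency matrix is built from the pairwise Euclidean distances of the shape points, and its leading eigenvalues serve as features (the point sets below are toy shapes).

spectral.features <- function(points, k = 4) {
  A  <- as.matrix(dist(points))   # distance-based adjacency matrix
  ev <- eigen(A, symmetric = TRUE, only.values = TRUE)$values
  ev[seq_len(k)]                  # k principal eigenvalues as features
}

theta  <- seq(0, 2 * pi, length.out = 21)[-21]
circle <- cbind(cos(theta), sin(theta))        # 20 points on a circle
square <- cbind(c(0, 1, 1, 0), c(0, 0, 1, 1))  # 4 corner points
spectral.features(circle)
spectral.features(square)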
References
Cvetković, D.M., Doob, M., Sachs, H.: Spectra of Graphs: Theory and Applications. VCH Verlagsgesellschaft (1998)
Riesen, K., Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recog-
nition and Machine Learning. Structural, Syntactic, and Statistical Pattern Recognition
(LNCS 5342), 287–297 (2008)
Tuning distance measures in k-nearest
neighbor classification via evolution
strategies
Alexander Melkozerov, Ludwig Lausser and Hans A. Kestler
Institute of Neural Information Processing,
University of Ulm
{alexander.melkozerov,ludwig.lausser,hans.kestler}@uni-ulm.de
Molecular high-throughput technologies usually generate data of high dimensionality and low cardinality. In the context of classification this poses a serious problem, as many classical (model-based) methods turn out to be too complex for this task. This has stimulated the development of new classifiers of lower complexity, which attain higher generalization performance through additional regularization terms and more rigid model assumptions.
Some classification methods completely omit the usage of model assumptions. For example, transductive k-nearest neighbor (k-NN) classifiers directly predict the class label of a data point x according to the k training examples closest to x. The performance of this technique is always coupled to the chosen distance measure, which is often, due to the lack of a better choice, the Euclidean one. Other, non-standard distance measures may be more suitable for this task. Here, we investigate the usability of optimized distance measures of the type
$d_{\vec{w}}(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$
for k-NN classification. The weights $\vec{w}$ are optimized to minimize the empirical risk of the current classifier. Two types of evolution strategies (ES) were utilized: the standard self-adaptive ES with intermediate recombination (the so-called $(\mu/\mu_I, \lambda)$-$\sigma$SA-ES) and the covariance matrix adaptation ES (the $(\mu_W, \lambda)$-CMA-ES). While the former is a simple ES which provides baseline performance and serves as a reference in our comparison, the $(\mu_W, \lambda)$-CMA-ES is a state-of-the-art algorithm for continuous optimization that has shown very good performance in recent experimental benchmarks. Observing that the fitness function under consideration is multi-modal, the following restart strategy was used for the $(\mu/\mu_I, \lambda)$-$\sigma$SA-ES: if no improvement of the best fitness function value occurs for 300 generations or the mutation strength becomes too small, the ES restarts the search from a random point.
The advanced $(\mu_W, \lambda)$-CMA-ES uses the following restart criteria in addition to the check for improvement of the best fitness function value:
- the standard deviation $\sigma^{(g)}$ of the normal distribution used to sample new points or the evolution path is smaller than a given value;
- numerical precision problems: the mean $\langle y \rangle^{(g)}$ of newly sampled points does not change when adding to $\langle y \rangle^{(g)}$ a $0.1\sigma^{(g)}$-vector in a principal axis direction of the covariance matrix, or $0.2\sigma^{(g)}$ to each coordinate of $\langle y \rangle^{(g)}$;
- the condition number of the covariance matrix is too large.
After each restart, the $(\mu_W, \lambda)$-CMA-ES runs from a random point with the population size increased by a factor of 2.
The performance of these modified k-NN techniques is investigated in a comparative study on different microarray datasets. The new classifiers are compared to other well-known classification techniques with respect to their generalization ability, robustness and sparsity.
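For illustration, the weighted distance $d_{\vec{w}}$ and the empirical risk minimized by the ES can be sketched in a few lines of base R (the ES loop itself is omitted; the function names are ours):

d.w <- function(y, x, w) sqrt(sum(w * (x - y)^2))   # weighted distance

knn.weighted <- function(train, labels, test, w, k = 3) {
  apply(test, 1, function(x) {
    d <- apply(train, 1, d.w, x = x, w = w)
    names(which.max(table(labels[order(d)[1:k]])))  # majority vote
  })
}

empirical.risk <- function(w, train, labels, k = 3) {
  # leave-one-out error of the weighted k-NN on the training set
  pred <- sapply(seq_len(nrow(train)), function(i)
    knn.weighted(train[-i, , drop = FALSE], labels[-i],
                 train[i, , drop = FALSE], w, k))
  mean(pred != labels)
}

The weight vector would then be the search variable of the chosen evolution strategy, with empirical.risk as the fitness function.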
optile: Optimizing k-dimensional graphical
classification analysis via category
reordering
Alexander Pilhöfer and Alexander Gribov
Department of Computer Oriented Statistics and Data Analysis,
Institute of Mathematics,
University of Augsburg
alexander.pilhoefer@math.uni-augsburg.de
In cluster analysis it is good practice to regard several different clustering methods
instead of focusing on only one type. The interest lies in the agreements as well as
the differences between the clusterings. A good way to interpret the results is to use
visualizations such as fluctuation diagrams or categorical parallel coordinate plots (Pilhoefer and Unwin, 2011). Clustering classifications usually have a nominal categorical structure, and therefore the category orders can be changed to improve the displays and make their interpretation easier. Different seriation methods (see Chen et al., 2008)
have been proposed to improve the displays for 2-dimensional problems, mostly using
one-mode optimizations like the Anti-Robinson-Criterion in distance matrices which do
not directly account for the associations between the variables. The talk will present a
family of criteria and related optimization algorithms which can be used to choose the
category orders for 2- and k-dimensional categorical classification data with respect to
their multidimensional associations using the concept of agreement in a pseudo-diagonal
form. An effective optimization algorithm will be presented and the applicability to
both table-like plots such as fluctuation diagrams and line-based plots such as parallel
coordinates plots will be discussed. A special modification of the algorithm which takes
account of hierarchical classification structures will be presented using implementations
in the software package Seurat (Gribov, 2010). The talk will use real data clustering
results from the US Current Population Survey (CPS) for illustration.
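As a toy illustration of the underlying idea (a greedy stand-in, not the talk's criteria family or algorithm): reorder the categories of one clustering so that its cross-tabulation with a second clustering concentrates agreement on the diagonal.

set.seed(12)
c1  <- sample(letters[1:4], 300, TRUE)   # two 4-class clusterings
c2  <- sample(LETTERS[1:4], 300, TRUE)
tab <- table(c1, c2)

diag.mass <- function(tab) sum(diag(tab))  # agreement on the diagonal

best <- tab; improved <- TRUE
while (improved) {                         # greedy pairwise row swaps
  improved <- FALSE
  for (i in 1:(nrow(best) - 1)) for (j in (i + 1):nrow(best)) {
    perm <- replace(seq_len(nrow(best)), c(i, j), c(j, i))
    cand <- best[perm, ]
    if (diag.mass(cand) > diag.mass(best)) { best <- cand; improved <- TRUE }
  }
}
best   # category order with locally maximal diagonal agreement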
References
Chen, C.-h., W. Haerdle, A. Unwin, H.-M. Wu, S. Tzeng, and C.-h. Chen (2008). Matrix visualization. In Handbook of Data Visualization, Springer Handbooks of Computational Statistics, pp. 681-708. Springer, Berlin Heidelberg.
Gribov, A. (2010, June). Seurat - visual analytics for the integrated analysis of microarray data. http://seurat.r-forge.r-project.org/.
Hofmann, H. (2000). Exploring categorical data: Interactive mosaic plots. Metrika 51(1), 11-26.
Pilhoefer, A. and A. Unwin (2011). Multiple barcharts for relative frequencies and parallel coordinates plots for categorical data - package extracat. Journal of Statistical Software, submitted.
Order constrained clustering for music
structure analysis
Sebastian Krey, Uwe Ligges and Friedrich Leisch
Fakultät Statistik,
Technische Universität Dortmund,
Universität für Bodenkultur Wien
krey@statistik.tu-dortmund.de
In music structure analysis, unsupervised machine learning methods are desirable to get a first overview and to segment a piece into parts that may be relevant for further analyses. Traditional unconstrained clustering methods may yield unstable and uninterpretable results, particularly when used on sound features derived from recordings of real music. Therefore, it is helpful to constrain the possible solutions in a way that suppresses frequently alternating cluster assignments.
One intuitive constraint is the temporal order of the recording which implies that a
sensible cluster consists of connected time segments. Steinley and Hubert [1] describe a
method to introduce an order constraint in clustering. Using an R implementation [2,3]
of this idea, we get promising results. Due to the exponential runtime in the number
of clusters, we propose a recursive tree-based approach of order constrained clustering
with small cluster numbers (e.g. 2).
This way, on a piece of popular music, it is possible to obtain clusters which represent musical parts like intro, verse, refrain and bridge. This is promising in the sense that it is typically more beneficial to train a music recognition system with a characteristic part of a song than with artificially chosen segments.
In addition, it is possible to segment separate tones into attack, sustain, decay and silence, without splitting any of these tone phases into several clusters.
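One recursive step of such an approach can be sketched as an order-constrained two-way split: choose the change point that minimizes the total within-segment sum of squares, so that both clusters remain connected time segments. This base-R toy is our simplification for illustration, not the Steinley-Hubert algorithm itself.

best.split <- function(x) {
  # x: T x p matrix of frame-wise sound features in temporal order
  wss  <- function(m) sum(scale(m, scale = FALSE)^2)  # within-segment SS
  cost <- sapply(2:(nrow(x) - 1), function(t)
    wss(x[1:t, , drop = FALSE]) + wss(x[(t + 1):nrow(x), , drop = FALSE]))
  which.min(cost) + 1   # last frame of the first segment
}

set.seed(2)
x <- rbind(matrix(rnorm(50 * 3, mean = 0), ncol = 3),
           matrix(rnorm(70 * 3, mean = 4), ncol = 3))
best.split(x)   # detects the change point near frame 50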
References
[1] Douglas Steinley, Lawrence Hubert (2008). "Order-constrained solutions in k-means clustering: Even better than being globally optimal", Psychometrika, Vol. 73, No. 5, pp. 647-664.
[2] Sebastian Hoffmeister (2009). "Partitionierende Clusterverfahren unter Ordnungs-Nebenbedingungen" [Partitioning clustering methods under order constraints], Diploma thesis, Institut für Statistik, Ludwig-Maximilians-Universität München.
[3] Friedrich Leisch (2006). "A Toolbox for K-Centroids Cluster Analysis", Computational Statistics and Data Analysis, Vol. 51, No. 2, pp. 526-544.
Inference of Boolean networks by fuzzy sets
Martin Hopfensitz, Markus Maucher and Hans A. Kestler
Internal Medicine I,
University Hospital Ulm,
Institute of Neural Information Processing,
Ulm University
{martin.hopfensitz, markus.maucher, hans.kestler}@uni-ulm.de
Molecular systems biology usually refers to integrated experimental and computational
approaches for studying biomolecular networks, such as signal transduction, gene regula-
tion or metabolic systems. At the core of systems biology research lies the identification
of gene-regulatory networks from experimental data via reverse-engineering methods.
Network inference algorithms can assist life scientists in unraveling gene-regulatory sys-
tems on a molecular level. In this context, Boolean networks (Kauffman, 1969) provide a well founded framework for the reverse-engineering and analysis of gene-regulatory networks (Hickman et al., 2009). In a Boolean network, a gene is modeled as a Boolean variable that can attain two alternative levels: expressed (1) or not expressed (0). In spite of this restriction, the behaviour of real genetic networks can be described well by this "coarse-grained" model (Bornholdt, 2005). To infer a Boolean network solely from quantitative time series data, the continuous data have to be binarized. This binarization is often unreliable, since noise in gene expression data and the low number of temporal measurement points frequently lead to uncertain binarized values. We developed a novel reverse-engineering method based on Boolean networks that incorporates this uncertainty in the binarized data into the inference process.
First, we binarize the data with the fuzzy 2-means algorithm in order to obtain a binarization and a membership coefficient $p_{ij}$ for each Boolean value, indicating the reliability of the binarization. Based on the fuzzy model, multiple binarized time series are sampled via randomized rounding, using the coefficients as probabilities of membership.
For each of these binarizations and each of the genes in the network, we infer possible
dependencies by scoring all combinations of input genes. The scoring of an input gene
combination is based on the error of the best Boolean function and on the number of
involved genes. An accumulated score for each input gene combination over all binariza-
tions is calculated, and the dependencies are modeled by the best-ranked combinations.
By incorporating uncertainty into the reverse-engineering process, we improve the accuracy in terms of state transitions and network wiring. For validation, our new approach was applied to artificial data and to yeast expression time series data.
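The sampling step can be sketched in base R as follows; the membership matrix below is made up, while in the actual method the coefficients would come from fuzzy 2-means.

set.seed(4)
p <- matrix(runif(5 * 10), nrow = 5)   # genes x time points memberships

sample.binarization <- function(p)     # randomized rounding
  matrix(rbinom(length(p), 1, p), nrow = nrow(p))

binarizations <- replicate(50, sample.binarization(p))
# binarizations[,,b] is the b-th sampled binary time series ensemble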
References
KAUFFMAN, S. A. (1969): Metabolic Stability and Epigenesis in Randomly Constructed Genetic Nets. Journal of Theoretical Biology, 22(3):437-467.
HICKMAN, G. J. and HODGMAN, T. C. (2009): Inference of gene regulatory networks using Boolean-network inference methods. Journal of Bioinformatics and Computational Biology, 7(6):1013-29.
BORNHOLDT, S. (2005): Systems Biology: Less is More in Modeling Large Genetic Networks. Science, 310(5747):449-451.
Inferring Boolean network structure via
correlation
M. Maucher, B. Kracher, M. Kühl, H. A. Kestler
Institute of Neural Information Processing,
Institute for Biochemistry and Molecular Biology,
University of Ulm
{markus.maucher,hans.kestler}@uni-ulm.de
The dynamic behavior of genetic regulatory networks can be described and analyzed
using Boolean network models. The reconstruction of such a Boolean network from time
series data requires the identification of dependencies within the network. To identify the dependency structure of such a network, one can take advantage of the fact that in a gene regulatory network a specific transcription factor will often consistently either activate or inhibit a specific target gene. In this case, the observed regulatory behavior can be modeled by the use of monotone functions.
We show that Pearson correlation can identify the dependencies in a Boolean network from time series data if that network consists of monotone Boolean functions. This approach enables fast inference of Boolean networks based on an intuitive correlation measure. In experiments, we could reconstruct large fractions of a published E. coli transcriptional regulatory network and a metabolic network from simulated data, and of a yeast cell cycle network from microarray data.
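A minimal base-R sketch of the scoring rule: a candidate edge from gene j to gene i is scored by the Pearson correlation between x_j(t) and x_i(t+1); a strong absolute correlation suggests a monotone (activating or inhibiting) dependency. The two-gene time series is simulated.

set.seed(6)
len <- 100
x1  <- rbinom(len, 1, 0.5)
x2  <- c(0, x1[-len])        # x2(t+1) copies x1(t): a true edge 1 -> 2
X   <- rbind(x1, x2)

edge.scores <- function(X) {
  # entry [j, i]: correlation of regulator j at t with target i at t+1
  cor(t(X[, -ncol(X), drop = FALSE]), t(X[, -1, drop = FALSE]))
}
round(edge.scores(X), 2)     # entry [1, 2] is close to 1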
Meta-analysis Methods for Gene Expression
Profiles
Berthold Lausen
Department of Mathematical Sciences
University of Essex
blausen@essex.ac.uk
A fast increasing amount of publicly available gene expression data sets allows the use of meta-analysis techniques to validate and to identify molecular signatures. I review several recent approaches. An important condition for preprocessing methods is that the data analysis workflow should not be influenced by properties of other data sets included in the meta-analysis. For example, the preprocessing of one Affymetrix CEL file should be invariant under different sets of CEL files included in the meta-analysis. I illustrate the talk with gene expression data sets of colorectal cancer.
References
BUFFA, F.M., HARRIS, A.L., WEST, C.M., MILLER, C.J. (2010): Large Meta-analysis of Multiple Cancers Reveals a Common, Compact and Highly Prognostic Hypoxia Metagene. British Journal of Cancer, 102, 428-35.
CRONER, R., FÖRTSCH, T., BRÜCKL, W., RÖDEL, F., et al. (2008): Molecular Signature for Lymphatic Metastasis in Colorectal Carcinomas. Annals of Surgery 247, 803-810.
GORLOV, I.P., SIRCAR, K., ZHAO, H., et al. (2010): Prioritizing genes associated
with prostate cancer development. BMC Cancer 10:599.
MCCALL, M.N., BOLSTAD, B.M., IRIZARRY, R.A. (2010): Frozen robust multi-array
analysis (fRMA). Biostatistics 11, 2, 242–253.
MPINDI, J.P., SARA, H., HAAPA-PAANANEN, S. et al. (2011): GTI: A Novel Algorithm for Identifying Outlier Gene Expression Profiles from Integrated Microarray Datasets. PLoS ONE 6, 2, e17259.
SHI, F., ABRAHAM, G., LECKIE, C., HAVIV, I., KOWALCZYK, A. (2011): Meta-analysis of gene expression microarrays with missing replicates. BMC Bioinformatics 12, 84.
Connecting miRNA and mRNA expression
profiles for medical classification problems
Klaus Jung, Tim Beißbarth and Mathias Fuchs
Department of Medical Statistics,
University Medical Center Göttingen,
Department of Bioinformatics,
University Medical Center Göttingen
kjung1@uni-goettingen.de
In biomedical research, it is by now very common that different types of high-dimensional
molecular data are studied in parallel. In the past, many studies concentrated for ex-
ample only on gene expression, protein expression or genetic data, due to the high cost
of each technique or because techniques were just established in many research groups.
Meanwhile, techniques have become cheaper and more common, so that they can be
applied in parallel to study the same biological sample, e.g. a tumour biopsy or a cell
line. A typical question for studying molecular data is to find differences in the samples
from different biological groups, e.g. individuals with different phenotypes or different
response to a therapy. In particular, many studies aim at finding molecular signatures
that can be used for diagnosis or prediction. In this context, we currently study methods
for connecting the information from mRNA and miRNA expression data for classifica-
tion problems in medicine. Our particular questions are as follows. Should we first
merge mRNA and miRNA data and search then for a common signature? Or should
we first search individual signatures and combine then the correspond-ing classification
rules (Kittler et al., 1998)? Is a merged classifier better than an individual one? Is there
a benefit of having both data sources available? We evaluate the different approaches
within a simulation study and on several publicly avail-able data (e.g. Peng et al., 2009).
More precisely, we compare the prediction accuracies obtained with each approach. In
addition, we discuss several technical difficulties of each approach, for example a common
normalization of mRNA and miRNA data.
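The two strategies can be contrasted in a compact simulated example (a toy posterior model stands in for a real classifier; the mean combination rule follows the spirit of Kittler et al., 1998):

set.seed(8)
n <- 60
y     <- factor(rep(c("A", "B"), each = n / 2))
mrna  <- matrix(rnorm(n * 100, as.numeric(y)), nrow = n)  # mRNA features
mirna <- matrix(rnorm(n * 20,  as.numeric(y)), nrow = n)  # miRNA features

post <- function(x, y) {          # toy posterior: logistic fit on 1st PC
  pc <- prcomp(x)$x[, 1]
  predict(glm(y ~ pc, family = binomial), type = "response")
}

p.merged   <- post(cbind(mrna, mirna), y)           # merge data first
p.combined <- (post(mrna, y) + post(mirna, y)) / 2  # combine rules (mean)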
References
Kittler, J., Hatef, M., Duin, R.P.W. and Matas, J. (1998) On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 226-239.
Peng, X., Li, Y., Walters, K.A., Rosenzweig, E.R., Lederer, S.L., Aicher, L.D., Proll, S. and Katze, M.G. (2009) Computational identification of hepatitis C virus associated microRNA-mRNA regulatory modules in human livers. BMC Genomics, 10:373.
Integration of copy number variation and
gene expression data in Bayesian models for
prediction
Manuela Zucknick, Stefan Pfister and Axel Benner
Division of Biostatistics (C060),
Division Molecular Genetics (B060),
German Cancer Research Center, Heidelberg,
Department of Pediatric Oncology, Hematology and Immunology,
University Hospital Heidelberg
m.zucknick@dkfz-heidelberg.de
Bayesian variable selection models are an alternative to well-known regularisation
methods like lasso regression and boosting for prognostic modelling based on high-
dimensional input spaces. A typical application is prediction of clinical endpoints such
as therapy response using microarray gene expression data.
High-throughput microarray technologies are available for many other types of ge-
nomic data in addition to gene expression, and in recent years clinical researchers have
begun to systematically collect genome-wide data from various sources on the DNA-
and RNA-level as well as epigenetic data. If data from several sources are available for
the same set of biological samples, the data can be analysed together in an integrative
manner, with the aim of providing a more comprehensive picture of the disease biology
as well as improving the performance of clinical prediction models.
For example, the integration of copy number variation data into classical gene ex-
pression based prognostic models promises to improve both prognostic value and in-
terpretability of the model, because genomic deletions and amplifications are known to
affect expression levels of genes located in the corresponding genomic regions. In fact,
the deletion of chromosomal regions harbouring important tumour suppressor genes is
a well-known cause of certain cancers.
In contrast to methods like lasso and boosting, Bayesian variable selection models are
very flexible in their setup and are naturally well-suited to extensions allowing for the
integration of additional data sources.
We will propose a hierarchical Bayesian variable selection model, which combines
whole-genome information on copy number variation and gene expression in a manner
that is intuitive from a biological point of view. The model setup will be demonstrated,
as well as aspects of the MCMC sampling algorithm and posterior inference. The model
will be further illustrated in an application to pediatric brain tumour data.
On the utility of partially labeled data for
classification in high dimensional settings
Ludwig Lausser, Florian Schmid and Hans A. Kestler
Institute of Neural Information Processing,
University of Ulm,
Department of Internal Medicine I,
University Hospital Ulm
{ludwig.lausser,hans.kestler}@uni-ulm.de
Initial results gained by high throughput technologies such as microarrays or deep
sequencing are common starting points for investigations within molecular medicine or
biology. They allow the tracking of several thousand signals within a single experiment.
The data produced by such technologies is of usually high dimensionality but also of a
low cardinality. Many inferences regarding these data can be formulated as clustering
or classification tasks. In a clustering scenario the task is to find groups in a sample of
data points. It is an example for unsupervised learning. The sample does not contain
explicit information on the involved classes or groups; especially it does not contain class
labels. In a classification scenario the involved categories are known a priori. The task
is to predict the correct category of a unseen data point. Classification is an example
of a supervised learning task. Here the training set contains examples labeled according
the these categories.
Supervised and unsupervised learning are widely used paradigms in the analysis of microarray data. Other methodologies that bridge these paradigms are often neglected. These methods are based on partially labeled datasets and incorporate information gained from both labeled and unlabeled data. Examples of such concepts are transductive learning and semi-supervised learning. In this work we investigate the benefit of these learning schemes for the classification of microarray datasets.
We compare several supervised algorithms to their transductive (or semi-supervised) counterparts in real and artificial settings. The aim of the study is to investigate the influence of the high dimensionality on the generalization ability and the robustness of the algorithms.
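As one concrete example of such a scheme, a generic self-training wrapper (our illustration, not necessarily among the algorithms compared in the talk) turns a supervised k-NN into a semi-supervised procedure by iteratively adopting its confident predictions as labels:

library(class)   # for knn()
set.seed(9)
x   <- matrix(rnorm(100 * 2, rep(c(0, 3), each = 50)), ncol = 2)
y   <- rep(c("A", "B"), each = 50)
lab <- sample(100, 10)            # indices of the few labeled points

self.train <- function(x, y.lab, lab, k = 3, rounds = 5) {
  labeled <- lab; labels <- y.lab
  for (r in seq_len(rounds)) {
    unl <- setdiff(seq_len(nrow(x)), labeled)
    if (!length(unl)) break
    pred <- knn(x[labeled, , drop = FALSE], x[unl, , drop = FALSE],
                labels, k = k, prob = TRUE)
    conf <- attr(pred, "prob") >= 0.9   # adopt confident predictions only
    if (!any(conf)) break
    labeled <- c(labeled, unl[conf])
    labels  <- c(labels, as.character(pred[conf]))
  }
  list(index = labeled, label = labels)
}

res <- self.train(x, y[lab], lab)
mean(res$label == y[res$index])   # agreement with the hidden true labels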
Multi-objective parameter selection for
classifiers
C. M¨ussel, L. Lausser, M. Maucher and H. A. Kestler
Institute of Neural Information Processing,
University of Ulm
{christoph.muessel,ludwig.lausser,markus.maucher,hans.kestler}@uni-ulm.de
The choice of appropriate values for parameters is an essential step in classifier training
and can have a major influence on the classification performance. Often, such parameters
are set according to rules of thumb. Parameter tuning is an automated way of adapting
parameters. Most frequently, parameters are tuned according to a single criterion, such
as the cross-validation error, which can be a good estimate of the generalization per-
formance. However, it is sometimes desirable to obtain parameter values that optimize
several concurrent criteria at the same time. For example, sensitivity and specificity are
important but usually conflicting characteristics of a classifier. Dominance-based
selection procedures allow for a simultaneous optimization of multiple objectives. They
leave the ultimate decision on the desired trade-off of objectives to the human expert.
We devised the R package TunePareto for multi-objective selection of parameters for
classifiers. The software chooses candidate parameter configurations according to so-
phisticated sampling strategies and search heuristics, such as quasi-random sequences
and evolutionary algorithms. It then determines the optimal configurations using Pareto
dominance. The package provides flexible interfaces for classifiers and objective func-
tions. The decision making process is supported by various visualizations as well as the
formal definition of desired and undesired objective values.
We present a tutorial on the functionality and usage of the TunePareto package.
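The core selection rule can be illustrated in a few lines of base R (TunePareto's own interface differs; the error matrix here is invented, with both objectives to be minimized):

dominates <- function(a, b) all(a <= b) && any(a < b)

pareto.front <- function(obj) {       # obj: configurations x objectives
  keep <- sapply(seq_len(nrow(obj)), function(i)
    !any(sapply(seq_len(nrow(obj))[-i], function(j)
      dominates(obj[j, ], obj[i, ]))))
  obj[keep, , drop = FALSE]
}

obj <- cbind(fnr = c(0.10, 0.20, 0.05, 0.15),
             fpr = c(0.30, 0.10, 0.40, 0.35))
pareto.front(obj)   # configuration 4 is dominated by configuration 1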
The Daim package: Diagnostic accuracy
of classification models
Sergej Potapov, Berthold Lausen and Werner Adler
Department of Biometry and Epidemiology
University of Erlangen-Nuremberg
Department of Mathematical Sciences
University of Essex
{sergej.potapov, werner.adler}@imbe.med.uni-erlangen.de
The Daim package contains several functions for evaluating the accuracy of classification models by ROC analysis (Fawcett, 2006). It provides the following performance measures: "cv", "bcv", "0.632" and "0.632+" estimation of the misclassification rate, sensitivity, specificity and AUC (Efron & Tibshirani, 1997; Adler & Lausen, 2009).
The package provides a flexible interface to classifier functions and facilitates intuitive
evaluation of predictive models. If an application is computationally intensive, parallel
execution can be used in a simple manner to reduce the time taken.
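For instance, the AUC itself can be obtained from the rank (Wilcoxon) statistic of the classifier scores, a standard identity underlying ROC analysis in general (this snippet is our illustration, not Daim code):

auc <- function(score, truth) {   # truth: logical, TRUE = positive class
  r  <- rank(score)
  n1 <- sum(truth); n0 <- sum(!truth)
  (sum(r[truth]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(10)
truth <- rep(c(TRUE, FALSE), each = 50)
score <- rnorm(100, mean = ifelse(truth, 1, 0))
auc(score, truth)   # about 0.76 for unit-variance normals one SD apart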
References
EFRON, B. and TIBSHIRANI, R. (1997): Improvements on Cross-Validation: The .632+ Bootstrap Method. JASA, 92(438), 548-560.
FAWCETT, T. (2006): An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
ADLER, W. and LAUSEN, B. (2009): Bootstrap estimated true and false positive rates and ROC curve. Comput. Stat. Data Anal., 53(3), 718-729.
Correcting the optimally selected
resampling-based error rate: A smooth
analytical alternative to nested
cross-validation
Christoph Bernau, Thomas Augustin and Anne-Laure Boulesteix
Department of Medical Informatics, Biometry and Epidemiology (IBE),
University of Munich,
Department of Statistics,
University of Munich
bernau@ibe.med.uni-muenchen.de
Many statistical problems in bioinformatics are high-dimensional binary classification
tasks, e.g. the classification of microarray samples into normal and cancer tissues. In this
context, statistical learning methods usually incorporate a tuning parameter adjusting
their complexity to the specific examined data set. By simply reporting the performance
of the best tuning parameter value, overly optimistic prediction errors have been pub-
lished in the past (Varma and Simon, 2006). A straightforward approach to avoid this
tuning bias is nested cross-validation (CV).
In this talk we are addressing two objectives. Firstly, we develop a new method
correcting for this tuning bias by embedding the tuning problem into a decision theoretic
framework. The method is based on the decomposition of the unconditional error rate
involving the tuning procedure. Our corrected error estimator can be reformulated
as a weighted mean of resampling errors obtained using the different tuning parameter
values. In this sense, it can be interpreted as a smooth version of nested CV. The smooth
weighting additionally guarantees intuitive bounds for the corrected error. Secondly, we
suggest to also use bias correction methods to address the bias resulting from the optimal
choice of the learning method. The latter bias is particularly relevant to prediction
problems based on high-dimensional "omics" data. In the absence of standards, it is indeed common practice to apply several methods successively. This can lead to an optimistic bias similar to the tuning bias if one reports only the performance of the optimal method.
We demonstrate the performance of our new method in addressing both types of bias on four microarray cancer data sets and compare it to existing methods. Our main result is that our approach yields intuitively bounded estimates similar to nested CV, at a dramatically lower computational price.
References
S. Varma and R. Simon (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7:91.
Bias-Variance Analysis of Local
Classification Methods
Julia Schiffner and Claus Weihs
Department of Statistics,
TU Dortmund
{schiffner, weihs}@statistik.tu-dortmund.de
Nowadays a plethora of classification methods is available and new ones or modi-
fications of established methods are regularly published. They can be grouped using
different properties as e.g. parametric or nonparametric, distance-based or not, pre-
dictive or generative etc. Another distinction can be made between global and local
methods. In recent years the number of publications on local classification methods has been increasing. Localized versions of nearly all standard classification techniques, like linear
discriminant analysis [1] and Fisher discriminant analysis [6], logistic regression [3, 7],
support vector machines [5] or boosting [8] are available.
The term local is only vaguely defined and used in a rather intuitive way by most
authors, referring to the position in some space, to a part of a whole or to something
that is not general or widespread. Often it relates to the neighborhood of the point where a prediction is required, with the k-nearest-neighbors method [2] as probably the
best-known example. But also other concepts of locality can be found in the relevant
literature. For example Hand and Vinciotti [3] use the term local to refer to points
close to the decision boundary. Most localization techniques can be applied in a generic
manner to many different classification methods which results in a rather broad field of
methods. We will give an overview of existing approaches and their properties.
A question of interest is how localization affects the performance of classification
methods. The bias-variance decomposition of prediction error is conducive to gaining
deeper insight into the behavior of learning algorithms. It was originally introduced for
quadratic loss functions, but since in classification the misclassification rate is usually of
interest, generalizations to zero-one loss have been developed in the last 15 years, e.g. [4].
This was particularly motivated by research on multi-classifier systems, where variance reduction was found to be one explanation for their often good performance.
In order to gain deeper insight into how local methods work, we analyze local classification methods in terms of bias and variance of the error rate. Our intuition, which is supported by our recent experiments, is that local methods in general reduce the bias in comparison with their global counterparts. We will show some toy examples to illustrate the decomposition and present some results for selected classification methods and localization types on simulated and real-world data sets.
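A small simulation sketch of such a decomposition for zero-one loss (in the spirit of [4]: bias = the main prediction differs from the true class, variance = disagreement with the main prediction; the two-class setting below is a toy of our own):

library(class)   # for knn()
set.seed(11)
x.test <- matrix(c(0.2, 0.2), nrow = 1)   # test point in class A territory
true.class <- "A"

preds <- replicate(200, {                 # resample the training set
  x <- matrix(rnorm(40 * 2, rep(c(0, 1), each = 20)), ncol = 2)
  y <- rep(c("A", "B"), each = 20)
  as.character(knn(x, x.test, y, k = 1))
})

main     <- names(which.max(table(preds)))   # main (modal) prediction
bias     <- as.numeric(main != true.class)
variance <- mean(preds != main)
c(bias = bias, variance = variance)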
References
[1] I. Czogiel, K. Luebke, M. Zentgraf, and C. Weihs. Localized linear discriminant
analysis. In R. Decker and H.-J. Lenz, editors, Advances in Data Analysis, volume 33
of Studies in Classification, Data Analysis, and Knowledge Organization, pages 133–
140, Berlin Heidelberg, 2007. Springer.
[2] E. Fix and J. L. Hodges. Discriminatory analysis: Nonparametric discrimination: Consistency properties. Report 4, U.S. Air Force School of Aviation Medicine, Randolph Field, Texas, 1951.
[3] D. J. Hand and V. Vinciotti. Local versus global models for classification problems:
Fitting models where it matters. The American Statistician, 57(2):124–131, May
2003.
[4] G. M. James. Variance and bias for general loss functions. Machine Learning, 51(2):
115–135, May 2003.
[5] N. Segata and E. Blanzieri. Fast and scalable local kernel machines. Journal of Machine Learning Research, 11:1883-1926, June 2010.
[6] M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher
discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, May
2007.
[7] G. Tutz and H. Binder. Localized classification. Statistics and Computing, 15:155–
166, 2005.
[8] C.-X. Zhang and J.-S. Zhang. A local boosting algorithm for solving classification
problems. Computational Statistics & Data Analysis, 52:1928–1941, 2008.
... We performed the differential expression analysis using the DESeq2 R package 1.20.0 (Anders and Huber, 2010). We adjusted the p-values using the Benjamini and Hochberg's approach for controlling the false discovery rate (Green and Diggle, 2007). ...
... The first was between a minimum coverage of six (below which heterozygous sites would be frequently missed) and a maximum coverage of 20 (above which the distribution inflected away from the mean, suggesting repetitive sequences). The second was a CV between 0.7 and 1.3, given that the expected value is approximately 1 for Poisson-distributed counts (Anders and Huber 2010), and strong divergence from this expectation suggests copynumber variation among samples. Approximately 102,000 tags were retained after coverage filtering. ...
Article
Full-text available
The Arizona Toad (Anaxyrus microscaphus) is restricted to riverine corridors and adjacent uplands in the arid southwestern United States. As with numerous amphibians worldwide, populations are declining and face various known or suspected threats, from disease to habitat modification resulting from climate change. The Arizona Toad has been petitioned to be listed under the U.S. Endangered Species Act and was considered “warranted but precluded” citing the need for additional information – particularly regarding natural history (e.g., connectivity and dispersal ability). The objectives of this study were to characterize population structure and genetic diversity across the species’ range. We used reduced-representation genomic sequencing to genotype 3,601 single nucleotide polymorphisms in 99 Arizona Toads from ten drainages across its range. Multiple analytical methods revealed two distinct genetic groups bisected by the Colorado River; one in the northwestern portion of the range in southwestern Utah and eastern Nevada and the other in the southeastern portion of the range in central and eastern Arizona and New Mexico. We also found subtle substructure within both groups, particularly in central Arizona where toads at lower elevations were less connected than those at higher elevations. The northern and southern parts of the Arizona Toad range are not well connected genetically and could be managed as separate units. Further, these data could be used to identify source populations for assisted migration or translocations to support small or potentially declining populations.
... After transcript annotation, differential gene expression (DEG) analysis between treated and control samples was performed using DESeq2 v1.18.1 program. The differential expressed transcripts were identified with p-value cut-off 0.05 (Anders and Huber 2010;Love et al. 2014). The log2FC (Fold change) value of ≥ 2/ ≤ − 2 was set as the threshold to identify transcript abundance between treated and control sample.. edgeR-3.20.9 was used for constructing MA plot and volcano plot. ...
Article
Full-text available
Spodoptera litura is a destructive lepidopteran generalist pest widespread in tropical and subtropical regions and causes huge yield loss by gregarious feeding on crop plants. During co-evolution, Zea mays (Var. African tall) has attained a well-crafted defence mechanism and can demote the performance of its invaders. When an insect feeds on a host/non-host plant, its digestive system needs to upregulate the first line of defence against a broad spectrum of antifeedants and toxins of host origin. To understand the molecular mechanisms underlying insect response to plant resistance factors, a comparative midgut transcriptome of Spodoptera litura fed on maize and control plants was investigated, which identified a total of 712 differentially expressed genes (DEGs), including 232 up-regulating and 480 down-regulating genes. Gene ontology, gene enrichment and pathway analysis revealed that upregulated genes are involved in carbohydrate metabolism, detoxification, defence, lipid metabolism, digestion, and signal transduction. In contrast, down-regulated genes were primarily linked to cytoskeleton, transport, signalling, carbohydrate and lipid metabolism, growth and developmental processes. The above results indicate an antinutritional stress on S. litura, which leads to a compensatory mechanism in the insect by enhanced digestibility and detoxification at the cost of growth and development. This study provides an overall understanding of the transcriptomic response of S. litura upon feeding on a suboptimal host. Nevertheless, our study forms the basis for future molecular studies on S. litura adaptation and may widen the scope for their management.
... We performed the differential expression analysis using the DESeq2 R package 1.20.0 (Anders and Huber, 2010). We adjusted the p-values using the Benjamini and Hochberg's approach for controlling the false discovery rate (Green and Diggle, 2007). ...
Article
Metal pollution caused by deep-sea mining activities has potential detrimental effects on deep-sea ecosystems. However, our knowledge of how deep-sea organisms respond to this pollution is limited, given the challenges of remoteness and technology. To address this, we conducted a toxicity experiment by using deep-sea mussel Gigantidas platifrons as model animals and exposing them to different copper (Cu) concentrations (50 and 500 μg/L) for 7 days. Transcriptomics and LC-MS-based metabolomics methods were employed to characterize the profiles of transcription and metabolism in deep-sea mussels exposed to Cu. Transcriptomic results suggested that Cu toxicity significantly affected the immune response, apoptosis, and signaling processes in G. platifrons. Metabolomic results demonstrated that Cu exposure disrupted its carbohydrate metabolism, anaerobic metabolism and amino acid metabolism. By integrating both sets of results, transcriptomic and metabolomic, we find that Cu exposure significantly disrupts the metabolic pathway of protein digestion and absorption in G. platifrons. Furthermore, several key genes (e.g., heat shock protein 70 and baculoviral IAP repeat-containing protein 2/3) and metabolites (e.g., alanine and succinate) were identified as potential molecular biomarkers for deep-sea mussel’s responses to Cu toxicity. This study contributes novel insight for assessing the potential effects of deep-sea mining activities on deep-sea organisms.
Article
Full-text available
The human microbiome, comprising microorganisms residing within and on the human body, plays a crucial role in various physiological processes and has been linked to numerous diseases. To analyze microbiome data, it is essential to account for inherent heterogeneity and variability across samples. Normalization methods have been proposed to mitigate these variations and enhance comparability. However, the performance of these methods in predicting binary phenotypes remains understudied. This study systematically evaluates different normalization methods in microbiome data analysis and their impact on disease prediction. Our findings highlight the strengths and limitations of scaling, compositional data analysis, transformation, and batch correction methods. Scaling methods like TMM show consistent performance, while compositional data analysis methods exhibit mixed results. Transformation methods, such as Blom and NPN, demonstrate promise in capturing complex associations. Batch correction methods, including BMC and Limma, consistently outperform other approaches. However, the influence of normalization methods is constrained by population effects, disease effects, and batch effects. These results provide insights for selecting appropriate normalization approaches in microbiome research, improving predictive models, and advancing personalized medicine. Future research should explore larger and more diverse datasets and develop tailored normalization strategies for microbiome data analysis.
Article
Full-text available
The main problem with localized discriminant techniques is the curse of dimensionality, which seems to restrict their use to the case of few variables. However, if localization is combined with a reduction of dimension the initial number of variables is less restricted. In particular it is shown that localization yields powerful classifiers even in higher dimensions if localization is combined with locally adaptive selection of predictors. A robust localized logistic regression (LLR) method is developed for which all tuning parameters are chosen data-adaptively. In an extended simulation study we evaluate the potential of the proposed procedure for various types of data and compare it to other classification procedures. In addition we demonstrate that automatic choice of localization, predictor selection and penalty parameters based on cross validation is working well. Finally the method is applied to real data sets and its real world performance is compared to alternative procedures.
Article
Full-text available
A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach guarantees high generalization ability by dividing the separation function in local optimization problems that can be handled very efficiently. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with datasets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional datasets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability.
Article
Full-text available
Microarray gene expression time-course experiments provide the opportunity to observe the evolution of transcriptional programs that cells use to respond to internal and external stimuli. Most commonly used methods for identifying differentially expressed genes treat each time point as independent and ignore important correlations, including those within samples and between sampling times. Therefore they do not make full use of the information intrinsic to the data, leading to a loss of power. We present a flexible random-effects model that takes such correlations into account, improving our ability to detect genes that have sustained differential expression over more than one time point. By modeling the joint distribution of the samples that have been profiled across all time points, we gain sensitivity compared to a marginal analysis that examines each time point in isolation. We assign each gene a probability of differential expression using an empirical Bayes approach that reduces the effective number of parameters to be estimated. Based on results from theory, simulated data, and application to the genomic data presented here, we show that BETR has increased power to detect subtle differential expression in time-series data. The open-source R package betr is available through Bioconductor. BETR has also been incorporated in the freely-available, open-source MeV software tool available from http://www.tm4.org/mev.html.
Article
: The discrimination problem (two population case) may be defined as follows: e random variable Z, of observed value z, is distributed over some space (say, p-dimensional) either according to distribution F, or according to distribution G. The problem is to decide, on the basis of z, which of the two distributions Z has.
Article
A methodological and computational framework for centroid-based partitioning cluster analysis using arbitrary distance or similarity measures is presented. The power of high-level statistical computing environments like R enables data analysts to easily try out various distance measures with only minimal programming effort. A new variant of centroid neighborhood graphs is introduced which gives insight into the relationships between adjacent clusters. Artificial examples and a case study from marketing research are used to demonstrate the influence of distances measures on partitions and usage of neighborhood graphs.
Article
Based on the boosting-by-resampling version of Adaboost, a local boosting algorithm for dealing with classification tasks is proposed in this paper. Its main idea is that in each iteration, a local error is calculated for every training instance and a function of this local error is utilized to update the probability that the instance is selected to be part of next classifier's training set. When classifying a novel instance, the similarity information between it and each training instance is taken into account. Meanwhile, a parameter is introduced into the process of updating the probabilities assigned to training instances so that the algorithm can be more accurate than Adaboost. The experimental results on synthetic and several benchmark real-world data sets available from the UCI repository show that the proposed method improves the prediction accuracy and the robustness to classification noise of Adaboost. Furthermore, the diversity-accuracy patterns of the ensemble classifiers are investigated by kappa-error diagrams.
Article
When using squared error loss, bias and variance and their decomposition of prediction error are well understood and widely used concepts. However, there is no universally accepted definition for other loss functions. Numerous attempts have been made to extend these concepts beyond squared error loss. Most approaches have focused solely on 0-1 loss functions and have produced significantly different defini- tions. These differences stem from disagreement as to the essential characteristics that variance and bias should display. This paper suggests an explicit list of rules that we feel any "reasonable" set of definitions should satisfy. Using this framework, bias and variance definitions are produced which generalize to any symmetric loss function. We illustrate these statistics on several loss functions with particular emphasis on 0-1 loss. We conclude with a discussion of the various definitions that have been proposed in the past as well as a method for estimating these quantities on real data sets.
Article
Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick.
Article
Robust multiarray analysis (RMA) is the most widely used preprocessing algorithm for Affymetrix and Nimblegen gene expression microarrays. RMA performs background correction, normalization, and summarization in a modular way. The last 2 steps require multiple arrays to be analyzed simultaneously. The ability to borrow information across samples provides RMA various advantages. For example, the summarization step fits a parametric model that accounts for probe effects, assumed to be fixed across arrays, and improves outlier detection. Residuals, obtained from the fitted model, permit the creation of useful quality metrics. However, the dependence on multiple arrays has 2 drawbacks: (1) RMA cannot be used in clinical settings where samples must be processed individually or in small batches and (2) data sets preprocessed separately are not comparable. We propose a preprocessing algorithm, frozen RMA (fRMA), which allows one to analyze microarrays individually or in small batches and then combine the data for analysis. This is accomplished by utilizing information from the large publicly available microarray databases. In particular, estimates of probe-specific effects and variances are precomputed and frozen. Then, with new data sets, these are used in concert with information from the new arrays to normalize and summarize the data. We find that fRMA is comparable to RMA when the data are analyzed as a single batch and outperforms RMA when analyzing multiple batches. The methods described here are implemented in the R package fRMA and are currently available for download from the software section of http://rafalab.jhsph.edu.
Article
The modeling of genetic networks especially from microarray and related data has become an important aspect of the biosciences. This review takes a fresh look at a specific family of models used for constructing genetic networks, the so-called Boolean networks. The review outlines the various different types of Boolean network developed to date, from the original Random Boolean Network to the current Probabilistic Boolean Network. In addition, some of the different inference methods available to infer these genetic networks are also examined. Where possible, particular attention is paid to input requirements as well as the efficiency, advantages and drawbacks of each method. Though the Boolean network model is one of many models available for network inference today, it is well established and remains a topic of considerable interest in the field of genetic network inference. Hybrids of Boolean networks with other approaches may well be the way forward in inferring the most informative networks.