imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel.
ABSTRACT Interactive modules for Data Exploration and Visualization (imDEV) is a Microsoft Excel spreadsheet embedded application providing an integrated environment for the analysis of omics data through a user-friendly interface. Individual modules enables interactive and dynamic analyses of large data by interfacing R's multivariate statistics and highly customizable visualizations with the spreadsheet environment, aiding robust inferences and generating information-rich data visualizations. This tool provides access to multiple comparisons with false discovery correction, hierarchical clustering, principal and independent component analyses, partial least squares regression and discriminant analysis, through an intuitive interface for creating high-quality two- and a three-dimensional visualizations including scatter plot matrices, distribution plots, dendrograms, heat maps, biplots, trellis biplots and correlation networks. Availability and implementation: Freely available for download at http://sourceforge.net/projects/imdev/. Implemented in R and VBA and supported by Microsoft Excel (2003, 2007 and 2010).
Published by Oxford University Press. All rights reserved. For Permissions, please email: firstname.lastname@example.org
Summary: Interactive modules for Data Exploration and Visu-
alization (imDEV) is a Microsoft Excel spreadsheet embedded
application providing an integrated environment for the analysis
of omics data through a user-friendly interface. Individual mod-
ules enable interactive and dynamic analyses of large data by
interfacing R’s multivariate statistics and highly customizable
visualizations with the spreadsheet environment, aiding robust
inferences and generating information rich data visualizations.
This tool provides access to multiple comparisons with false
discovery correction, hierarchical clustering, principal (PCA)
and independent component analyses (ICA), partial least squares
regression (PLS) and discriminant analysis (PLS-DA), through
an intuitive interface for creating high quality two and a three-
dimensional visualizations including scatter plot matrices, dis-
tribution plots, dendrograms, heat maps, biplots, trellis biplots
and correlation networks.
Availability and Implementation: Freely available for down-
load at http://sourceforge.net/projects/imdev/. Implemented in R
and VBA and supported by Microsoft Excel (2003, 2007 and
Contact: John W. Newman – John.Newman@ars.usda.gov
Supplementary information: Installation instructions, tutorials,
and users manual
are available at
Omics experiments generate complex high dimensional data
requiring multivariate analyses. While basic spreadsheets are
widely used for data storage and low level statistical analyses,
these currently lack tools for multivariate analyses and visualiza-
tion. In contrast, the R Project for Statistical Computing is a
freely available software environment (R Development Core
Team, 2011) that provides a variety of multivariate data analysis
and visualization methods. Despite its power, the command-line
interface of R is a barrier to its broad use. Through the applica-
tion of RExcel (Baier and Neuwirth, 2007) , imDEV unites the
Microsoft Excel (MS Excel) spreadsheet and R, to provide a
user-friendly graphical interface to multivariate analysis and
visualization. imDEV also facilitates multivariate data interpre-
tation by integrating dynamic data visualizations with univariate
statistics, dimensional reduction methods, predictive modeling
and network analyses tools.
*To whom correspondence should be addressed.
Software imDEV (http://sourceforge.net/projects/imdev/), is
implemented in Visual Basic for Applications (VBA) and R
programming languages, and depends on the open source appli-
cations statconnDCOM (v3.1-2B7) and RExcel (v3.2.2)
(http://rcom.univie.ac.at/) for integration as an add-in into MS
Excel versions 2003, 2007 and 2010.
Pull down menus in MS Excel provide a modular interface
organized by analysis type. Background processes translate user
inputs from VBA to R, and R source scripts execute calculations
and generate visualizations. Numerical outputs from R are re-
turned to user defined ranges or worksheets in MS Excel. Visu-
alizations can be exported directly from the R plotting interface
in a variety of file formats. User-defined inputs are stored as
named ranges in MS Excel, enabling dynamic loading of R ob-
jects between analyses sessions.
Features Interactive modules allow the user to rapidly progress
from routine data analysis tasks to the generation of complex
network representations of multivariate classification or predic-
tion models. Data pre-treatment tools include variable specific
normality transformations, missing values imputation, centering,
and a variety of scaling methods. Hypothesis testing is supported
by parametric and non-parametric comparisons of class means
with multiple comparison adjustments, q-value calculations, and
tabular result outputs.
Analysis and visualization of variable correlations depend on
the significance of correlation coefficients from user selected
parametric or rank-order tests (i.e. Pearson, Spearman and Ken-
dall). Analysis of variable or sample relatedness is facilitated by
hierarchal cluster analyses supported by a variety of distance
and agglomeration methods, and aided by multivariate, group-
specific visualizations of variable distributions, linear correla-
tions, correlation matrix heat maps and dendrograms.
Exploratory data analyses are supported through dimensional
reduction and projection pursuits. These include outlier-
insensitive methods such as probabilistic and Bayesian principal
component analyses (Stacklies, et al., 2007) and independent
components analysis (ICA), useful for separating analytical
noise from biologically relevant information (Marchini, et al.,
2010). Interpretation of data projections are aided by customiza-
ble, single to multiple-component, two- or three-dimensional
biplots for model eigenvalues, scores, and loadings.
Multivariate predictive modeling can be used to reduce data
complexity through the segregation of informative from non-
informative variables. Multivariate regression and classification
are supported through an interface for fitting, validating and
optimizing partial least squares regression and classification
imDEV: a Graphical User Interface to R Multivariate Analysis
Tools in Microsoft Excel
Dmitry Grapov1,2, John W. Newman1,2*
1Department of Nutrition, University of California, Davis CA, 95616.
2Obesity and Metabolism Research Unit, USDA, ARS, Western Human Nutrition Research Center, Davis, CA, 95616
Associate Editor: Dr. Jonathan Wren
Bioinformatics Advance Access published July 18, 2012
at University of California, Davis on July 19, 2012
models (Mevik, et al., 2011). Models can be fit using NIPALS,
SIMPLS, and kernel methods. To avoid over-fitting, tools to co-
mpare results with permuted null models and training and test
set based validations are provided. Filter (Romanski, 2009),
wrapper (Markowetz and Spang, 2005) and hybrid variable se-
lection methods are implemented to optimize model perfor-
mance, which is internally and externally cross-validated using
user-defined or PCA optimized (Kennard and Stone, 1969)
training and/or test splits of the original data. Tools for penal-
ized feature selection methods (Goeman, et al., 2012; Kraemer
and Boulesteix, 2011), e.g. L1 (lasso) and L2 (ridge) are under
Correlation networks provide an intuitive method to integrate
and analyze relationships among data and metadata. User-
defined, projection based or multidimensional scaled graph lay-
outs are used to generate undirected two- or three-dimensional
weighted graphs, which through a dynamic edge drawing and
vertex annotating interface can be used to explore, highlight or
identify linear dependencies among variables. To limit spurious
edges and visual complexity of correlation driven networks,
future expansions will implement methods for undirected Gauss-
ian graphical Markov model generation (Castelo and Roverato,
imDEV provides a visualization centric integrated environment
for multivariate analyses based on projection pursuits, multivari-
ate predictive modeling and network generation. To highlight
RESULTS AND DISCUSSION
the versatility of imDEV, a previously reported metabolomic
data set comprised of serum free fatty acids, oxylipins and endo-
cannabinoids (Psychogios, et al., 2011) was combined with as-
sociated clinical data (NIH Project Number 5R01HL076238-03,
Oxidized Linoleic Acid, Aldosterone and Obesity) and analyzed
to identify relationships between circulating metabolites and
systolic blood pressure (sBP).
The relationship between metabolomic variables (n=174) and
sBP in men (n=30) and women (n=38), are explored (Figure 1).
Visual inspection of PCA scores and loadings (Figure 1A) for
the 2nd and 3rd components highlights gender-specific differ-
ences in metabolite and sBP measurements. Since females dis-
played lower sBP than their male counterparts (Mann-Whitney
U-test, p<0.05) gender-independent predictors for sBP were
explored. Variable-specific transformations (Box and Cox,
1964) were used to achieve normal distributions (D'Agostino
and Stephens, 1986). Missing values were imputed using a ten
component probabilistic PCA model (Stacklies, et al., 2007).
Pre-treated values were linearly adjusted for gender (Figure 1B),
centered, scaled to unit variance and used to fit a PLS predictive
model for sBP. Hybrid, filter/subset, feature selection was used
to reduced model complexity by 81%, compared with the full
parameter model, while maintaining goodness-of-fit (Q2) and
root mean squared error of prediction (RMSEP) model statistics.
The selected feature set/model was externally validated by com-
paring the distributions for its Q2 (Figure 1 C1) and RMSEP
(Figure 1 C2) statistics, based on 100 permutations of model
training (2/3 of the samples) and testing (1/3 hold out set) pro-
cedures, to their respective null distributions, generated from
Figure 1. Analysis of the relationship between circulating metabolite levels and systolic blood pressure (sBP). Exploratory PCA was
used to identify gender-specific differences in metabolite concentrations and sBP. A PLS model was developed for the prediction of sBP
given gender-adjusted metabolite concentrations. Network analysis of the PLS model parameters was used to highlight a previously known
relationship between sBP and gamma glutamyl transferase (GGT), and identified a novel group of related metabolites (Group 2) which are
negatively correlated with both sBP and GGT. A) PCA scores and loadings trellis-plots. Scores (bottom left) where size indicates sBP,
while color and shape indicate gender (pink squares, female; blue diamonds, male), with an outlier highlighted in red. Loading (top right)
sizes indicate p-values from Mann-Whitney U-test for gender-specific differences in metabolite concentrations. B) Variable distribution and
scatter plot matrix displaying the effect of covariate adjustment for gender on a representative variable. C) Comparison of gender-adjusted
sBP PLS model Q2 (C1) and RMSEP (C2) statistic distributions to their respective permuted null distributions. D) A multi-dimensionally
scaled PLS model correlation network visualizing correlations (Spearman’s rho, p<0.05) displayed by colored edges (orange, positive; blue,
negative) between systolic blood pressure (sBP, red diamond) and model parameters (triangles). Triangle (i.e. network vertex) characteris-
tics encode PLS coefficient magnitude (size) and sign (upward, positive; inverted, negative). Major groups of correlated variables, defined
by a hierarchal cluster analysis, are displayed by ellipses, and biologically related classes of metabolites are shown using similar vertex col-
ors and polygons.
at University of California, Davis on July 19, 2012
100 randomly permuted sBP models. The out of sample error
rate defined by the RMSEP for the developed model, containing
6 clinical and 27 metabolite parameters, was significantly lower
than that expected by random chance (Figure 1 C2). A correla-
tion network (Figure 1D) was used to interpret the model and
study the interactions between its components. A hierarchal
cluster analysis was used to define 4 major groups of correlated
parameters, which are highlighted by ellipses in the multidimen-
sional scaled defined network space (Figure 1D). In this graph,
vertex positions are defined by similarities in rank-order correla-
tions and vertex shape and size by the sign and magnitude of the
PLS model coefficients. Vertex and polygon colors are the same
for biologically related species. Correlations between model
parameters and sBP are shown by connecting edges. Systolic
blood pressure is shown to be positively correlated with gamma
glutamyl transferase (GGT) and negatively correlated with me-
tabolite Group 2. GGT is a known independent risk factor for
cardiovascular disease (CVD) and metabolic syndrome (Lee, et
al., 2007). Interestingly, metabolite Group 2, consisting of two
classes of biologically related molecules, is negatively correlated
with both sBP and GGT. Further investigation of theses species
may provide sensitive markers for CVD risk and aid in the de-
velopment of a mechanistic relationship between sBP, GGT and
these circulating metabolites.
In summary imDEV consists of a VBA-based graphical user
interface that accesses R-based visualizations and statistical
algorithms, which operate on data stored within and returned to
named ranges or worksheets within Excel. This developed frame
work for imDEV ensures that new analytical tools and methods
can be incorporated into future interactive modules. The devel-
oped software interfacing exploratory, predictive and network-
ing tasks allows for a rapid and dynamic multivariate analysis
workflow which focuses on understanding the data at a systems-
level, aiding its interpretation within a biological context.
We acknowledge the exceptional work of the R Development
Core Team, authors of the R community contributed packages
used in imDEV, the RExcel and statconn developers, and im-
DEV early adopters, who have made this project possible. The
USDA is an equal opportunity employer. Funding: This project
was funded by the National Institutes of Health [T32-
GM008799, R01DK078328-01] and the United States Depart-
ment of Agriculture [5306-51530-019-00D].
Baier, T. and Neuwirth, E. (2007) Excel :: COM :: R, Computa-
tional Statistics, 22, 91-108.
Box, G.E.P. and Cox, D.R. (1964) An analysis of transfor-
mations, Journal of the Royal Statisistical Society, 26, 211-252.
Castelo, R. and Roverato, A. (2006) A robust procedure for
Gaussian graphical model search from microarray data with p
larger than n, J Mach Learn Res, 7, 2621-2650.
D'Agostino, R.B. and Stephens, M.A. (eds) (1986) Goodness-of-
Fit Techniques. Marcel Dekker, New York.
Goeman, J.J., Meijer, R. and Chaturvedi, N. (2012) penalized:
L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in
GLMs and in the Cox model. R Foundation for Statistical Com-
Kennard, R.W. and Stone, L.A. (1969) Computer aided design
of experiments, Technometrics, 11, 137-148.
Kraemer, N. and Boulesteix, A.-L. (2011) ppls: Penalized Partial
Least Squares. R Foundation for Statistical Computing, Vienna.
Lee, D.S., Evans, J.C., Robins, S.J., Wilson, P.W., Albano, I.,
Fox, C.S., Wang, T.J., Benjamin, E.J., D'Agostino, R.B. and
Vasan, R.S. (2007) Gamma glutamyl transferase and metabolic
syndrome, cardiovascular disease, and mortality risk: the Fram-
ingham Heart Study, Arteriosclerosis, thrombosis, and vascular
biology, 27, 127-133.
Marchini, J.L., Heaton, C. and Ripley, B.D. (2010) fastICA:
FastICA Algorithms to perform ICA and Projection Pursuit. R
Foundation for Statistical Computing, Vienna.
Markowetz, F. and Spang, R. (2005) Molecular diagnosis. Clas-
sification, model selection and performance evaluation, Methods
of information in medicine, 44, 438-443.
Mevik, B.-H., Wehrens, R. and Liland, K.H. (2011) pls: Partial
Least Squares and Principal Component regression. R Founda-
tion for Statistical Computing, Vienna.
Psychogios, N., Hau, D.D., Peng, J., Guo, A.C., Mandal, R.,
Bouatra, S., Sinelnikov, I., Krishnamurthy, R., Eisner, R., Gau-
tam, B., Young, N., Xia, J., Knox, C., Dong, E., Huang, P., Hol-
lander, Z., Pedersen, T.L., Smith, S.R., Bamforth, F., Greiner,
R., McManus, B., Newman, J.W., Goodfriend, T. and Wishart,
D.S. (2011) The human serum metabolome, PLoS One, 6,
R Development Core Team (2011) R: A Language and Envi-
ronment for Statistical Computing. R Foundation for Statistical
Romanski, P. (2009) FSelector: Selecting attributes. R Founda-
tion for Statistical Computing, Vienna.
Stacklies, W., Redestig, H., Scholz, M., Walther, D. and Selbig,
J. (2007) pcaMethods--a bioconductor package providing PCA
methods for incomplete data, Bioinformatics, 23, 1164-1167.
at University of California, Davis on July 19, 2012