## About

81 Publications · 25,681 Reads


17,313 Citations


## Publications


This chapter is a tutorial for the continuous conditional generative adversarial network (CcGAN) [8], the first generative model for image generation with continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels...

Recently, subsampling or refining images generated from unconditional generative adversarial networks (GANs) has been actively studied to improve the overall image quality. Unfortunately, these methods are often observed to be less effective or inefficient in handling conditional GANs (cGANs) -- conditioning on a class (aka class-conditional GANs) or a c...

This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (eg, class labels); conditioning on regression labels is mathe...
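The CcGAN idea of conditioning on a continuous label can be illustrated by its vicinal treatment of labels: training samples are selected (hard vicinity) or weighted (soft vicinity) by how close their regression labels are to the conditioning label. The sketch below is illustrative only; the parameter names `kappa` and `nu` and the exact weighting are assumptions, not the paper's definitive formulation.

```python
import math

def hard_vicinity(labels, target, kappa):
    """Indices of samples whose label lies in the hard vicinity
    [target - kappa, target + kappa] of the conditioning label."""
    return [i for i, y in enumerate(labels) if abs(y - target) <= kappa]

def soft_vicinity_weights(labels, target, nu):
    """Gaussian-decay weights exp(-nu * (y - target)^2): every sample
    contributes, but samples with distant labels are down-weighted."""
    return [math.exp(-nu * (y - target) ** 2) for y in labels]
```

With a hard vicinity, only near-label samples enter the discriminator loss for a given target; the soft version instead re-weights all samples smoothly.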

**Background and Objective:** In binary classification problems with a rare class of interest, there is relatively little information available for the rare class to build a model. On the other hand, the number of useful variables to develop a model for classification can be high-dimensional. For example, in drug discovery, there are usually a very few...

Knowledge distillation (KD) has been actively studied for image classification tasks in deep learning, aiming to improve the performance of a student model based on the knowledge from a teacher model. However, there have been very few efforts for applying KD in image regression with a scalar response, and there is no KD method applicable to both ta...
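A common way to frame distillation for a scalar-response regression task, shown here as a generic sketch rather than the specific loss proposed in this work, is a convex combination of a ground-truth error term and a teacher-matching term:

```python
def distill_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Generic regression-distillation loss: alpha weights the
    ground-truth MSE against the teacher-matching MSE."""
    mse_true = (student_pred - target) ** 2
    mse_teacher = (student_pred - teacher_pred) ** 2
    return alpha * mse_true + (1 - alpha) * mse_teacher
```

Setting `alpha=1` recovers plain supervised training; smaller values pull the student toward the teacher's predictions.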

This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is ma...

Modern methods often formulate the counting of cells from microscopic images as a regression problem and more or less rely on expensive, manually annotated training images (e.g., dot annotations indicating the centroids of cells or segmentation masks identifying the contours of cells). This work proposes a supervised learning framework based on cla...

Filtering out unrealistic images from trained generative adversarial networks (GANs) has attracted considerable attention recently. Two density ratio based subsampling methods---Discriminator Rejection Sampling (DRS) and Metropolis-Hastings GAN (MH-GAN)---were recently proposed, and their effectiveness in improving GANs was demonstrated on multiple...
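In its basic form, Discriminator Rejection Sampling keeps a generated sample with probability proportional to the density ratio p_data/p_gen, which a near-optimal discriminator's logit estimates as exp(logit). The sketch below simplifies the scheme heavily; the `gamma` shift for tuning acceptance rates is borrowed loosely from the DRS idea and the exact calibration details are omitted.

```python
import math
import random

def drs_accept(logit, max_logit, gamma=0.0, rng=random.random):
    """Accept a generated sample with probability
    min(1, exp(logit - max_logit - gamma)), where max_logit is the
    largest discriminator logit seen on a calibration batch, so the
    estimated density ratio is bounded into (0, 1]."""
    p = min(1.0, math.exp(logit - max_logit - gamma))
    return rng() < p
```

Samples the discriminator scores as realistic (logit near the calibration maximum) are almost always kept; low-scoring samples are rejected with high probability.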

A computer code can simulate a system's propagation of variation from random inputs to output measures of quality. Our aim here is to estimate a critical output tail probability or quantile without a large Monte Carlo experiment. Instead, we build a statistical surrogate for the input-output relationship with a modest number of evaluations and then...

The majority of computational methods for predicting toxicity of chemicals are typically based on ‘non-mechanistic’ cheminformatics solutions, relying on an arsenal of QSAR descriptors, often vaguely associated with chemical structures, and typically employing ‘black-box’ mathematical algorithms. Nonetheless, such machine learning models, while hav...

We propose an algorithm for a family of optimization problems where the objective can be decomposed as a sum of functions with monotonicity properties. The motivating problem is optimization of hyperparameters of machine learning algorithms, where we argue that the objective, validation error, can be decomposed as monotonic functions of the hyperpa...

Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" - a subset...

Gaussian processes are widely used in the analysis of data from a computer model. Ideally, the analysis will yield accurate predictions with correct coverage probabilities of credible intervals. In this paper, we first review several existing Bayesian implementations in the literature. We show that Bayesian approaches with squared-exponential corre...

An ensemble of models (EM), where each model is constructed on a diverse subset of feature variables, is proposed to rank rare class items ahead of majority class items in a highly unbalanced two-class problem. The proposed ensemble relies on an algorithm to group the feature variables into subsets where the variables in a subset work better togeth...

A quantitative-structure activity relationship (QSAR) is a model relating a specific biological response to the chemical structures of compounds. There are many descriptor sets available to characterize chemical structure, raising the question of how to choose among them or how to use all of them for training a QSAR model. Making efficient use of a...

Statistical methods based on a regression model plus a zero-mean Gaussian process (GP) have been widely used for predicting the output of a deterministic computer code. There are many suggestions in the literature for how to choose the regression component and how to model the correlation structure of the GP. This article argues that comprehensive,...
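The predictor these computer-experiment papers study has a simple closed form: the posterior mean of a GP is mu + r' R^{-1} (y - mu), with R the correlation matrix of the training inputs and r the correlations to the new point. Below is a deliberately minimal pure-Python sketch with a squared-exponential correlation; the constant mean `mu` and range parameter `theta` are placeholders that would normally be estimated from data.

```python
import math

def sq_exp(x1, x2, theta):
    """Squared-exponential correlation between two scalar inputs."""
    return math.exp(-theta * (x1 - x2) ** 2)

def solve(A, b):
    """Naive Gaussian elimination with partial pivoting (for clarity,
    not efficiency)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_predict(xs, ys, mu, theta, x_new):
    """Posterior mean of a constant-mean GP at x_new:
    mu + r' R^{-1} (y - mu)."""
    R = [[sq_exp(a, b, theta) for b in xs] for a in xs]
    alpha = solve(R, [y - mu for y in ys])
    r = [sq_exp(x_new, a, theta) for a in xs]
    return mu + sum(ri * ai for ri, ai in zip(r, alpha))
```

With no nugget term the predictor interpolates the training outputs exactly, matching the deterministic-code setting, and reverts to the mean `mu` far from the data.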

A computer code or simulator is a mathematical representation of a physical system, for example a set of differential equations. Running the code with given values of the vector of inputs, x, leads to an output y(x) or several such outputs. For instance, one application we use for illustration simulates the average tidal power, y, generated as a fu...

We show how a Gaussian Process (GP) can be used as a nonparametric regression model to fit experiment data that captures the relationship between the experiment response and the experiment factors. We illustrate the GP model analysis with a solar collector computer experiment. We also illustrate how physical experiment data can be analyzed using a...

We propose a statistical air quality model that simultaneously performs two major tasks: (1) provides a computationally inexpensive means of modelling and forecasting complex space-time air pollution processes, (2) enables an informative and statistically defensible approach for evaluation of air quality models. Rather than working with raw data, w...

We have proposed an ensemble method which aggregates over clusters of predictor variables. We form the clusters (we call phalanxes) by joining variables together. The variables in a phalanx are good to put together, and the variables in different phalanxes are good to ensemble. We then build our ensemble of phalanxes (EPX) by growing a random fores...

Millions of compounds are available as potential drug candidates. High throughput screening (HTS) is widely used in drug discovery to assay compounds for a particular biological activity. A common approach is to build a classification model using a smaller sample of assay data to predict the activity of unscreened compounds and hence select further...

Cross-validation (CV) is widely used for tuning a model with respect to user-selected parameters and for selecting a "best" model. For example, the method of $k$-nearest neighbors requires the user to choose $k$, the number of neighbors, and a neural network has several tuning parameters controlling the network complexity. Once such parameters are...
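The $k$-nearest-neighbors example can be made concrete with a minimal sketch of $k$-fold CV over candidate values of the number of neighbors; the 1-D data layout, fold assignment by index stride, and candidate grid here are all illustrative choices, not the paper's procedure.

```python
def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    neigh = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = {}
    for _, label in neigh:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def cv_error(data, k, folds=5):
    """Cross-validated misclassification rate: each fold is held out
    in turn and predicted from the remaining folds."""
    errs = 0
    for f in range(folds):
        test = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        errs += sum(knn_predict(train, x, k) != y for x, y in test)
    return errs / len(data)

def choose_k(data, candidates=(1, 3, 5)):
    """Pick the neighbor count with the smallest CV error."""
    return min(candidates, key=lambda k: cv_error(data, k))
```

The same loop applies to any tuning parameter: refit on the training folds, score on the held-out fold, and pick the setting with the best aggregate score.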

The use of single nucleotide polymorphisms (SNPs) has become increasingly important for a wide range of genetic studies. A high-throughput genotyping technology usually involves a statistical algorithm for automatic (non-manual) genotype calling. Most calling algorithms in the literature, using methods such as k-means and mixture-models, rely...

ChemModLab, written by the ECCR @ NCSU consortium under NIH support, is a toolbox for fitting and assessing quantitative structure-activity relationships (QSARs). Its elements are: a cheminformatic front end used to supply molecular descriptors for use in modeling; a set of methods for fitting models; and methods for validating the resulting model....

The authors propose a profile likelihood approach to linear clustering which explores potential linear clusters in a data set. For each linear cluster, an errors-in-variables model is assumed. The optimization of the derived profile likelihood can be achieved by an EM algorithm. Its asymptotic properties and its relationships with several existing...

We produce reasons and evidence supporting the informal rule that the number of runs for an effective initial computer experiment should be about 10 times the input dimension. Our arguments quantify two key characteristics of computer codes that affect the sample size required for a desired level of accuracy when approximating the code via a Gaussi...

To build a predictor, the output of a deterministic computer model or “code” is often treated as a realization of a stochastic process indexed by the code's input variables. The authors consider an asymptotic form of the Gaussian correlation function for the stochastic process where the correlation tends to unity. They show that the limiting best l...

The inverse of the Fisher information matrix is commonly used as an approximation for the covariance matrix of maximum-likelihood estimators. We show via three examples that for the covariance parameters of Gaussian stochastic processes under infill asymptotics, the covariance matrix of the limiting distribution of their maximum-likelihood estimato...

Sequential screening has become increasingly popular in drug discovery. It iteratively builds quantitative structure-activity relationship (QSAR) models from successive high-throughput screens, making screening more effective and efficient. We compare cluster structure-activity relationship analysis (CSARA) as a QSAR method with recursive partition...

Computer models to simulate physical phenomena are now widely available in engineering and science. Before relying on a computer model, a natural first step is often to compare its output with physical or field data, to assess whether the computer model reliably represents the real world. Field data, when available, can also be used to calibrate or...

Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurring when a single nucleotide--adenine (A), thymine (T), cytosine (C) or guanine (G)--is altered. Arguably, SNPs account for more than 90% of human genetic variation. Our laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signal...

An experiment involving a complex computer model or code may have tens or even hundreds of input variables and, hence, the identification of the more important variables (screening) is often crucial. Methods are described for decomposing a complex input-output relationship into effects. Effects are more easily understood because each is due to only one o...

This paper deals with the unconstrained global optimization problem: minimize $f(x)$, where $x = (x_1, \ldots, x_k)$. This includes the class of problems with simple constraints like $a_i \le x_i \le b_i$, since these problems can be transformed to unconstrained global optimization problems. Throughout we assume without loss of generality that the extremum of...

The concepts of diversity and similarity of molecules are widely used in quantitative methods for designing (selecting) a representative set of molecules and for analyzing the relationship between chemical structure and biological activity. We review methods and algorithms for design of a diverse set of molecules in the chemical space using cluster...

Initial leads for drug development often originate from high-throughput screening (HTS), where hundreds of thousands of compounds are tested for biological activity. As the number of both targets for screening and compounds available for screening increase, there is a need to consider methods for making this process more efficient. One approach is...

In screening for drug discovery, chemists often select a large subset of molecules from a very large database (e.g., select 1,000 molecules from 100,000). To generate diverse leads for drug optimization, highly active compounds in several structurally different chemical classes are sought. Molecules can be characterized by numerical descriptors, an...
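A standard way to pick a structurally diverse subset from descriptor space, offered here as a generic sketch rather than this paper's specific design method, is greedy max-min selection: repeatedly add the candidate whose minimum distance to the already-selected molecules is largest.

```python
def maxmin_select(points, m, dist):
    """Greedy max-min diversity selection: seed with the first point,
    then repeatedly add the candidate farthest (in min-distance terms)
    from the current selection. Returns indices into `points`."""
    chosen = [0]
    while len(chosen) < m:
        best = max((i for i in range(len(points)) if i not in chosen),
                   key=lambda i: min(dist(points[i], points[j]) for j in chosen))
        chosen.append(best)
    return chosen
```

With molecules encoded as descriptor vectors, `dist` would be a descriptor-space metric (Euclidean, Tanimoto-based, etc.); the 1-D example in the test is purely for illustration.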

In this paper, for instance, biological activity in protecting human cells from HIV infection was assayed for about 30,000 compounds. In order to find the most promising drug candidate, biochemists would like to examine as many compounds as possible. However, it is impractical to test all of the huge number of compounds potentially available. Research...

A problem often arising in engineering applications of computer models is to determine the importance of each data item in the large pool of required input factors. This paper explores a statistical approach for investigating factor sensitivities. The methodology is demonstrated with the HDM-III highway life-cycle cost analysis model. Specifically,...

The effects of certain chemical additives in maintaining a high level of activity in protein constructs during storage are investigated. We use a semiparametric regression technique to model the effects of the additives on protein activity. The model is extended to handle categorical explanatory variables. On the basis of the available data, the imp...

Sampling and prediction strategies relevant at the planning stage of the cleanup of environmental hazards are discussed. Sampling designs and models are compared using an extensive set of data on dioxin contamination at Piazza Road, Missouri. To meet the assumptions of the statistical model, such data are often transformed by taking logarithms. Pre...

In many engineering optimization problems, the number of function evaluations is severely limited by time or cost. These problems pose a special challenge to the field of global optimization, since existing methods often require more function evaluations than can be comfortably afforded. One way to address this challenge is to fit response surfaces...

D-optimality is one of the most commonly used design criteria for linear regression models. In industrial experiments binary or count data often arise, for example defective/nondefective or number of defects. For such data Generalized Linear Models (GLMs) are appropriate. An analogous D-optimality design criterion can be developed using the asympto...
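For a GLM the D-criterion is det(X'WX), where the weights depend on the unknown parameters, so the criterion is evaluated at a guessed parameter value (a "locally optimal" design). A minimal sketch for a logistic model with an intercept and one factor; the two-parameter setup and the guessed `beta` are illustrative assumptions.

```python
import math

def logistic_weight(eta):
    """GLM information weight p(1-p) for a logistic link at linear
    predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p * (1.0 - p)

def d_criterion(design, beta):
    """det(X' W X) for a logistic model with design matrix rows (1, x),
    evaluated at the guessed parameter beta = (b0, b1)."""
    b0, b1 = beta
    m00 = m01 = m11 = 0.0
    for x in design:
        w = logistic_weight(b0 + b1 * x)
        m00 += w          # sum w
        m01 += w * x      # sum w*x
        m11 += w * x * x  # sum w*x^2
    return m00 * m11 - m01 * m01
```

A design that puts all runs at one level makes the information matrix singular (criterion zero), while spreading runs across levels with appreciable weights makes it large, which is what a D-optimal search maximizes.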

In electrical engineering, circuit designs are now often optimized via circuit simulation computer models. Typically, many response variables characterize the circuit’s performance. Each response is a function of many input variables, including factors that can be set in the engineering design and noise factors representing manufacturing conditions...

Ozone in the planetary boundary layer of the troposphere is considered harmful to plants and human health. Surface ozone levels are determined by the strengths of sources and precursor emissions and by meteorological conditions. Therefore, assessing ozone trends is complicated by meteorological variability. Ozone data in the Chicago area over the p...

In this study we have employed statistical methods to efficiently design experiments and analyze output of an ocean general circulation model that uses an isopycnal mixing parameterization. Full ranges of seven inputs are explored using 51 numerical experiments. Fifteen of the cases fail to reach satisfactory equilibria. These are attributable to n...

A dynamic-thermodynamic sea ice model is used to illustrate a sensitivity evaluation strategy in which a statistical model is fit to the output of the ice model. The statistical model response, evaluated in terms of certain metrics or integrated features of the ice model output, is a function of a selected set of d(=13) prescribed parameters of the...

The authors describe a sequential strategy for designing manufacturable integrated circuits using available CAD tools. Optimizing the performance of complex designs in the presence of unwanted parameter variations can take a prohibitively large number of computer runs. These methods overcome this complexity by combining sequential experimentation w...

Many scientific phenomena are now investigated by complex computer models or codes. Given the input values, the code produces one or more outputs via a complex mathematical model. Often the code is expensive to run, and it may be necessary to build a computationally cheaper predictor to enable, for example, optimization of the inputs. If there are...

A major bottleneck in the design and parametric yield optimization of CMOS integrated circuits lies in the high cost of the circuit simulations. One method that significantly reduces the simulation cost is to approximate the circuit performances by fitted quadratic models and then use these computationally inexpensive models to optimize the paramet...

Many products are now routinely designed with the aid of computer models. Given the inputs-designable engineering parameters and parameters representing manufacturing-process conditions-the model generates the product's quality characteristics. The quality improvement problem is to choose the designable engineering parameters such that the quality...

A general method for constructing permutation tests for various experimental designs follows from invariance and sufficiency. In this framework, randomization (or rerandomization) tests are just special cases of permutation tests. The methodology extends the applicability of permutation tests: An example demonstrates a test for an interaction effec...

Taguchi's off-line quality control methods for product and process improvement emphasize experiments to design quality “into” products and processes. In Very Large Scale Integrated (VLSI) circuit design, the application of interest here, computer modeling is invariably quicker and cheaper than physical experimentation. Our approach models quality c...

A method for parametric yield optimization which significantly reduces the simulation cost is proposed. The method assumes that the circuit performances ultimately determining yield can be approximated by computationally inexpensive functions of the inputs to the circuit simulator. These inputs are the designable parameters, the uncontrollable stat...

A computer experiment generates observations by running a computer model at inputs x and recording the output (response) Y. Prediction of the response Y to an untried input is treated by modeling the systematic departure of Y from a linear model as a realization of a stochastic process. For given data (selected inputs and the computed responses), b...

Many scientific phenomena are now investigated by complex computer models or codes. A computer experiment is a number of runs of the code with various inputs. A feature of many computer experiments is that the output is deterministic—rerunning the code with the same inputs gives identical observations. Often, the codes are computationally expensive...

A branch-and-bound algorithm is described for finding the permutation (randomization) P value in matched-pairs designs without enumeration of the entire reference distribution. It is not restricted to test statistics that are linear in functions of the observations, and permutation tests based on trimmed means are investigated. We apply the algorit...

When a rerandomization or permutation test is applied to a matched-pairs design, the observed mean difference is typically compared with the null distribution of means that would have occurred under all possible randomizations. It is shown that rerandomizing the median instead of the mean transforms a formidable computational problem into one amena...
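The reference distribution described here can be sketched by brute force: under the null, each pair's difference is equally likely to carry either sign, so the exact one-sided p-value enumerates all 2^n sign flips. This naive enumeration (illustrative only, and exponential in n) is precisely the cost that the median trick and branch-and-bound approaches avoid.

```python
from itertools import product

def perm_pvalue(diffs):
    """Exact one-sided permutation p-value for matched pairs: the
    fraction of the 2^n sign assignments whose mean difference is at
    least the observed mean difference."""
    n = len(diffs)
    obs = sum(diffs) / n
    count = 0
    for signs in product((1, -1), repeat=n):
        if sum(s * d for s, d in zip(signs, diffs)) / n >= obs - 1e-12:
            count += 1
    return count / 2 ** n
```

Because the statistic is a mean, every subset of flipped pairs changes it, so nothing short of clever pruning avoids the full 2^n enumeration.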

In experiments for response estimation, algorithms developed for constructing D-optimal exact (integer replication) designs may be inappropriate when the number of observations is not large relative to the number of parameters. This article generalizes Mitchell's DETMAX algorithm to an arbitrary design criterion and describes an efficient implement...

In estimating a response surface over a design region of interest, mean squared error can arise both from sampling variance and bias introduced by model inadequacy. The criterion adopted here for experimental design attempts to protect against bias resulting from a large class of deviations from the assumed model. Two algorithms are proposed for th...

A theory of algorithmic complexity from computer science allows one to examine the properties of algorithms for the solution of computational problems and, in particular, the relationship between the size of an instance of the generic problem and the time required for computation. The theory identifies certain problems as NP-complete or NP-hard, tw...

This article presents a branch-and-bound algorithm that constructs a catalog of all D-optimal n-point designs for a specified design region, linear model, and number of observations, n. While the primary design criterion is D optimality, the algorithm may also be used to find designs performing well by other secondary criteria, if a small sacrifice...

We present methodology and algorithms for classification of single nucleotide polymorphism (SNP) genotypes. A mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, different sets of explanatory variables are generated by deliberate redundancy in the genotyping chemistry, and...

Transformations can help small sample likelihood/Bayesian inference by improving the approximate normality of the likelihood/posterior. In this article we investigate when one can expect an improvement for a one-dimensional random function (Gaussian process) model. The log transformation of the range parameter is compared with an alternative (the...

We propose a profile likelihood approach to linear clustering which explores potential linear clusters in a data set. For each linear cluster, an errors-in-variables model is assumed. The optimization of the derived profile likelihood can be achieved by an EM algorithm. Its asymptotic properties and its relationships with several existing clusteri...
