William J. Welch
University of British Columbia - Vancouver | UBC · Department of Statistics

About

81 Publications · 25,681 Reads
17,313 Citations

Publications (81)
Chapter
This chapter is a tutorial for the continuous conditional generative adversarial network (CcGAN) [8], the first generative model for image generation with continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels...
Preprint
Full-text available
Recently, subsampling or refining images generated from unconditional generative adversarial networks (GANs) has been actively studied to improve the overall image quality. Unfortunately, these methods are often observed to be less effective or inefficient in handling conditional GANs (cGANs) -- conditioning on a class (aka class-conditional GANs) or a c...
Preprint
Full-text available
This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is mathe...
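To make the conditioning idea concrete, here is a hypothetical PyTorch-style toy sketch (not the authors' code and not the CcGAN architecture itself): a continuous scalar label is passed through a small embedding network and concatenated with the noise vector before generation. The published CcGAN uses more elaborate label-input and loss constructions than this toy.

```python
# Hypothetical toy sketch (not the CcGAN architecture): condition a generator on a
# continuous scalar label by embedding the label and concatenating it with the noise.
import torch
import torch.nn as nn

class ToyContinuousConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=64, label_embed_dim=16, out_dim=784):
        super().__init__()
        # Map the scalar regression label to a dense embedding.
        self.label_embed = nn.Sequential(nn.Linear(1, label_embed_dim), nn.ReLU())
        self.net = nn.Sequential(
            nn.Linear(noise_dim + label_embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
            nn.Tanh(),
        )

    def forward(self, z, y):
        # z: (batch, noise_dim) noise; y: (batch, 1) continuous labels, e.g. normalized ages.
        return self.net(torch.cat([z, self.label_embed(y)], dim=1))

G = ToyContinuousConditionalGenerator()
fake = G(torch.randn(8, 64), torch.rand(8, 1))   # (8, 784), e.g. flattened 28x28 images
```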
Article
Background and Objective: In binary classification problems with a rare class of interest, there is relatively little information available for the rare class to build a model. On the other hand, the pool of variables potentially useful for classification can be high-dimensional. For example, in drug discovery, there are usually a very few...
Preprint
Full-text available
Knowledge distillation (KD) has been actively studied for image classification tasks in deep learning, aiming to improve the performance of a student model based on the knowledge from a teacher model. However, there have been very few efforts for applying KD in image regression with a scalar response, and there is no KD method applicable to both ta...
Conference Paper
Full-text available
This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is ma...
Chapter
Modern methods often formulate the counting of cells from microscopic images as a regression problem and more or less rely on expensive, manually annotated training images (e.g., dot annotations indicating the centroids of cells or segmentation masks identifying the contours of cells). This work proposes a supervised learning framework based on cla...
Preprint
Full-text available
Modern methods often formulate the counting of cells from microscopic images as a regression problem and more or less rely on expensive, manually annotated training images (e.g., dot annotations indicating the centroids of cells or segmentation masks identifying the contours of cells). This work proposes a supervised learning framework based on cla...
Article
Full-text available
Filtering out unrealistic images from trained generative adversarial networks (GANs) has attracted considerable attention recently. Two density ratio based subsampling methods---Discriminator Rejection Sampling (DRS) and Metropolis-Hastings GAN (MH-GAN)---were recently proposed, and their effectiveness in improving GANs was demonstrated on multiple...
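As background, density-ratio subsampling keeps a generated sample with probability proportional to an estimated ratio of the data density to the generator density. Below is a hypothetical NumPy sketch of plain rejection with such ratios; the published DRS and MH-GAN procedures add calibration and, for MH-GAN, a Markov chain, which are omitted here.

```python
# Hypothetical sketch of density-ratio rejection subsampling: keep a generated sample x
# with probability r(x) / M, where r(x) estimates p_data(x) / p_gen(x) and M bounds r.
import numpy as np

def rejection_subsample(samples, density_ratios, rng=None):
    rng = rng or np.random.default_rng(0)
    M = density_ratios.max()                      # crude in-batch bound on the ratio
    keep = rng.random(len(samples)) < density_ratios / M
    return samples[keep]

samples = np.random.randn(1000, 2)                # stand-in for generated images
ratios = np.exp(0.5 * np.random.randn(1000))      # stand-in for discriminator-based ratio estimates
print(len(rejection_subsample(samples, ratios)), "of 1000 samples kept")
```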
Preprint
Full-text available
A computer code can simulate a system's propagation of variation from random inputs to output measures of quality. Our aim here is to estimate a critical output tail probability or quantile without a large Monte Carlo experiment. Instead, we build a statistical surrogate for the input-output relationship with a modest number of evaluations and then...
Article
The majority of computational methods for predicting toxicity of chemicals are typically based on ‘non-mechanistic’ cheminformatics solutions, relying on an arsenal of QSAR descriptors, often vaguely associated with chemical structures, and typically employing ‘black-box’ mathematical algorithms. Nonetheless, such machine learning models, while hav...
Article
We propose an algorithm for a family of optimization problems where the objective can be decomposed as a sum of functions with monotonicity properties. The motivating problem is optimization of hyperparameters of machine learning algorithms, where we argue that the objective, validation error, can be decomposed as monotonic functions of the hyperpa...
Article
Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" - a subset...
Article
Gaussian processes are widely used in the analysis of data from a computer model. Ideally, the analysis will yield accurate predictions with correct coverage probabilities of credible intervals. In this paper, we first review several existing Bayesian implementations in the literature. We show that Bayesian approaches with squared-exponential corre...
Article
Full-text available
An ensemble of models (EM), where each model is constructed on a diverse subset of feature variables, is proposed to rank rare class items ahead of majority class items in a highly unbalanced two class problem. The proposed ensemble relies on an algorithm to group the feature variables into subsets where the variables in a subset work better togeth...
Article
Full-text available
A quantitative-structure activity relationship (QSAR) is a model relating a specific biological response to the chemical structures of compounds. There are many descriptor sets available to characterize chemical structure, raising the question of how to choose among them or how to use all of them for training a QSAR model. Making efficient use of a...
Article
Full-text available
Statistical methods based on a regression model plus a zero-mean Gaussian process (GP) have been widely used for predicting the output of a deterministic computer code. There are many suggestions in the literature for how to choose the regression component and how to model the correlation structure of the GP. This article argues that comprehensive,...
Article
Full-text available
A computer code or simulator is a mathematical representation of a physical system, for example a set of differential equations. Running the code with given values of the vector of inputs, x, leads to an output y(x) or several such outputs. For instance, one application we use for illustration simulates the average tidal power, y, generated as a fu...
Article
We show how a Gaussian Process (GP) can be used as a nonparametric regression model to fit experiment data that captures the relationship between the experiment response and the experiment factors. We illustrate the GP model analysis with a solar collector computer experiment. We also illustrate how physical experiment data can be analyzed using a...
Chapter
We propose a statistical air quality model that simultaneously performs two major tasks: (1) provides a computationally inexpensive means of modelling and forecasting complex space-time air pollution processes, (2) enables an informative and statistically defensible approach for evaluation of air quality models. Rather than working with raw data, w...
Article
Full-text available
We have proposed an ensemble method which aggregates over clusters of predictor variables. We form the clusters (we call phalanxes) by joining variables together. The variables in a phalanx are good to put together, and the variables in different phalanxes are good to ensemble. We then build our ensemble of phalanxes (EPX) by growing a random fores...
Article
Millions of compounds are available as potential drug candidates. High throughput screening (HTS) is widely used in drug discovery to assay compounds for a particular biological activity. A common approach is to build a classification model using a smaller sample of assay data to predict the activity of unscreened compounds and hence select further...
Article
Full-text available
Cross-validation (CV) is widely used for tuning a model with respect to user-selected parameters and for selecting a "best" model. For example, the method of $k$-nearest neighbors requires the user to choose $k$, the number of neighbors, and a neural network has several tuning parameters controlling the network complexity. Once such parameters are...
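As a concrete instance of that tuning task, a short scikit-learn sketch of choosing k for k-nearest neighbours by 10-fold cross-validation is given below (illustrative only; the dataset and candidate values are arbitrary and not taken from the paper).

```python
# Illustrative only: choosing k for k-nearest neighbours by 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
cv_accuracy = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
               for k in (1, 3, 5, 7, 9, 15, 25)}
best_k = max(cv_accuracy, key=cv_accuracy.get)
print(best_k, round(cv_accuracy[best_k], 3))
```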
Conference Paper
The use of single nucleotide polymorphisms (SNPs) has become increasingly important for a wide range of genetic studies. A high-throughput genotyping technology usually involves a statistical algorithm for automatic (non-manual) genotype calling. Most calling algorithms in the literature, using methods such as k-means and mixture-models, rely...
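For illustration only (not the algorithm studied in the paper), a mixture-model genotype caller can be sketched as clustering two-channel allele intensities into three groups; the simulated intensities below are hypothetical.

```python
# Hypothetical illustration of mixture-model genotype calling: cluster two-channel
# allele intensities into three groups standing in for the genotypes AA, AB, BB.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
aa = rng.normal([5.0, 1.0], 0.3, size=(100, 2))   # simulated allele-1 / allele-2 intensities
ab = rng.normal([3.0, 3.0], 0.3, size=(100, 2))
bb = rng.normal([1.0, 5.0], 0.3, size=(100, 2))
X = np.vstack([aa, ab, bb])

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
calls = gm.predict(X)                             # cluster labels play the role of genotype calls
confidence = gm.predict_proba(X).max(axis=1)      # low values could be flagged as "no call"
print(np.bincount(calls), round(confidence.min(), 3))
```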
Article
Full-text available
ChemModLab, written by the ECCR @ NCSU consortium under NIH support, is a toolbox for fitting and assessing quantitative structure-activity relationships (QSARs). Its elements are: a cheminformatic front end used to supply molecular descriptors for use in modeling; a set of methods for fitting models; and methods for validating the resulting model....
Article
The authors propose a profile likelihood approach to linear clustering which explores potential linear clusters in a data set. For each linear cluster, an errors-in-variables model is assumed. The optimization of the derived profile likelihood can be achieved by an EM algorithm. Its asymptotic properties and its relationships with several existing...
Article
Full-text available
We produce reasons and evidence supporting the informal rule that the number of runs for an effective initial computer experiment should be about 10 times the input dimension. Our arguments quantify two key characteristics of computer codes that affect the sample size required for a desired level of accuracy when approximating the code via a Gaussi...
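A minimal sketch of the setting, assuming a toy stand-in for the computer code and scikit-learn's Gaussian-process regressor (not software from the paper): the design size follows the n = 10d rule of thumb.

```python
# Illustrative sketch: a Gaussian-process surrogate fitted to n = 10*d runs of a toy "code".
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def code(X):                          # stand-in for an expensive deterministic simulator
    return np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

d = 2
n = 10 * d                            # the "about 10 times the input dimension" rule of thumb
X_train = qmc.LatinHypercube(d=d, seed=0).random(n)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=[0.5] * d)).fit(X_train, code(X_train))

X_test = qmc.LatinHypercube(d=d, seed=1).random(200)
rmse = np.sqrt(np.mean((gp.predict(X_test) - code(X_test)) ** 2))
print(round(rmse, 4))
```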
Article
To build a predictor, the output of a deterministic computer model or “code” is often treated as a realization of a stochastic process indexed by the code's input variables. The authors consider an asymptotic form of the Gaussian correlation function for the stochastic process where the correlation tends to unity. They show that the limiting best l...
Article
The inverse of the Fisher information matrix is commonly used as an approximation for the covariance matrix of maximum-likelihood estimators. We show via three examples that for the covariance parameters of Gaussian stochastic processes under infill asymptotics, the covariance matrix of the limiting distribution of their maximum-likelihood estimato...
Article
Sequential screening has become increasingly popular in drug discovery. It iteratively builds quantitative structure-activity relationship (QSAR) models from successive high-throughput screens, making screening more effective and efficient. We compare cluster structure-activity relationship analysis (CSARA) as a QSAR method with recursive partition...
Article
Full-text available
Computer models to simulate physical phenomena are now widely available in engineering and science. Before relying on a computer model, a natural first step is often to compare its output with physical or field data, to assess whether the computer model reliably represents the real world. Field data, when available, can also be used to calibrate or...
Article
Full-text available
Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurring when a single nucleotide--adenine (A), thymine (T), cytosine (C) or guanine (G)--is altered. Arguably, SNPs account for more than 90% of human genetic variation. Our laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signal...
Article
Full-text available
An experiment involving a complex computer model or code may have tens or even hundreds of input variables and, hence, the identification of the more important variables (screening) is often crucial. Methods are described for decomposing a complex input-output relationship into effects. Effects are more easily understood because each is due to only one o...
Article
Full-text available
This paper deals with the unconstrained global optimization problem: minimize f(x) where x = (x_1, ..., x_k). This includes the class of problems with simple constraints like a_i ≤ x_i ≤ b_i, since these problems can be transformed to unconstrained global optimization problems. Throughout we assume without loss of generality that the extremum of...
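For example, a box constraint a_i ≤ x_i ≤ b_i can be absorbed by a smooth one-to-one map from an unconstrained variable; the sketch below is a generic illustration of that idea (the particular logistic map is an assumption, not necessarily the transformation used in the paper).

```python
# Hypothetical sketch: removing box constraints a_i <= x_i <= b_i with a smooth
# one-to-one map from an unconstrained variable z in R^k to x in (a, b).
import numpy as np
from scipy.optimize import minimize

a = np.array([0.0, -1.0])
b = np.array([2.0, 1.0])

def to_box(z):                        # logistic map: R^k -> the open box (a, b)
    return a + (b - a) / (1.0 + np.exp(-z))

def f(x):                             # toy objective on the original, constrained scale
    return (x[0] - 1.5) ** 2 + (x[1] + 0.5) ** 2

res = minimize(lambda z: f(to_box(z)), x0=np.zeros(2))   # unconstrained optimization in z
print(to_box(res.x).round(3))         # approximately [1.5, -0.5], inside the box
```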
Article
The concepts of diversity and similarity of molecules are widely used in quantitative methods for designing (selecting) a representative set of molecules and for analyzing the relationship between chemical structure and biological activity. We review methods and algorithms for design of a diverse set of molecules in the chemical space using cluster...
Article
Initial leads for drug development often originate from high-throughput screening (HTS), where hundreds of thousands of compounds are tested for biological activity. As the number of both targets for screening and compounds available for screening increase, there is a need to consider methods for making this process more efficient. One approach is...
Article
Full-text available
In screening for drug discovery, chemists often select a large subset of molecules from a very large database (e.g., select 1,000 molecules from 100,000). To generate diverse leads for drug optimization, highly active compounds in several structurally different chemical classes are sought. Molecules can be characterized by numerical descriptors, an...
Article
this paper, for instance, biological activity in protecting human cells from HIV infection was assayed for about 30,000 compounds. In order to find the most promising drug candidate, biochemists would like to examine as many compounds as possible. However, it is impractical to test all of the huge number of compounds potentially available. Research...
Article
A problem often arising in engineering applications of computer models is to determine the importance of each data item in the large pool of required input factors. This paper explores a statistical approach for investigating factor sensitivities. The methodology is demonstrated with the HDM-III highway life-cycle cost analysis model. Specifically,...
Article
The effects of certain chemical additives in maintaining a high level of activity in protein constructs during storage are investigated. We use a semiparametric regression technique to model the effects of the additives on protein activity. The model is extended to handle categorical explanatory variables. On the basis of the available data, the imp...
Article
Sampling and prediction strategies relevant at the planning stage of the cleanup of environmental hazards are discussed. Sampling designs and models are compared using an extensive set of data on dioxin contamination at Piazza Road, Missouri. To meet the assumptions of the statistical model, such data are often transformed by taking logarithms. Pre...
Article
Full-text available
In many engineering optimization problems, the number of function evaluations is severely limited by time or cost. These problems pose a special challenge to the field of global optimization, since existing methods often require more function evaluations than can be comfortably afforded. One way to address this challenge is to fit response surfaces...
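One widely used way to exploit such a response surface is to pick the next evaluation by expected improvement, computed from the surrogate's predictive mean and standard deviation; the sketch below illustrates that criterion only and is not the paper's implementation.

```python
# Illustrative sketch: expected improvement (EI), a common criterion for choosing the
# next run when minimizing an expensive function through a fitted response surface.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI at candidate points with predictive mean mu and standard deviation sigma,
    given the smallest response f_best observed so far; larger EI = more promising."""
    sigma = np.maximum(sigma, 1e-12)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.2, 0.5, -0.1])       # surrogate predictions at three candidate inputs
sigma = np.array([0.05, 0.4, 0.3])    # predictive uncertainty at those candidates
print(expected_improvement(mu, sigma, f_best=0.0).round(4))
```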
Article
Full-text available
D-optimality is one of the most commonly used design criteria for linear regression models. In industrial experiments binary or count data often arise, for example defective/nondefective or number of defects. For such data Generalized Linear Models (GLMs) are appropriate. An analogous D-optimality design criterion can be developed using the asympto...
Article
In electrical engineering, circuit designs are now often optimized via circuit simulation computer models. Typically, many response variables characterize the circuit’s performance. Each response is a function of many input variables, including factors that can be set in the engineering design and noise factors representing manufacturing conditions...
Article
Ozone in the planetary boundary layer of the troposphere is considered harmful to plants and human health. Surface ozone levels are determined by the strengths of sources and precursor emissions and by meteorological conditions. Therefore, assessing ozone trends is complicated by meteorological variability. Ozone data in the Chicago area over the p...
Article
Full-text available
In this study we have employed statistical methods to efficiently design experiments and analyze output of an ocean general circulation model that uses an isopycnal mixing parameterization. Full ranges of seven inputs are explored using 51 numerical experiments. Fifteen of the cases fail to reach satisfactory equilibria. These are attributable to n...
Article
A dynamic-thermodynamic sea ice model is used to illustrate a sensitivity evaluation strategy in which a statistical model is fit to the output of the ice model. The statistical model response, evaluated in terms of certain metrics or integrated features of the ice model output, is a function of a selected set of d(=13) prescribed parameters of the...
Article
The authors describe a sequential strategy for designing manufacturable integrated circuits using available CAD tools. Optimizing the performance of complex designs in the presence of unwanted parameter variations can take a prohibitively large number of computer runs. These methods overcome this complexity by combining sequential experimentation w...
Article
Many scientific phenomena are now investigated by complex computer models or codes. Given the input values, the code produces one or more outputs via a complex mathematical model. Often the code is expensive to run, and it may be necessary to build a computationally cheaper predictor to enable, for example, optimization of the inputs. If there are...
Article
A major bottleneck in the design and parametric yield optimization of CMOS integrated circuits lies in the high cost of the circuit simulations. One method that significantly reduces the simulation cost is to approximate the circuit performances by fitted quadratic models and then use these computationally inexpensive models to optimize the paramet...
Article
Many products are now routinely designed with the aid of computer models. Given the inputs (designable engineering parameters and parameters representing manufacturing-process conditions), the model generates the product's quality characteristics. The quality improvement problem is to choose the designable engineering parameters such that the quality...
Article
A general method for constructing permutation tests for various experimental designs follows from invariance and sufficiency. In this framework, randomization (or rerandomization) tests are just special cases of permutation tests. The methodology extends the applicability of permutation tests: An example demonstrates a test for an interaction effec...
Article
Full-text available
Taguchi's off-line quality control methods for product and process improvement emphasize experiments to design quality “into” products and processes. In Very Large Scale Integrated (VLSI) circuit design, the application of interest here, computer modeling is invariably quicker and cheaper than physical experimentation. Our approach models quality c...
Conference Paper
A method for parametric yield optimization which significantly reduces the simulation cost is proposed. The method assumes that the circuit performances ultimately determining yield can be approximated by computationally inexpensive functions of the inputs to the circuit simulator. These inputs are the designable parameters, the uncontrollable stat...
Article
Full-text available
A computer experiment generates observations by running a computer model at inputs x and recording the output (response) Y. Prediction of the response Y to an untried input is treated by modeling the systematic departure of Y from a linear model as a realization of a stochastic process. For given data (selected inputs and the computed responses), b...
Article
Full-text available
Many scientific phenomena are now investigated by complex computer models or codes. A computer experiment is a number of runs of the code with various inputs. A feature of many computer experiments is that the output is deterministic—rerunning the code with the same inputs gives identical observations. Often, the codes are computationally expensive...
Article
Full-text available
A branch-and-bound algorithm is described for finding the permutation (randomization) P value in matched-pairs designs without enumeration of the entire reference distribution. It is not restricted to test statistics that are linear in functions of the observations, and permutation tests based on trimmed means are investigated. We apply the algorit...
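For context, the reference distribution in a matched-pairs permutation test comes from flipping the sign of each within-pair difference; the sketch below enumerates all 2^n sign assignments directly, which is exactly the brute-force enumeration a branch-and-bound scheme is designed to avoid (the data are made up).

```python
# Illustrative sketch: exact matched-pairs permutation (sign-flip) test for the mean,
# enumerating all 2^n sign assignments, i.e. the enumeration branch-and-bound avoids.
from itertools import product
import numpy as np

diffs = np.array([0.8, 1.2, -0.3, 0.9, 1.5, 0.4, 0.7, -0.1])   # made-up within-pair differences
observed = diffs.mean()

ref = [np.mean(np.array(signs) * np.abs(diffs))                 # mean under each sign flip
       for signs in product([-1, 1], repeat=len(diffs))]
p_value = np.mean(np.abs(ref) >= abs(observed))                 # two-sided permutation P value
print(round(p_value, 4))
```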
Article
Full-text available
When a rerandomization or permutation test is applied to a matched-pairs design, the observed mean difference is typically compared with the null distribution of means that would have occurred under all possible randomizations. It is shown that rerandomizing the median instead of the mean transforms a formidable computational problem into one amena...
Article
Full-text available
In experiments for response estimation, algorithms developed for constructing D-optimal exact (integer replication) designs may be inappropriate when the number of observations is not large relative to the number of parameters. This article generalizes Mitchell's DETMAX algorithm to an arbitrary design criterion and describes an efficient implement...
Article
In estimating a response surface over a design region of interest, mean squared error can arise both from sampling variance and bias introduced by model inadequacy. The criterion adopted here for experimental design attempts to protect against bias resulting from a large class of deviations from the assumed model. Two algorithms are proposed for th...
Article
Full-text available
A theory of algorithmic complexity from computer science allows one to examine the properties of algorithms for the solution of computational problems and, in particular, the relationship between the size of an instance of the generic problem and the time required for computation. The theory identifies certain problems as NP-complete or NP-hard, tw...
Article
This article presents a branch-and-bound algorithm that constructs a catalog of all D-optimal n-point designs for a specified design region, linear model, and number of observations, n. While the primary design criterion is D-optimality, the algorithm may also be used to find designs performing well by other secondary criteria, if a small sacrifice...
Article
We present methodology and algorithms for classification of single nucleotide polymorphism (SNP) genotypes. A mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, different sets of explanatory variables are generated by deliberate redundancy in the genotyping chemistry, and...
Article
Full-text available
Transformations can help small sample likelihood/Bayesian inference by improving the approximate normality of the likelihood/posterior. In this article we investigate when one can expect an improvement for a one-dimensional random function (Gaussian process) model. The log transformation of the range parameter is compared with an alternative (the...
Article
We propose a profile likelihood approach to linear clustering which explores potential linear clusters in a data set. For each linear cluster, an errors-in-variables model is assumed. The optimization of the derived profile likelihood can be achieved by an EM algorithm. Its asymptotic properties and its relationships with several existing clusteri...
