
Adam Kapelner
City University of New York - Queens College | QC CUNY · Department of Mathematics
Ph.D. Statistics, A.M. Statistics
55 Publications
12,771 Reads
1,724 Citations
Introduction
Additional affiliations
August 2014 - present
September 2009 - May 2014
June 2006 - February 2007
Publications (55)
The publishing industry shows marked evidence of both gender and racial discrimination. A rational explanation for this difference in treatment of both female and Black authors might relate to the taste-based preferences of book consumers, who might be less willing to pay for books by such authors. We ran a randomized experiment to test for the pre...
We present a new experimental design procedure that divides a set of experimental units into two groups in order to minimize error in estimating a treatment effect. One concern is the elimination of large covariate imbalance between the two groups before the experiment begins. Another concern is robustness of the design to misspecification in respo...
We present an optimized rerandomization design procedure for a non-sequential treatment-control experiment. Randomized experiments are the gold standard for finding causal effects in nature. But sometimes random assignments result in unequal partitions of the treatment and control group visibly seen as imbalance in observed covariates. There can ad...
Machine-assisted treatment selection commonly follows one of two paradigms: a fully personalized paradigm which ignores any possible clustering of patients; or a sub-grouping paradigm which ignores personal differences within the identified groups. While both paradigms have shown promising results, each of them suffers from important limitations. I...
We propose a dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in sequential randomized trials exploiting some subjects' previously assessed responses. Subjects arrive sequentially and are either randomized or paired to a previously randomized subject and administered the alternate treatment....
We present methodological advances in understanding the effectiveness of personalized medicine models and supply easy-to-use open-source software. Personalized medicine involves the systematic use of individual patient characteristics to determine which treatment option is most likely to result in a better average outcome for the patient. Why is pe...
Background
Sufficiently accurate predictions of hospital readmissions are necessary for the allocation of scarce clinical resources to reduce preventable readmissions. We describe the use of a data-driven approach that relies on machine learning algorithms to predict readmission at the time of discharge.
Methods
We employ random forests to clinical...
Purpose: Our work introduces a highly accurate, safe, and sufficiently explicable machine-learning (artificial intelligence) model of intraocular lens power (IOL) translating into better post-surgical outcomes for patients with cataracts. We also demonstrate its improved predictive accuracy over previous formulas.
Methods: We collected retrospectiv...
We present a new experimental design procedure that divides a set of experimental units into two groups in order to minimize error in estimating an additive treatment effect. One concern in minimizing error at the experimental design stage is large covariate imbalance between the two groups. Another concern is robustness of design to misspecificati...
Depression affects one in nine people, but treatment response rates remain low. There is significant potential in the use of computational modeling techniques to predict individual patient responses and thus provide more personalized treatment. Deep learning is a promising computational technique that can be used for differential treatment selectio...
We propose a dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in sequential randomized trials exploiting some subjects' previously assessed responses. Subjects arrive iteratively and are either randomized or paired to a previously randomized subject and administered the alternate treatment. T...
We consider the problem of evaluating designs for a two-arm randomized experiment with the criterion being the power of the randomization test for the one-sided null hypothesis. Our evaluation assumes a response that is linear in one observed covariate, an unobserved component and an additive treatment effect where the only randomness comes from th...
There is a long debate in experimental design between the classic randomization design of Fisher, Yates, Kempthorne, Cochran and those who advocate deterministic assignments based on notions of optimality. In non-sequential trials comparing treatment and control, covariate measurements for each subject are known in advance, and subjects can be divi...
Purpose
Improving vocabulary knowledge is important for many adolescents, but there are few evidence-based vocabulary instruction programs available for high school students. The purpose of this article is to describe the iterative development of the DictionarySquared research platform, a web-based vocabulary program that provides individualized vo...
Background
Depression affects one in nine people, but treatment response rates remain low. There is significant potential in the use of computational modelling techniques to predict individual patient responses and thus provide more personalized treatment. Deep learning is a promising computational technique that can be used for differential treatm...
We present an optimized rerandomization design procedure for a non-sequential treatment-control experiment. Randomized experiments are the gold standard for finding causal effects in nature. But sometimes random assignments result in unequal partitions of the treatment and control group, visibly seen as imbalanced observed covariates, increasing es...
We run a randomized experiment to examine gender discrimination in book purchasing with 2,544 subjects on Amazon’s Mechanical Turk. We manipulate author gender and book genre in a factorial design to study consumer preferences for male versus female versus androgynous authorship. Despite previous findings in the literature showing gender discrimina...
There is a movement in design of experiments away from the classic randomization put forward by Fisher, Cochran and others to one based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" b...
In traditional publishing, female authors’ titles command nearly half (45%) the price of male authors’ and are underrepresented in more prestigious genres, and books are published by publishing houses, which determine whose books get published, subject classification, and retail price. In the last decade, the growth of digital technologies and sal...
Vocabulary knowledge is essential to educational progress. High quality vocabulary instruction requires supportive contextual examples to teach word meaning and proper usage. Identifying such contexts by hand for a large number of words can be difficult. In this work, we take a statistical learning approach to engineer a system that predicts inform...
Objective:
In the absence of specific metabolic disorders, accurate predictors of response to ketogenic dietary therapies (KDTs) for treating epilepsy are largely unknown. We hypothesized that specific biochemical parameters would be associated with the effectiveness of KDT in humans with epilepsy. The parameters tested were β-hydroxybutyrate, ace...
We present a new experimental design procedure that divides a set of experimental units into two groups so that the two groups are balanced on a prespecified set of covariates while being almost as random as complete randomization. Under complete randomization, the difference in covariate balance as measured by the standardized average between treatm...
When measuring Henry's Law constants ($k_H$) using the phase ratio method via headspace gas chromatography (GC), the value of $k_H$ of the compound under investigation is calculated from the ratio of the slope to the intercept of a linear regression of the inverse GC response versus the ratio of gas to liquid volumes of a series of vials drawn...
We present a new package in R implementing Bayesian additive regression trees (BART). The package introduces many new features for data analysis using BART such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data and the ability to save trees for future prediction. It is significantly faster than the...
We present the task of predicting individual well-being, as measured by a life satisfaction scale, through the language people use on social media. Well-being, which encompasses much more than emotion and mood, is linked with good mental and physical health. The ability to quickly and accurately assess it can supplement multi-million dollar nationa...
Forecasts of prospective criminal behavior have long been an important feature of many criminal justice decisions. There is now substantial evidence that machine learning procedures will classify and forecast at least as well as, and typically better than, logistic regression, which has to date dominated conventional practice. However, machine learnin...
We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, wher...
Neoplasms are highly dependent on glucose as their substrate for energy production and are generally not able to catabolize other fuel sources such as ketones and fatty acids. Thus, removing access to glucose has the potential to starve cancer cells and induce apoptosis. Unfortunately, other body tissues are also dependent on glucose for energy und...
In medical practice, when more than one treatment option is viable, there is little systematic use of individual patient characteristics to estimate which treatment option is most likely to result in a better outcome for the patient. We introduce a new framework for using statistical models for personalized medicine. Our framework exploits (1) data...
We incorporate heteroskedasticity into Bayesian Additive Regression Trees (BART) by modeling the log of the error variance parameter as a linear function of prespecified covariates. Under this scheme, the Gibbs sampling procedure for the original sum-of-trees model is easily modified, and the parameters for the variance model are updated via a Met...
We propose a dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in fixed sample randomized trials with sequential allocation. Subjects arrive iteratively and are either randomized or paired via a matching criterion to a previously randomized subject and administered the alternate treatment. W...
This thesis develops methods for the analysis and design of crowdsourced experiments and crowdsourced labeling tasks. Much of this document focuses on applications including running natural field experiments, estimating the number of objects in images and collecting labels for word sense disambiguation. Observed shortcomings of the crowdsourced exp...
We present a new package in R implementing Bayesian Additive Regression Trees (BART). The package introduces many new features for data analysis using BART such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data and the ability to save trees for future prediction. It is significantly faster than the...
The variable selection problem is especially challenging in high dimensional data, where it is difficult to detect subtle individual effects and interactions between factors. Bayesian additive regression trees (BART, Chipman et al., 2010) provides a novel nonparametric exploratory alternative to parametric regression approaches, such as the lasso o...
Dendritic cells (DCs) are important mediators of anti-tumor immune responses. We hypothesized that an in-depth analysis of dendritic cells and their spatial relationships to each other as well as to other immune cells within tumor draining lymph nodes (TDLNs) could provide a better understanding of immune function and dysregulation in cancer.
We an...
This article presents Individual Conditional Expectation (ICE) plots, a tool for visualizing the model estimated by any supervised learning algorithm. Classical partial dependence plots (PDPs) help visualize the average partial relationship between the predicted response and one or more features. In the presence of substantial interaction effects,...
We present a method for incorporating missing data in non-parametric statistical learning without the need for imputation. We focus on a tree-based method, Bayesian Additive Regression Trees (BART), enhanced with "Missingness Incorporated in Attributes," an approach recently proposed for incorporating missingness into decision trees (Twala, 2008). This...
We propose a dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in sequential randomized trials. Subjects arrive iteratively and are either randomized or paired via a matching criterion to a previously randomized subject and administered the alternate treatment. We develop estimators for the...
We conduct the first natural field experiment to explore the relationship between the "meaningfulness" of a task and worker effort. We employed about 2,500 workers from Amazon's Mechanical Turk (MTurk), an online labor market, to label medical images. Although given an identical task, we experimentally manipulated how the task was framed. Subjects...
Spectral unmixing of a triple-stained lymph node section by Vectra™. (A) An original RGB image of a part of a tissue section, taken at 200× magnification. (B) Images resulting from unmixing of the spectral signals of each chromogen and counterstain. (C) A reconstructed image with pseudo-colors that allowed a greater distinction of the cell populat...
Proportions of both T cells and B cells were similar in TDLNs and HLNs used. (A) Proportions of T and B cells in HLNs and tumor-invaded TDLNs. (B) Proportions of T and B cells in ALN+ and ALN− pairs (p = 0.7 and 0.1, respectively; paired t test).
An RGB image of an entire TDLN cross section taken by Vectra™. In this particular example, the whole-section image consists of 125, 200× sub-images. Chromogens used were Vulcan Fast Red (cytokeratin (tumor), red), DAB (CD20(+) B cells, brown) and Ferangi Blue (CD3(+) T cells, dark blue). Cellular nuclei were counterstained with hematoxylin (light...
An illustration of L function plots that can be generated from T and B cell data within a tissue section. (A) Interpretations of each plot. (B) An extrapolation of an L function of B cells to another plot that illustrates how much more clustered B cells are compared to the T cells.
To date, pathological examination of specimens remains largely qualitative. Quantitative measures of tissue spatial features are generally not captured. To gain additional mechanistic and prognostic insights, a need for quantitative architectural analysis arises in studying immune cell-cancer interactions within the tumor microenvironment and tumor...
Supervised learning can be used to segment/identify regions of interest in images using both color and morphological information. A novel object identification algorithm was developed in Java to locate immune and cancer cells in images of immunohistochemically-stained lymph node tissue from a recent study published by Kohrt et al. (2005). The algorithms...
11059 Background: Clinical decisions in oncology are increasingly individualized and dependent upon accurate assessment of tumor biology, such as hormone receptor status. Imaging techniques are limited by subjective interpretation of staining patterns and limited tissue sampling leading to variability in patient care. We developed a novel imaging a...
Supervised learning can be used to segment/identify regions of interest in images making use of color and morphological information. A novel object identification algorithm was developed in Java to locate immune and cancer cells in images of immunohistochemically-stained lymph node tissue from the recent Kohrt study[1] and also shows promise in ot...
Researchers are increasingly using online labor markets such as Amazon's Mechanical Turk (MTurk) as a source of inexpensive data. One of the most popular tasks is answering surveys. However, without adequate controls, researchers should be concerned that respondents may fill out surveys haphazardly in the unsupervised environment of the Internet....