Bin Yu

University of California, Berkeley | UCB · Department of Statistics

About

18 Publications
3,034 Reads
754 Citations

Publications (18)
Article
The performance of spectral clustering can be considerably improved via regularization, as demonstrated empirically in Amini et al. [Ann. Statist. 41 (2013) 2097–2122]. Here, we provide an attempt at quantifying this improvement through theoretical analysis. Under the stochastic block model (SBM), and its extensions, previous results on spectral...
Article
Full-text available
Several months ago, Phil Bourne, the initiator and frequent author of the wildly successful and incredibly useful “Ten Simple Rules” series, suggested that some statisticians put together a Ten Simple Rules article related to statistics. (One of the rules for writing a PLOS Ten Simple Rules article is to be Phil Bourne [1]. In lieu of that, we hope...
Conference Paper
One challenge in big data analytics is the lack of tools to manage the complex interactions among code, data and parameters, especially in the common situation where all these factors can change a lot. We present our preliminary experience with DataLab, a system we built to manage the big data workflow. DataLab improves big data analytical workflow...
Article
Significance Despite the abundance of spatial gene expression data, extracting meaningful information to reveal how genes interact remains a challenge. We developed staNMF, a method that combines a powerful unsupervised learning algorithm, nonnegative matrix factorization (NMF), with a new stability criterion that selects the size of the dictionary...
Article
Full-text available
We study the theoretical properties of learning a dictionary from a set of $N$ signals $\mathbf x_i\in \mathbb R^K$ for $i=1,\dots,N$ via $\ell_1$-minimization. We assume that the signals $\mathbf x_i$ are generated as i.i.d. random linear combinations of the $K$ atoms from a complete reference dictionary $\mathbf D_0 \in \mathbb R^{K\times K}$. For...
Article
Full-text available
Crowdsourcing has become an effective and popular tool for human-powered computation to label large datasets. Since the workers can be unreliable, it is common in crowdsourcing to assign multiple workers to one task, and to aggregate the labels in order to obtain results of high quality. In this paper, we provide finite-sample exponential bounds on...
Conference Paper
We present CAGe, a statistical algorithm which exploits high sequence identity between sampled genomes and a reference assembly to streamline the variant calling process. Using a combination of changepoint detection, classification, and online variant detection, CAGe is able to call simple variants quickly and accurately on the 90-95% of a sampled...
Article
Full-text available
Site-specific transcription factors (TFs) bind DNA regulatory elements to control expression of target genes, forming the core of gene regulatory networks. Despite decades of research, most studies focus on only a small number of TFs and the roles of many remain unknown. We present a systematic characterization of spatiotemporal gene expression pat...
Chapter
Terry joined the Berkeley Statistics faculty in the summer of 1987 after being the statistics head of CSIRO in Australia. His office was just down the hallway from mine on the third floor of Evans. I was beginning my third year at Berkeley then and I remember talking to him in the hallway after a talk that he gave on information theory and the Mini...
Conference Paper
Full-text available
Information theory provides an attractive framework for attacking the neural coding problem. This entails estimating information theoretic quantities from neural spike train data. This paper highlights two issues that may arise: non-parametric entropy estimation and non-stationarity. It gives an overview of these issues and some of the progress tha...
Article
Full-text available
Information estimates such as the direct method of Strong, Koberle, de Ruyter van Steveninck, and Bialek (1998) sidestep the difficult problem of estimating the joint distribution of response and stimulus by instead estimating the difference between the marginal and conditional entropies of the response. While this is an effective estimation strate...
Article
Full-text available
Time-series segmentation in the fully unsupervised scenario in which the number of segment-types is a priori unknown is a fundamental problem in many applications. We propose a Bayesian approach to a segmentation model based on the switching linear Gaussian ...
Article
Data on 'neural coding' have frequently been analyzed using information-theoretic measures. These formulations involve the fundamental and generally difficult statistical problem of estimating entropy. We review briefly several methods that have been advanced to estimate entropy and highlight a method, the coverage-adjusted entropy estimator (CAE),...
Article
Full-text available
In this paper, we propose to monitor a Markov chain sampler using the cusum path plot of a chosen one-dimensional summary statistic. We argue that the cusum path plot can bring out, more effectively than the sequential plot, those aspects of a Markov sampler which tell the user how quickly or slowly the sampler is moving around in its sample space,...
Article
Various random fingerprinting methods are sometimes used to detect overlap between pairs of clones as a first step toward producing a minimal tiling path of clones for subsequent mapping and sequencing efforts. This paper evaluates and compares various statistical procedures for detecting pairwise overlap between clones when the fingerprints arise...
