Shi Zhong, PhD
Google Inc. | Google · Engineering Department
About
37 Publications
12,829 Reads
2,436 Citations
Additional affiliations
May 2014 - present
November 2009 - May 2014
Adometry Inc.
Position: Principal Investigator
August 2003 - December 2005
Publications (37)
Like any marketing campaign, online advertisement campaigns need to be monitored, analysed and optimised. Quantitative methods are even more crucial for online campaigns because of their dynamic pricing and highly interactive nature. Not only can marketing effectiveness be measured almost instantly in terms of measures such as click-through rate and...
Existing search schemes in unstructured peer-to-peer (P2P) networks are either blind or informed by simple heuristics. Blind schemes suffer from low query quality, and simple heuristics lack the theoretical grounding to support their simulation results. In this paper, we propose an intelligent searching scheme, called intelligent search by reinforcement lea...
Like any marketing campaign, online advertisement campaigns need to be monitored, analyzed and optimized. This is all the more true for online campaigns because online advertisements are usually sold in auction style. Prices can change very dynamically; the creatives, the landing pages and the targeting profiles can all be changed frequently to improve the ef...
Recently, data mining methods have gained importance in addressing network security issues, including network intrusion detection, a challenging task in network security. Intrusion detection systems aim to identify attacks with a high detection rate and a low false alarm rate. Classification-based data mining models for intrusion detection are ofte...
Semi-supervised clustering employs limited supervision, in the form of labeled instances or pairwise instance constraints, to aid unsupervised clustering and often significantly improves clustering performance. Despite the vast amount of expert effort spent on this problem, most existing work is not designed for handling high-dimensional spars...
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in p...
Intrusion detection in wireless networks has become an indispensable component of any useful wireless network security system, and has recently gained attention in both the research and industry communities due to the widespread use of wireless local area networks (WLANs). This paper focuses on detecting intrusions or anomalous behaviors in WLANs with dat...
This paper presents a detailed empirical study of twelve generative approaches to text clustering obtained by applying four types of document-to-model assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomials, and von Mi...
Clustering data streams is a relatively new research topic that has emerged from many real data mining applications and has attracted a lot of research attention. However, there is little work on clustering high-dimensional streaming text data. This paper combines an efficient online spherical k-means (OSKM) algorithm with an existing scalable cluster...
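As a rough illustration of the online spherical k-means idea referred to above, the sketch below shows one incremental update step, assuming unit-normalized tf-idf document vectors and a user-supplied, decaying learning rate; the exact OSKM variant and the stream-clustering framework it is combined with in the paper are not reproduced here.

    import numpy as np

    def online_spherical_kmeans_step(x, centroids, eta):
        """One online update: assign document x to the closest cluster mean by
        cosine similarity, then move that mean toward x and re-normalize it.

        x         : unit-length document vector, shape (d,)
        centroids : unit-length cluster means, shape (k, d), updated in place
        eta       : learning rate in (0, 1], typically decayed over time
        """
        sims = centroids @ x                          # cosine similarities (unit vectors)
        j = int(np.argmax(sims))                      # winner-take-all assignment
        centroids[j] += eta * x                       # pull the winning mean toward x
        centroids[j] /= np.linalg.norm(centroids[j])  # project back onto the unit sphere
        return j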
This paper presents a technique that improves the accuracy of classification models by enhancing the quality of training data. The idea is to eliminate instances that are likely to be noisy, and train classification models on "clean" data. Our approach uses 25 different classification techniques to create an ensemble classifier to filter noise. Usi...
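The snippet above mentions an ensemble of 25 classification techniques used to filter noisy training instances; as a hedged sketch of the general idea only (majority-vote filtering on out-of-fold predictions, with a small illustrative ensemble rather than the 25 learners used in the paper), one might write:

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def ensemble_noise_filter(X, y, vote_threshold=0.5, cv=5):
        """Return a boolean mask of instances kept as 'clean'. An instance is
        flagged as likely noise when at least `vote_threshold` of the base
        learners disagree with its given label on out-of-fold predictions."""
        y = np.asarray(y)
        learners = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
        disagreements = np.zeros(len(y))
        for clf in learners:
            preds = cross_val_predict(clf, X, y, cv=cv)  # out-of-fold predictions
            disagreements += (preds != y)
        return disagreements / len(learners) < vote_threshold

    # Usage: keep = ensemble_noise_filter(X_train, y_train)
    #        final_model.fit(X_train[keep], y_train[keep])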
Using unlabeled data to help supervised learning has become an increasingly attractive methodology and has proven to be effective in many applications. This paper applies semi-supervised classification algorithms, based on hidden Markov models (HMMs), to classify sequences. For model-based classification, semi-supervised learning amounts to using both...
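As a rough, simplified illustration of model-based sequence classification with HMMs and unlabeled data, the sketch below fits one Gaussian HMM per class on labeled sequences, pseudo-labels unlabeled sequences by maximum log-likelihood, and refits. The paper's algorithms are EM-based; this single self-training round, the Gaussian emission model, the third-party hmmlearn package, and the data layout (a dict mapping each class label to a list of (T, d) observation arrays) are all assumptions made for the example.

    import numpy as np
    from hmmlearn import hmm  # third-party package (pip install hmmlearn)

    def fit_class_hmms(seqs_by_class, n_states=3):
        """Fit one Gaussian HMM per class from lists of (T, d) sequences."""
        models = {}
        for label, seqs in seqs_by_class.items():
            X, lengths = np.vstack(seqs), [len(s) for s in seqs]
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
            m.fit(X, lengths)
            models[label] = m
        return models

    def classify(models, seq):
        """Pick the class whose HMM assigns the highest log-likelihood."""
        return max(models, key=lambda label: models[label].score(seq))

    def semi_supervised_round(labeled, unlabeled, n_states=3):
        """One self-training round: pseudo-label unlabeled sequences, then refit."""
        models = fit_class_hmms(labeled, n_states)
        augmented = {c: list(seqs) for c, seqs in labeled.items()}
        for seq in unlabeled:
            augmented[classify(models, seq)].append(seq)
        return fit_class_hmms(augmented, n_states)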
Assuring whether the desired software quality and reliability is met for a project is as important as delivering it within scheduled budget and time. This is especially vital for high-assurance software systems where software failures can have severe consequences. To achieve the desired software quality, practitioners utilize software quality model...
The increasing reliance upon wireless networks has put tremendous emphasis on wireless network security. While considerable attention has been given to data mining for intrusion detection in wired networks, limited focus has been devoted to data mining for intrusion detection in wireless networks. This study presents a clustering approach with tra...
In many practical classification problems, mislabeled data instances (i.e., class noise) exist in the acquired (training) data and often have a detrimental effect on the classification performance. Identifying such noisy instances and removing them from training data can significantly improve the trained classifiers. One such effective noise dete...
The spherical k-means algorithm, i.e., the k-means algorithm with cosine similarity, is a popular method for clustering high-dimensional text data. In this algorithm, each document as well as each cluster mean is represented as a high-dimensional unit-length vector. However, it has been mainly used in batch mode. That is, each cluster mean vector i...
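A minimal batch spherical k-means sketch follows; the random initialization, the fixed iteration count, and the assumption that rows of X are already unit-normalized (e.g., tf-idf vectors) are choices made for the example, and the online variant developed in the paper is not shown.

    import numpy as np

    def spherical_kmeans(X, k, n_iter=20, seed=0):
        """Batch spherical k-means on unit-length rows of X; cluster means are
        kept on the unit sphere and similarity is the dot (cosine) product."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: nearest mean by cosine similarity.
            labels = np.argmax(X @ centroids.T, axis=1)
            # Re-estimation step: sum of assigned documents, re-normalized.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    m = members.sum(axis=0)
                    centroids[j] = m / (np.linalg.norm(m) + 1e-12)
        return labels, centroids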
This work presents a novel information retrieval (IR) tool, designed to help VLSI defect and yield engineers identify potentially defective layout regions. Given a query defect pattern discovered in a manufacturing process, this tool can be used to return similar layout regions in one or more designs ranked by similarity to the query pattern. Defec...
For software quality estimation, software development practitioners typically construct quality-classification or fault prediction models using software metrics and fault data from a previous system release or a similar software project. Engineers then use these models to predict the fault proneness of software modules in development. Software qual...
Current software quality estimation models often involve using supervised learning methods to train a software quality classifier or a software fault prediction model. In such models, the dependent variable is a software quality measurement indicating the quality of a software module by either a risk-based class membership (e.g., whether it is faul...
A software quality estimation model is often built using known software metrics and fault data obtained from program modules of previously developed releases or similar projects. Such a supervised learning approach to software quality estimation assumes that fault data is available for all the previously developed modules. Considering the various p...
Using unlabeled data to help supervised learning has become an increasingly attractive methodology and has proven to be effective in many applications. This paper applies semi-supervised classification algorithms, based on hidden Markov models, to classify sequences. For model-based classification, semi-supervised learning amounts to using both labeled...
Balanced clustering algorithms can be useful in a variety of applications and have recently attracted increasing research interest. Most recent work, however, has addressed only hard balancing by constraining each cluster to have an equal or a certain minimum number of data objects. We provide a soft balancing strategy built upon a soft mixture-of-models...
Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based cluster...
Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This p...
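For reference, the von Mises-Fisher density on the unit sphere and its link to cosine similarity can be written as follows (a standard textbook form, not quoted from the paper):

    f(x \mid \mu, \kappa) = c_d(\kappa)\,\exp\!\big(\kappa\,\mu^{\top} x\big),
    \qquad \|x\| = \|\mu\| = 1,\ \kappa \ge 0,
    \qquad
    c_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},

where I_{d/2-1} is the modified Bessel function of the first kind. In a mixture of vMF components with equal priors and a shared concentration κ, the posterior for a document x depends only on the cosine similarity μ_h^T x, so hard (winner-take-all) assignments reduce to cosine-similarity assignments, which is the sense in which spherical k-means arises as a special case.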
This paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. Partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process---iterative model re-estimation and sample re-assignment. Instead of a maxim...
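As a hedged sketch of one balance-constrained re-assignment step (not the paper's formulation): costs can be taken as negative log-likelihoods under each cluster's current model, and a strictly equal-size assignment can be obtained by matching samples to replicated cluster "slots" with the Hungarian algorithm, assuming the number of samples is divisible by the number of clusters.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def balanced_assignment(cost, k):
        """Assign n samples to k equal-size clusters given an (n, k) cost matrix
        (e.g., cost[i, j] = negative log-likelihood of sample i under model j)."""
        n = cost.shape[0]
        slots = n // k                                   # assumes n divisible by k
        # Replicate each cluster column into `slots` columns so the problem
        # becomes a square minimum-cost bipartite matching.
        expanded = np.repeat(cost, slots, axis=1)
        rows, cols = linear_sum_assignment(expanded)
        labels = np.empty(n, dtype=int)
        labels[rows] = cols // slots                     # map slot index back to cluster
        return labels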
Abstract not available (Electrical and Computer Engineering).
A variety of Coupled HMMs (CHMMs) have recently been proposed as extensions of HMMs to better characterize multiple interdependent sequences. This paper introduces a novel distance-coupled HMM. It then compares the performance of several HMM and CHMM models for a multi-channel EEG classification problem. The results show that, of all approaches exam...
This paper introduces a novel classifier architecture that exploits both the support vector concept and the online learning of feedforward neural networks. It uses an MLP network to directly determine a nonlinear decision boundary in the input space, and uses only the data points within a boundary margin for further training. This is shown to be equival...
This paper describes wavelet thresholding for image denoising under the framework provided by statistical learning theory, a.k.a. Vapnik-Chervonenkis (VC) theory. Under the framework of VC theory, wavelet thresholding amounts to ordering wavelet coefficients according to their relevance to accurate function estimation, followed by discarding insi...
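A minimal wavelet-thresholding denoising sketch is given below, using the third-party PyWavelets package with soft thresholding and a user-chosen threshold value; the VC-theory-based coefficient ordering and model-selection criterion that the paper develops are not reproduced.

    import pywt  # third-party package (pip install PyWavelets)

    def wavelet_denoise(image, wavelet="db4", level=3, threshold=20.0):
        """Decompose the image, shrink detail coefficients toward zero with soft
        thresholding, and reconstruct; approximation coefficients are untouched."""
        coeffs = pywt.wavedec2(image, wavelet, level=level)
        approx, details = coeffs[0], coeffs[1:]
        shrunk = [
            tuple(pywt.threshold(band, threshold, mode="soft") for band in triple)
            for triple in details
        ]
        return pywt.waverec2([approx] + shrunk, wavelet)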
In this paper, we model the MPEG-4 video encoder using computational process networks (CPN), which is a determinate and concurrent computation model, and implement a scalable software-based encoder in C++ and portable operating system interface (POSIX) threads under the framework proposed by Allen and Evans (see IEEE Trans. Signal Proc., vol.48, no...
Multilayer perceptron (MLP) networks have been successfully applied to many practical problems because of their nonlinear mapping ability. However, there are many factors that may affect the generalization ability of MLP networks, such as the number of hidden units, the initial values of weights and the stopping rules. These factors, if improperly ch...
Software quality estimation models, used to predict the fault-proneness of software modules based on software metrics, are often constructed by training a classifier from labeled software metrics data. Two challenges often encountered in building an accurate model are the presence of "noisy" data and the possible unavailability of fault-proneness...
This paper provides a general formulation of probabilistic model-based clustering with deterministic annealing (DA), which leads to a unifying analysis of k-means, EM clustering, soft competitive learning algorithms (e.g., self-organizing map), and information bottleneck. The analysis points out an interesting yet not well-recognized connection...
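For orientation, the standard deterministic-annealing clustering objective and its Gibbs assignment rule can be written as follows (a textbook formulation, not quoted from the paper):

    F = \sum_{x}\sum_{k} p(k \mid x)\, d(x, \mu_k)
        \; + \; T \sum_{x}\sum_{k} p(k \mid x) \log \frac{p(k \mid x)}{p(k)},
    \qquad
    p(k \mid x) = \frac{p(k)\, e^{-d(x, \mu_k)/T}}{\sum_{k'} p(k')\, e^{-d(x, \mu_{k'})/T}},

where d is a model-dependent distortion and the temperature T is lowered over iterations. As T goes to 0 the assignments become hard and one recovers k-means-style clustering; with d(x, μ_k) = -log p(x | θ_k) and T = 1 the assignment rule is exactly the EM posterior.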
In this paper, we summarize the motivation for diameter-constrained (D-constrained) clustering, and present experimental results to demonstrate the effectiveness of D- constrained clustering in keeping data records close to their cluster centers. We formulate D-constrained clustering as a competitive learning problem, which is solved using the stan...
The MPEG-4 standard provides support for content-based interactivity, high compression, and/or universal accessibility and portability of audio and video content. Due to its content-based representation nature (except for the simple profile used for wireless video communication) and flexible configuration structure, any MPEG-4 hardware implementat...
Questions
Question (1)
There are often spurious correlations between independent variables and the dependent variable. How do we avoid selecting such variables, or train predictive models in a way that ensures such variables do not receive significant weights?