Jiawei Han

University of Illinois, Urbana-Champaign | UIUC · Department of Computer Science

About

402
Publications
109,658
Reads
63,902
Citations


Publications (402)
Preprint
Full-text available
Entity linking (EL) is the process of linking entity mentions appearing in web text with their corresponding entities in a knowledge base. EL plays an important role in the fields of knowledge engineering and data mining, underlying a variety of downstream applications such as knowledge base population, content analysis, relation extraction, and qu...
Article
Twitter, a microblogging platform, has developed into an increasingly invaluable information source, where millions of users post a great quantity of tweets with various topics per day. Heterogeneous information networks consisting of multi-type objects and relations are becoming more and more prevalent as an organization form of knowledge and info...
Article
Entity linking (EL) is the process of linking entity mentions appearing in web text with their corresponding entities in a knowledge base. EL plays an important role in the fields of knowledge engineering and data mining, underlying a variety of downstream applications such as knowledge base population, content analysis, relation extraction, and qu...
Preprint
Full-text available
Extracting entities and their relations from text is an important task for understanding massive text corpora. Open information extraction (IE) systems mine relation tuples (i.e., entity arguments and a predicate string to describe their relation) from sentences, and do not confine to a pre-defined schema for the relations of interests. However, cu...
Chapter
The wide proliferation of GPS-enabled mobile devices and the rapid development of sensing technology have nurtured explosive growth of semantics-enriched spatiotemporal (SeST) data. Compared to traditional spatiotemporal data like GPS traces and RFID data, SeST data is multidimensional in nature as each SeST object involves location, time, and tex...
Article
Heterogeneous information networks that consist of multi-type, interconnected objects are becoming increasingly popular, such as social media networks and bibliographic networks. The task of linking named entity mentions detected from unstructured Web text with their corresponding entities in a heterogeneous information network is of practical impo...
Conference Paper
Full-text available
The availability of massive geo-annotated social media data sheds light on studying human mobility patterns. Among them, the periodic pattern, i.e., an individual visiting a geographical region at some specific time interval, has been recognized as one of the most important. Mining periodic patterns has a variety of applications, such as location predi...
Conference Paper
Recent years have witnessed an astonishing growth of crowd-contributed data, which has become a powerful information source that covers almost every aspect of our lives. This big treasure trove of information has fundamentally changed the ways in which we learn about our world. Crowdsourcing has attracted considerable attention with various approa...
Conference Paper
Full-text available
Understanding human mobility is of great importance to various applications, such as urban planning, traffic scheduling, and location prediction. While there has been fruitful research on modeling human mobility using tracking data (e.g., GPS traces), the recent growth of geo-tagged social media (GeoSM) brings new opportunities to this task because...
Article
Protein-protein interaction (PPI) networks, providing a comprehensive landscape of protein interacting patterns, enable us to explore biological processes and cellular components at multiple resolutions. For a biological process, a number of proteins need to work together to perform the job. Proteins densely interact with each other, forming large...
Article
Entity Extraction is a process of identifying meaningful entities from text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging on enterprise domains due to several reasons. First, the lack of redu...
Article
Full-text available
As a fundamental task, document similarity measure has broad impact to document-based classification, clustering and ranking. Traditional approaches represent documents as bag-of-words and compute document similarities using measures like cosine, Jaccard, and dice. However, entity phrases rather than single words in documents can be critical for ev...
Article
A major event often has repercussions on both news media and microblogging sites such as Twitter. Reports from mainstream news agencies and discussions from Twitter complement each other to form a complete picture. An event can have multiple aspects (sub-events) describing it from multiple angles, each of which attracts opinions/comments posted on...
Article
This paper presents a novel research problem on joint discovery of commonalities and differences between two individual documents (or document sets), called Comparative Document Analysis (CDA). Given any pair of documents from a document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both do...
Conference Paper
Random walk with restart (RWR) is widely recognized as one of the most important node proximity measures for graphs, as it captures the holistic graph structure and is robust to noise in the graph. In this paper, we study a novel query based on the RWR measure, called the inbound top-k (Ink) query. Given a query node q and a number k, the Ink query...
Conference Paper
With the increasing use of GPS-enabled mobile phones, geo-tagging, which refers to adding GPS information to media such as micro-blogging messages or photos, has seen a surge in popularity recently. This enables us to not only browse information based on locations, but also discover patterns in the location-based behaviors of users. Many techniques...
Conference Paper
Recent years have witnessed the wide proliferation of geo-sensory applications wherein a bundle of sensors are deployed at different locations to cooperatively monitor the target condition. Given massive geo-sensory data, we study the problem of mining spatial co-evolving patterns (SCPs), i.e., groups of sensors that are spatially correlated and co...
Article
Full-text available
We describe here the vision, motivations, and research plans of the National Institutes of Health Center for Excellence in Big Data Computing at the University of Illinois, Urbana-Champaign. The Center is organized around the construction of "Knowledge Engine for Genomics" (KnowEnG), an E-science framework for genomics where biomedical scientists w...
Article
Huge volumes of biomedical text data discussing different biomedical entities are being generated every day. Hidden in those unstructured data are the strong relevance relationships between those entities, which are critical for many interesting applications including building knowledge bases for the biomedical domain and semantic search amon...
Article
What is happening around the world? When and where? Mining the geo-tagged Twitter stream makes it possible to answer the above questions in real-time. Although a single tweet can be short and noisy, proper aggregations of tweets can provide meaningful results. In this paper, we focus on hierarchical spatio-temporal hashtag clustering techniques. Ou...
Article
The "big data" era is characterized by an explosion of information in the form of digital data collections, ranging from scientific knowledge, to social media, news, and everyone's daily life. Examples of such collections include scientific publications, enterprise logs, news articles, social media, and general web pages. Valua...
Article
The problem of community detection has recently been studied widely in the context of the web and social media networks. Most algorithms for community detection assume that the entire network is available for online analysis. In practice, this is not really true, because only restricted portions of the network may be available at any given time for...
Article
The large number of potential applications from bridging web data with knowledge bases has led to an increase in entity linking research. Entity linking is the task of linking entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge b...
Article
Online learning on a graph is appealing due to its efficiency. However, existing online learning algorithms on a graph are limited to binary classification. Moreover, they require accessing the full label information, where the label oracle needs to return the true class label after the learner makes classification of each node. In many application...
Article
Mining outliers in a heterogeneous information network is a challenging problem: It is even unclear what should be outliers in a large heterogeneous network (e.g., Outliers in the entire bibliographic network consisting of authors, titles, papers and venues). In this study, we propose an interesting class of outliers, query-based sub network outlie...
Article
Mining network evolution has emerged as an intriguing research topic in many domains such as data mining, social networks, and machine learning. While the bulk of research has focused on mining evolutionary patterns of homogeneous networks (e.g., networks of friends), most real-world networks are heterogeneous, containing objects of differen...
Conference Paper
Full-text available
Topic models such as Latent Dirichlet Allocation have been useful text analysis methods of wide interest. Recently, moment-based inference with provable performance has been proposed for topic models. Compared with inference algorithms that approximate the maximum likelihood objective, moment-based inference has theoretical guarantee in recovering...
Article
The development in positioning technology has enabled us to collect a huge amount of movement data from moving objects, such as humans, animals, and vehicles. The data embed rich information about the relationships among moving objects and have applications in many fields, e.g., in ecological study and human behavioral study. Previously, we have pro...
Book
This comprehensive reference consists of 18 chapters from prominent researchers in the field. Each chapter is self-contained, and synthesizes one aspect of frequent pattern mining. An emphasis is placed on simplifying the content, so that students and practitioners can benefit from the book. Each chapter contains a survey describing key research on...
Chapter
Sequential pattern mining, which discovers frequent subsequences as patterns in a sequence database, has been a focused theme in data mining research for over a decade. This problem has broad applications, such as mining customer purchase patterns and Web access patterns. However, it is also a challenging problem since the mining may have to genera...
Article
This chapter surveys recent debugging tools for sensor networks that are inspired by data mining algorithms. These tools are motivated by the increased complexity and scale of sensor network applications, making it harder to identify root causes of system problems. At a high level, debugging solutions in the domain of sensor networks can be classif...
Article
Full-text available
While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post processing to the inference results of unigram-based topic models, or utilizes compl...
Article
Full-text available
Heterogeneous information networks that consist of multi-type, interconnected objects are becoming ubiquitous and increasingly popular, such as social media networks and bibliographic networks. The task to link named entity mentions detected from the unstructured Web text with their corresponding entities existing in a heterogeneous information net...
Article
Full-text available
Variation is key to the adaptability of species and their ability to survive changes to the Earth's climate and habitats. Plasticity in movement strategies allows a species to better track spatial dynamics of habitat quality. We describe the mechanisms that shape the movement of a long-distance migrant bird (turkey vulture, Cathartes aura) across t...
Conference Paper
Emerging trends and products pose a challenge to modern search engines since they must adapt to the constantly changing needs and interests of users. For example, vertical search engines, such as Amazon, eBay, Walmart, Yelp and Yahoo! Local, provide business category hierarchies for people to navigate through millions of business listings. The cate...
Conference Paper
Ranking in response to user queries is a central problem in information retrieval, data mining, and machine learning. In the era of "Big data", traditional effectiveness-centric ranking techniques tend to get more and more costly (requiring additional hardware and energy costs) to sustain reasonable ranking speed on large data. The mentality of com...
Article
In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have e...
Article
Full-text available
Automated generation of high-quality topical hierarchies for a text collection is a dream problem in knowledge engineering with many valuable applications. In this paper a scalable and robust algorithm is proposed for constructing a hierarchy of topics from a text collection. We divide and conquer the problem using a top-down recursive framework, b...
Article
In this paper, we study the statistical performance of robust tensor decomposition with gross corruption. The observations are noisy realization of the superposition of a low-rank tensor W∗ and an entrywise sparse corruption tensor ν∗. Unlike conventional noise with bounded variance in previous convex tensor decomposition analysis, the magnitude of...
Conference Paper
With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, how to handle such big data poses many challenging issues to researchers in data and information systems. The participants of CIKM 2013 are active researchers on large scale data, information and knowledge management, from mu...
Conference Paper
Taxicabs equipped with real-time location sensing devices are becoming increasingly popular. Such location traces are a rich source of information and can be used for congestion pricing, taxicab placement, and improved city planning. An important problem in enabling these applications is to identify human mobility patterns from the taxicab traces, whi...
Article
Information networks that can be extracted from many domains have been widely studied recently. Different functions for mining these networks have been proposed and developed, such as ranking, community detection, and link prediction. Most existing network studies are on homogeneous networks, where nodes and links are assumed to be of a single type. In reality,...
Article
A number of probabilistic methods such as LDA, hidden Markov models, Markov random fields have arisen in recent years for probabilistic analysis of text data. This chapter provides an overview of a variety of probabilistic models for text mining. The chapter focuses more on the fundamental probabilistic techniques, and also covers their various app...
Article
Full-text available
Background The movement of animals is strongly influenced by external factors in their surrounding environment such as weather, habitat types, and human land use. With advances in positioning and sensor technologies, it is now possible to capture animal locations at high spatial and temporal granularities. Likewise, scientists have an increasing ac...
Article
Full-text available
Data stream classification poses many challenges to the data mining community. In this paper, we address four such major challenges, namely, infinite length, concept-drift, concept-evolution, and feature-evolution. Since a data stream is theoretically infinite in length, it is impractical to store and use all the historical data for training. Conce...
Conference Paper
Full-text available
Collective intelligence, which aggregates the shared information from large crowds, is often negatively impacted by unreliable information sources with low-quality data. This becomes a barrier to the effective use of collective intelligence in a variety of applications. In order to address this issue, we propose a probabilistic model to jointly...
Conference Paper
Full-text available
Twitter is a microblogging website that has been useful as a source for human social behavioral analysis, such as political sentiment analysis, user influence, and spread of news. In this paper, we discuss a text cube approach to studying different kinds of human, social and cultural behavior (HSCB) embedded in the Twitter stream. Text cube is a ne...
Conference Paper
Full-text available
A set of behavior rules, personal characteristics, group affiliations and roles was used to generate a dataset of mixed communication actions modeling those at a large organization. Several different approaches to community detection and modeling were applied to this generated dataset, in order to compare the strengths and range of applicability of...
Article
Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also furnished with tremendous amounts of product-related images. In addition, images in such social networks are also accompanied by annotations, comments, and other infor...
Conference Paper
Medical literature has been an important information source for clinical professionals. As the body of medical literature expands rapidly, keeping this knowledge up-to-date becomes a challenge for medical professionals. One question is that for a given disease how can we find the most influential treatments currently available from online medical p...
Conference Paper
Full-text available
Link prediction is an important task in network analysis, benefiting researchers and organizations in a variety of fields. Many networks in the real world, for example social networks, are heterogeneous, having multiple types of links and complex dependency structures. Link prediction in such networks must model the influence propagating between he...
Conference Paper
Active learning on graphs has received increasing interest in the past years. In this paper, we propose a nonadaptive active learning approach on graphs, based on generalization error bound minimization. In particular, we present a data-dependent error bound for a graph-based learning method, namely learning with local and global consistenc...
Conference Paper
Full-text available
The recent development of social media (e.g., Twitter, Facebook, blogs, etc.) provides an unprecedented opportunity to study human social cultural behaviors. These data sources provide rich structured data (e.g., XML, relational tables, and categorical data) as well as unstructured data (e.g., texts). A significant challenge is to summarize and nav...
Article
Influence is a complex and subtle force that governs social dynamics and user behaviors. Understanding how users influence each other can benefit various applications, e.g., viral marketing, recommendation, and information retrieval. While prior work has mainly focused on qualitative aspects, in this article, we present our research in quantitat...
Article
In this paper, we give an overview of knowledge discovery (KD) and information extraction (IE) techniques on the World Wide Web (WWW). We intend to answer the following questions: What kind of additional uncertainty challenges are introduced by the WWW setting to basic KD and IE techniques? What are the fundamental techniques that can be used to re...
Conference Paper
Mining directly on the existing networks formed by explicit webpage links on the World-Wide Web may not be so fruitful due to the diversity and semantic heterogeneity of such web-links. However, construction of service-oriented, semi-structured information networks from the Web and mining on such networks may lead to many exciting discoveries of us...
Article
Full-text available
Newly emerged event-based online social services, such as Meetup and Plancast, have experienced increased popularity and rapid growth. From these services, we observed a new type of social network - event-based social network (EBSN). An EBSN does not only contain online social interactions as in other conventional online social networks, but also i...
Article
Most objects and data in the real world are interconnected, forming complex, heterogeneous but often semi-structured information networks. However, most people consider a database merely as a data repository that supports data storage and retrieval rather than one or a set of heterogeneous information networks that contain rich, inter-related, mult...
Article
Real-world physical and abstract data objects are interconnected, forming gigantic, interconnected networks. By structuring these data objects and interactions between these objects into multiple types, such networks become semi-structured heterogeneous information networks. Most real-world applications that handle big data, including interconnecte...
Article
Full-text available
In this work, we develop a simple algorithm for semi-supervised regression. The key idea is to use the top eigenfunctions of integral operator derived from both labeled and unlabeled examples as the basis functions and learn the prediction function by a simple linear regression. We show that under appropriate assumptions about the integral operator...
Article
Different from traditional one-sided clustering techniques, co-clustering makes use of the duality between samples and features to partition them simultaneously. Most of the existing co-clustering algorithms focus on modeling the relationship between samples and features, whereas the intersample and interfeature relationships are ignored. In this pa...
Article
Full-text available
Many organizations have large text repositories that contain information that is mission critical to the organization. NASA, for example, operates a safety reporting system known as the Aviation Safety Reporting System (ASRS) which collects voluntarily submitted aviation safety incident/situation reports from pilots, controllers, and others with th...
Article
We consider the problem of active learning over the vertices in a graph, without feature representation. Our study is based on the common graph smoothness assumption, which is formulated in a Gaussian random field model. We analyze the probability distribution over the unlabeled vertices conditioned on the label information, which is a multivari...
Article
Full-text available
Locality Preserving Indexing (LPI) has been quite successful in tackling document analysis problems, such as clustering or classification. The approach relies on the Locality Preserving Criterion, which preserves the locality of the data points. However, LPI takes every word in a data corpus into account, even though many words may not be usefu...
Article
Search engines have become the de facto place to start information acquisition on the Web. However, due to the web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high-quality search even more challenging. Over the last decade research on adversarial information retrieval has gained a...
Article
With the rapid development of computer and information technology in the last several decades, an enormous amount of data in science and engineering has been and will continuously be generated in massive scale, either being stored in gigantic storage devices or flowing into and out of the system in the form of data streams. Moreover, such data has...
Article
Full-text available
Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to their scores under the Fisher criterion, which leads to a suboptimal subset of features. In this paper, we present a generalized Fisher score to jointly select features. It aims at finding a subset of featur...
Article
Full-text available
The fast development of multimedia technology and increasing availability of network bandwidth has given rise to an abundance of network data as a result of all the ever-booming social media and social websites in recent years, e.g., Flickr, Youtube, MySpace, Facebook, etc. Social network analysis has therefore become a critical problem attracting...
Article
Full-text available
Search engines have become the de facto place to start information acquisition on the Web. However, due to the web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high-quality search even more challenging. Over the last decade research on adversarial information retrieval has gained...
Article
The fast information sharing on Twitter from millions of users all over the world leads to almost real-time reporting of events. It is extremely important for business and administrative decision makers to learn events' popularity as quickly as possible, as it can buy extra precious time for them to make informed decisions. Therefore, we introduce...
Article
Full-text available
Previous studies on supporting free-form keyword queries over RDBMSs provide users with linked structures (e.g., a set of joined tuples) that are relevant to a given keyword query. Most of them focus on ranking individual tuples from one table or joins of multiple tables containing a set of keywords. In this paper, we study the problem of keyword s...
Conference Paper
The prevalence of Web 2.0 techniques has led to the boom of various online communities, where topics spread ubiquitously among user-generated documents. Working together with this diffusion process is the evolution of topic content, where novel contents are introduced by documents which adopt the topic. Unlike explicit user behavior (e.g., buying...
Conference Paper
Full-text available
Patents are of crucial importance for businesses, because they provide legal protection for the invented techniques, processes or products. A patent can be held for up to 20 years. However, large maintenance fees need to be paid to keep it enforceable. If the patent is deemed not valuable, the owner may decide to abandon it by stopping paying the m...
Conference Paper
Full-text available
This paper studies the problem of latent periodic topic analysis from time-stamped documents. The examples of time-stamped documents include news articles, sales records, financial reports, TV programs, and more recently, posts from social media websites such as Flickr, Twitter, and Facebook. Different from detecting periodic patterns in tradition...
Article
The identification and characterization of traffic anomalies on massive road networks is a vital component of traffic monitoring and control. Anomaly identification can be used to reduce congestion, increase safety, and provide transportation engineers with better information for traffic forecasting and road network design. However, because of the...
Conference Paper
Traditional feature selection methods assume that the data are independent and identically distributed (i.i.d.). However, in the real world, there is a tremendous amount of data distributed in networks. Existing feature selection methods are not suited for networked data because the i.i.d. assumption no longer holds. This motivates us to st...
Conference Paper
Full-text available
Multi-label learning studies the problem where each instance is associated with a set of labels. There are two challenges in multi-label learning: (1) the labels are interdependent and correlated, and (2) the data are of high dimensionality. In this paper, we aim to tackle these challenges in one shot. In particular, we propose to learn the label c...
Conference Paper
Software bugs in distributed systems are notoriously hard to find due to the large number of components involved and the non-determinism introduced by race conditions between messages. This paper introduces Pop Mine, a tool for diagnosing corner-case bugs by finding the minimal causal directed acyclic graph (DAG) of events, spanning multiple proces...
Article
Full-text available
Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed...
Article
Different information sources publish information with different degrees of correctness and originality. False information can often result in considerable damage. Hence, trustworthiness of information is an important issue in this data-driven world economy. Reputation of different agents in a network has been studied earlier in a variety of domains...
Conference Paper
Full-text available
Social media is becoming increasingly ubiquitous and popular on the Internet. Due to the huge popularity of social media websites, such as Facebook, Twitter, YouTube and Flickr, many companies or public figures are now active in maintaining pages on those websites to interact with online users, attracting a large number of fans/followers by posting...