About
113
Publications
25,032
Reads
5,611
Citations
Publications (113)
Large Language Models (LLMs) have shown strong general capabilities in many applications. However, how to make them reliable tools for some specific tasks such as automated short answer grading (ASAG) remains a challenge. We present SteLLA (Structured Grading System Using LLMs with RAG) in which a) Retrieval Augmented Generation (RAG) approach is u...
Word sense disambiguation (WSD) is one of the main challenges in Computational Linguistics. TreeMatch is a WSD system originally developed using data from SemEval 2007 Task 7 (Coarse-grained English All-words Task) that has been adapted for use in SemEval 2010 Task 17 (All-words Word Sense Disambiguation on a Specific Domain). The system is based o...
Mode collapse has been a persisting challenge in generative adversarial networks (GANs), and it directly affects the applications of GAN in many domains. Existing works that attempt to solve this problem have some serious limitations: models using optimal transport (OT) strategies (e.g., Wasserstein distance) lead to vanishing or exploding gradie...
With frequent reports of biased outcomes of AI systems, fairness rightfully becomes an active area of current ML research. However, while progress has been made on theoretical analysis and formulation of fairness as constraints on error probabilities, our ability to design and train modern deep learning models that reach the targeted fairness goals...
The study of model bias and variance with respect to decision boundaries is critically important in supervised learning and artificial intelligence. There is generally a tradeoff between the two, as fine-tuning of the decision boundary of a classification model to accommodate more boundary training samples (i.e., higher model complexity) may improv...
The discovery of Markov blanket (MB) for feature selection has attracted much attention in recent years, since the MB of the class attribute is the optimal feature subset for feature selection. However, almost all existing MB discovery algorithms focus on either improving computational efficiency or boosting learning accuracy, instead of both. In t...
Feature selection is important in many big data applications. Two critical challenges are closely associated with big data. First, in many big data applications, the dimensionality is extremely high, in millions, and keeps growing. Second, big data applications call for highly scalable feature selection algorithms in an online manner such that each feat...
Kui Yu, Xindong Wu, Wei Ding, [...], Hao Wang
Using Markov blankets in a Bayesian network for feature selection has received much attention in recent years. The Markov blanket of a class attribute in a Bayesian network is a unique yet minimal feature subset for optimal feature selection if the probability distribution of a data set can be faithfully represented by this Bayesian network. Ho...
This work describes algorithms for performing discrete object detection, specifically in the case of buildings, where usually only low quality RGB-only geospatial reflective imagery is available. We utilize new candidate search and feature extraction techniques to reduce the problem to a machine learning (ML) classification task. Here we can harnes...
One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of...
As an emerging research direction, online streaming feature selection deals with sequentially added dimensions in a feature space while the number of data instances is fixed. Online streaming feature selection provides a new, complementary algorithmic methodology to enrich online feature selection, especially targeting high dimensionality in big d...
Convolutional neural networks are sensitive to the random initialization of filters. We call this The Filter Lottery (TFL) because the random numbers used to initialize the network determine if you will "win" and converge to a satisfactory local minimum. This issue forces networks to contain more filters (be wider) to achieve higher accuracy becaus...
Craters are among the most studied geomorphic features in the Solar System because they yield important information about the past and present geological processes and provide information about the relative ages of observed geologic formations. We present a method for automatic crater detection using advanced machine learning to deal with the large...
What does the cost of academic publishing look like to the common researcher today? Our goal is to convey the current state of academic publishing, specifically with regard to the field of computer science, and provide analysis and data to be used as a basis for future studies. We will focus on author and reader costs as they are the primary points o...
Feature selection is important in many big data applications. Two critical challenges are closely associated with big data. Firstly, in many big data applications, the dimensionality is extremely high, in millions, and keeps growing. Secondly, big data applications call for highly scalable feature selection algorithms in an online manner such that each...
Reliable tornado forecasting with a long lead time can greatly support emergency response and is of vital importance for the economy and society. The large number of meteorological variables in spatiotemporal domains and the complex relationships among variables remain the top difficulties for long-lead tornado forecasting. Standard data mining a...
Many datasets from real-world applications have very high-dimensional or increasing feature space. It is a new research problem to learn and maintain a classifier to deal with very high dimensionality or streaming features. In this article, we adapt the well-known emerging-pattern-based classification models and propose a semi-streaming approach. F...
Continuous data extraction pipelines using wrappers have become common and integral parts of businesses dealing with stock, flight, or product information. Extracting data from websites that use HTML templates is difficult because available wrapper methods are not designed to deal with websites that change over time (the inclusion or removal of HTM...
We propose a new approach, CCRBoost, to identify the hierarchical structure of spatio-temporal patterns at different resolution levels and subsequently construct a predictive model based on the identified structure. To accomplish this, we first obtain indicators within different spatio-temporal spaces from the raw data. A distributed spatio-tempora...
This paper introduces a method for mining co-occurring events from longitudinal data, and applies this method to detecting adverse drug reactions (ADRs) from patient data. Electronic health records are richer than older data sources (such as spontaneous report records) and thus are ideal for ADR mining. However, current data mining methods, such as...
Authorship identification is the task of identifying the authors of anonymous texts given examples of the authors' writing. The increasingly large volume of anonymous text on the Internet makes authorship identification an urgent necessity. It has been applied to more and more practical applications including literary works, intellig...
Mining frequent patterns with periodic wildcard gaps is a critical data mining problem for dealing with complex real-world problems. This problem can be described as follows: given a subject sequence, a pre-specified threshold, and a variable gap-length with wildcards between each two consecutive letters, the task is to find all frequent patterns with...
Recent approaches to crater detection have been inspired by face detection’s use of gray-scale texture features. Using gray-scale texture features for supervised machine learning crater detection algorithms provides better classification of craters in planetary images than previous methods. When using Haar features it is typical to generate thousan...
Crime forecasting is notoriously difficult. A crime incident is a multi-dimensional complex phenomenon that is closely associated with temporal, spatial, societal, and ecological factors. In an attempt to utilize all these factors in crime pattern formulation, we propose a new feature construction and feature selection framework for crime forecasti...
Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem th...
The technique of Hotspot Mapping is widely used in analysing the spatial characteristics of crimes. The spatial distribution of crime is considered to be related to a variety of socio-economic and crime opportunity factors. But existing methods usually focus on the target crime density as input without utilizing these related factors. In this stu...
It is a nontrivial task to build an accurate emerging pattern (EP) classifier from high-dimensional data because we inevitably face two challenges: 1) how to efficiently extract a minimal set of strongly predictive EPs from an explosive number of candidate patterns, and 2) how to handle the highly sensitive choice of the minimal support threshold. T...
The ultimate goal of distance metric learning is to use discriminative information to keep data samples in the same class close, and those in different classes separate. Local distance metric methods can preserve discriminative information by considering neighborhood influence. We propose a discriminative distance metric approach by maximizing loca...
Group feature selection makes use of structural information among features to discover a meaningful subset of features. Existing group feature selection algorithms only deal with pre-given candidate feature sets and they are incapable of handling streaming features. On the other hand, feature selection algorithms targeted for streaming features can...
In faithful Bayesian networks, the Markov blanket of the class attribute is a unique and minimal feature subset for optimal feature selection. However, little attention has been paid to Markov blanket feature selection in a non-faithful environment which widely exists in the real world. To tackle this issue, in this paper, we deal with non-faithful...
Information on the World Wide Web is congested with large amounts of news contents. Recommending, filtering, and summarization of Web news have become hot topics of research in Web intelligence, aiming to find interesting news for users and give concise content for reading. This paper presents our research on developing the Personalized News Filter...
The least squares problem is one of the most important regression problems in statistics, machine learning and data mining. In this paper, we present the Constrained Stochastic Gradient Descent (CSGD) algorithm to solve the large-scale least squares problem. CSGD improves the Stochastic Gradient Descent (SGD) by imposing a provable constraint that...
The development of disastrous flood forecasting techniques able to provide warnings at a long lead-time (5-15 days) is of great importance to society. An extreme flood is usually the consequence of a sequence of precipitation events occurring over several days to several weeks. Though precise short-term forecasting of the magnitude and extent of indiv...
The ultimate goal of distance metric learning is to incorporate abundant discriminative information to keep all data samples in the same class close and those from different classes separated. Local distance metric methods can preserve discriminative information by considering the neighborhood influence. In this paper, we propose a new local discri...
Standard feature selection algorithms deal with given candidate feature sets at the individual feature level. When features exhibit certain group structures, it is beneficial to conduct feature selection in a grouped manner. For high-dimensional features, it could be far preferable to generate and process features online, one at a time, rather t...
We propose a new online feature selection framework for applications with streaming features where the knowledge of the full feature space is unknown in advance. We define streaming features as features that flow in one by one over time whereas the number of training examples remains fixed. This is in contrast with traditional online learning metho...
Crime tends to cluster geographically. This has led to the wide usage of hotspot analysis to identify and visualize crime. Accurately identified crime hotspots can greatly benefit the public by creating accurate threat visualizations, more efficiently allocating police resources, and predicting crime. Yet existing mapping methods usually identify h...
Surveying a large number of small sub-kilometer craters in planetary images is a challenging task due to their non-distinguishable features. In this paper, we integrate the LASSO (Least Absolute Shrinkage and Selection Operator) method with the Bayesian network classifier and propose an L1 Regularized Bayesian Network Classifier (L1-BNC) algorithm...
This paper introduces entropy quad-trees, which are structures derived from quad-trees by allowing nodes to split only when those correspond to sufficiently complex sub-domains of a data domain. Complexity is evaluated using an information-theoretic measure based on the analysis of the entropy associated to sets of objects designated by nodes. An a...
Association mining aims to find valid correlations among data attributes, and has been widely applied to many areas of data analysis. This paper presents a semantic network-based association analysis model including three spreading activation methods. It applies this model to assess the quality of a dataset, and generate semantically valid new hypo...
The emergence of social tagging and crowdsourcing systems provides a unique platform where multiple weak labelers can form a crowd to fulfill a labeling task. Yet crowd labelers are often noisy, inaccurate, and have limited labeling knowledge, and worst of all, they act independently without seeking complementary knowledge from each other to improv...
Counting craters is a fundamental task of planetary science, because it provides the only tool for measuring relative ages of planetary surfaces. However, advances in surveying craters present in data gathered by planetary probes have not kept up with advances in data collection. It becomes extremely challenging to automatically count a very large...
Causal discovery is highly desirable in science and technology. In this paper, we study a new research problem of discovery of causal relationships in the context of streaming features, where the features stream in one by one. With a Bayesian network to represent causal relationships, we propose a novel algorithm called causal discovery from streami...
Building an accurate emerging pattern classifier with a high-dimensional dataset is a challenging issue. The problem becomes even more difficult if the whole feature space is unavailable before learning starts. This paper presents a new technique on mining emerging patterns using streaming feature selection. We model high feature dimensions with st...
In stock markets, an emerging challenge for surveillance is that a group of hidden manipulators collaborate with each other to manipulate the price movement of securities. Recently, the coupled hidden Markov model (CHMM)-based coupled behavior analysis (CBA) has been proposed to consider the coupling relationships in the above group-based behaviors...
Word sense disambiguation is the process of determining which sense of a word is used in a given context. Due to its importance in understanding semantics and many real-world applications, word sense disambiguation has been extensively studied in Natural Language Processing and Computational Linguistics. However, existing methods either narrowly fo...
Criminal activities are unevenly distributed over space. The concept of hotspots is widely used to analyze the spatial characteristics of crimes. But existing methods usually identify hotspots based on an arbitrary user-defined threshold with respect to the number of a target crime without considering underlying controlling factors. In this study we in...
We apply techniques that originate in the analysis of market basket data sets to the study of frequent trajectories in graphs. Trajectories are defined as simple paths through a directed graph, and we put forth some definitions and observations about the calculation of supports of paths in this context. A simple algorithm for calculating path suppo...
Counting craters is a paramount tool of planetary analysis because it provides relative dating of planetary surfaces. Dating surfaces with high spatial resolution requires counting a very large number of small, sub-kilometer size craters. Exhaustive manual surveys of such craters over extensive regions are impractical, sparking interest in designin...
Knowledge representation is essential for semantics modeling and intelligent information processing. For decades researchers have proposed many knowledge representation techniques. However, it is a daunting problem how to capture deep semantic information effectively and support the construction of a large-scale knowledge base efficiently. This pap...
Police agencies have been collecting an increasing amount of information to better understand patterns in criminal activity. Recently there has been a new trend of using the collected data to predict where and when crime will occur. Crime prediction is greatly beneficial because if it is done accurately, police practitioners would be able to allocate reso...
Using gray-scale texture features has recently become a new trend in supervised machine learning crater detection algorithms. To provide better classification of craters in planetary images, feature subset selection is used to reduce irrelevant and redundant features. Feature selection is known to be NP-hard. To provide an efficient suboptimal solu...
Crime is classically “unpredictable”. It is not necessarily random, but neither does it take place consistently in space or time. A better theoretical understanding is needed to facilitate practical crime prevention solutions that correspond to specific places and times. In this study, we discuss the preliminary results of a crime forecasting m...
Associative classifiers have received considerable attention due to their easy to understand models and promising performance. However, with a high dimensional dataset, associative classifiers inevitably face two challenges: (1) how to extract a minimal set of strong predictive rules from an explosive number of generated association rules, and (2)...
Information on the World Wide Web is congested with large amounts of news contents. Recommendation, filtering, and summarization of Web news have received much attention in Web intelligence, aiming to find interesting news and summarize concise content for users. In this paper, we present our research on developing the Personalized News Filtering a...
Counting craters is a fundamental task of planetary science because it provides the only tool for measuring relative ages of planetary surfaces. However, advances in surveying craters present in data gathered by planetary probes have not kept up with advances in data collection. One challenge of auto-detecting craters in images is to identify an im...
Counting craters is a fundamental task of planetary science, because it provides the only tool for measuring relative ages of planetary surfaces. In this paper, we combine active learning with semi-supervised learning to build a new semi-supervised active class selection system for crater detection from high resolution panchromatic planetary image...
Crater detection from panchromatic images has unique challenges compared with traditional object detection tasks. Craters are numerous, have a large range of sizes and textures, and continuously merge into image backgrounds. Traditional feature construction methods for describing craters cannot fully capture the diversified characte...
In this paper, we formulate a new research problem of concept learning and summarization for one-class data streams. The main objectives are to (1) allow users to label instance groups, instead of single instances, as positive samples for learning, and (2) summarize concepts labeled by users over the whole stream. The employment of the batch-labeli...
Modeling spatially distributed phenomena in terms of their controlling factors is a recurring problem in geoscience. Most efforts concentrate on predicting the value of a response variable in terms of controlling variables either through a physical model or a regression model. However, many geospatial systems comprise complex, nonlinear, and spatiall...
Counting craters in remotely sensed images is the only tool that provides relative dating of remote planetary surfaces. Surveying craters requires counting a large number of small subkilometer craters, which calls for highly efficient automatic crater detection. In this article, we present an integrated framework on autodetection of subkilometer cr...
Advances in database and data acquisition technologies have resulted in an immense amount of spatial data, much of which cannot be readily explored using traditional data analysis techniques. The goal of spatial data mining is to automate the extraction of interesting and useful patterns that are not explicitly represented in spatial datasets. The...
Our strategy for automatic crater detection consists of employing a cascading AdaBoost classifier for identification of craters in images, and using the SOM as an active learning tool to minimize the number of image examples that need to be labeled by an analyst.
In this paper, we study a new research problem of causal discovery from streaming features. A unique characteristic of streaming features is that not all features can be available before learning begins. Feature generation and selection often have to be interleaved. Managing streaming features has been extensively studied in classification, but lit...
We use an association analysis-based strategy for exploration of multi-attribute spatial datasets possessing naturally arising classification. In this demonstration, we present a prototype system, ESTATE (Exploring Spatial daTa Association patTErns), inverting such classification by interpreting different classes found in the dataset in terms of se...
Identifying impact craters on planetary surfaces is one fundamental task in planetary science. In this paper, we present an embedded framework on auto-detection of craters, using feature selection and boosting strategies. The paradigm aims at building a universal and practical crater detector. This methodology addresses three issues that such a too...
We propose an association analysis-based strategy for exploration of multi-attribute spatial datasets possessing naturally arising classification. The proposed strategy, ESTATE (Exploring Spatial daTa Association patTErns), inverts such classification by interpreting different classes found in the dataset in terms of sets of discriminative patterns of...
We study an interesting and challenging problem, online streaming feature selection, in which the size of the feature set is unknown, and not all features are available for learning while leaving the number of observations constant. In this problem, the candidate features arrive one at a time, and the learner's task is to select a “best so far...
This paper introduces an algorithm for capturing high complexity regions of a data domain. In this work, we focus on domains in R². In particular, we analyze 2-dimensional image domains. Two different methods for mining are considered. The first method performs an information-theoretic analysis based on entropy to find diverse areas. The...
Craters are important geographical features caused by the impacts of meteoroids. Craters have been widely studied because they contain crucial information about the age and geologic formations of planets. This paper discusses an automated crater-detection framework using knowledge discovery and data mining (KDD) process including sampling, feature...
Association analysis provides a natural, data-centric framework for the discovery of patterns of explanatory variables that are linked to a certain outcome. In this paper we demonstrate how such a framework can be applied for political analysis, using an expository example of discovering different spatio-social motifs of support for Barack Obama in...
A new algorithm utilizes shape and texture information in high resolution orbital images to detect subkilometer craters on Mars with a detection rate >82%.
This project aims to overcome the access barriers to virtual worlds for motor- and speech-impaired users by building a gaze-controlled interface for Second Life that will enable them to interact with the virtual world by just moving their eyes. We have conducted a study to assess (1) the facilitation of gaze-controlled text input using word predict...
Knowledge plays a central role in intelligent systems. Manual knowledge acquisition is very inefficient and expensive. In this paper, we present (1) an automatic method to acquire a large amount of lexical-dependency knowledge, and (2) an innovative knowledge representation model to effectively minimize the impact of noise and improve knowledg...
Nowadays many companies and public organizations use powerful database systems for collecting and managing information. Huge numbers of data records are often accumulated within a short period of time. Valuable information is embedded in these data, which could help discover interesting knowledge and significantly assist in the decision-making process....
Feature-based hot spots are localized regions where the attributes of objects attain high values. There is considerable interest in automatic identification of feature-based hot spots. This paper approaches the problem of finding feature-based hot spots from a data mining perspective, and describes a method that relies on supervised clustering to p...
Large amounts of remotely sensed data call for data mining techniques to fully utilize their rich information content. In this paper, we study new means of discovery and summarization of knowledge contained in the spatial patterns of remote sensing datasets. Several geospatial feature variables are fused together, and the vector of their val...
Word sense disambiguation is the process of determining which sense of a word is used in a given context. Due to its importance in understanding semantics of natural languages, word sense disambiguation has been extensively studied in Computational Linguistics. However, existing methods either are brittle and narrowly focus on specific topics o...
Word classification is of significant interest in the domain of natural language processing and it has direct applications in information retrieval and knowledge discovery. This paper presents an experimental method using Naïve Bayes for word classification. The method is based on combining successful feature selection techniques on Mutual Informatio...
Efficient means of determining factors controlling spatial distribution of an environmental class variable are of significant interest in Earth science. In this paper, we present a method for automated discovery of controlling factors by mining for emerging patterns in a database constructed from the fusion of several explanatory datasets. We intro...
This paper proposes a novel framework for mining regional co-location patterns with respect to sets of continuous variables in spatial datasets. The goal is to identify regions in which multiple continuous variables with values from the wings of their statistical distribution are co-located. A co-location mining framework is introduced that operate...
This paper presents a novel region discovery framework geared towards finding scientifically interesting places in spatial datasets. We view region discovery as a clustering problem in which an externally given fitness function has to be maximized. The framework adapts four representative clustering algorithms, exemplifying prototype-based, grid-ba...
A special challenge for spatial data mining is that information is not distributed uniformly in spatial data sets. Consequently, the discovery of regional knowledge is of fundamental importance. Unfortunately, regional patterns frequently fail to be discovered due to insufficient global confidence and/or support in traditional association rule mi...