Xingquan Zhu

Florida Atlantic University, Boca Raton, Florida, United States

Publications (187) · 98.6 Total impact

  • Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang
    ABSTRACT: Many applications involve stream data with structural dependency, graph representations, and continuously increasing volumes. For these applications, it is very common that their class distributions are imbalanced, with minority (or positive) samples being only a small portion of the population, which imposes significant challenges for learning models to accurately identify minority samples. This problem is further complicated by the presence of noise, because noisy samples are similar to minority samples and any treatment for the class imbalance may falsely focus on the noise and result in deterioration of accuracy. In this paper, we propose a classification model to tackle imbalanced graph streams with noise. Our method, graph ensemble boosting, employs an ensemble-based framework to partition a graph stream into chunks, each containing a number of noisy graphs with imbalanced class distributions. For each individual chunk, we propose a boosting algorithm to combine discriminative subgraph pattern selection and model learning as a unified framework for graph classification. To tackle concept drifting in graph streams, an instance-level weighting mechanism is used to dynamically adjust the instance weight, through which the boosting framework can emphasize difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-life imbalanced graph streams demonstrate clear benefits of our boosting design for handling imbalanced noisy graph streams.
    04/2015; 45(5):940-954. DOI:10.1109/TCYB.2014.2341031
  • Dataset: iSRD
    Hamzah Al Najada, Xingquan Zhu
  • ABSTRACT: Naive Bayes (NB) is a popular machine learning tool for classification, due to its simplicity, high computational efficiency, and good classification accuracy, especially for high dimensional data such as texts. In reality, the pronounced advantage of NB is often challenged by the strong conditional independence assumption between attributes, which may deteriorate the classification performance. Accordingly, numerous efforts have been made to improve NB, by using approaches such as structure extension, attribute selection, attribute weighting, instance weighting, local learning and so on. In this paper, we propose a new Artificial Immune System (AIS) based self-adaptive attribute weighting method for Naive Bayes classification. The proposed method, namely AISWNB, uses immunity theory in Artificial Immune Systems to search for optimal attribute weight values, where self-adjusted weight values alleviate the conditional independence assumption and help calculate the conditional probability accurately. One noticeable advantage of AISWNB is that the unique immune system based evolutionary computation process, including initialization, clone, selection, and mutation, ensures that AISWNB can adjust itself to the data without explicit specification of functional or distributional forms of the underlying model. As a result, AISWNB can obtain good attribute weight values during the learning process. Experiments and comparisons on 36 machine learning benchmark data sets and six image classification data sets demonstrate that AISWNB significantly outperforms its peers in classification accuracy, class probability estimation, and class ranking performance.
    Expert Systems with Applications 02/2015; 42(3):1487–1502. DOI:10.1016/j.eswa.2014.09.019 · 1.97 Impact Factor
  • ABSTRACT: Ensemble learning is a common tool for data stream classification, mainly because of its inherent advantages in handling large volumes of stream data and concept drifting. Previous studies, to date, have been primarily focused on building accurate ensemble models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during prediction incurs significant costs in response time, preventing ensemble learning from being practical for many real-world time-critical data stream applications, such as Web traffic stream monitoring, spam detection, and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is necessary to classify each stream record in a timely manner. To address this problem, we propose a novel Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree like height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the other hand, E-trees can be automatically updated by continuously integrating new classifiers and discarding outdated ones, adapting well to new trends and patterns underlying the data streams. Theoretical analysis and empirical studies on both synthetic and real-world data streams demonstrate the performance of our approach.
    IEEE Transactions on Knowledge and Data Engineering 02/2015; 27(2):461-474. DOI:10.1109/TKDE.2014.2298018 · 1.82 Impact Factor
  • Shirui Pan, Jia Wu, Xingquan Zhu
    ABSTRACT: Graph classification has drawn great interest in recent years due to the increasing number of applications involving objects with complex structure relationships. To date, all existing graph classification algorithms assume, explicitly or implicitly, that misclassifying instances in different classes incurs an equal amount of cost (or risk), which is often not the case in real-life applications (where misclassifying a certain class of samples, such as diseased patients, is subject to more expensive costs than others). Although cost-sensitive learning has been extensively studied, all methods are based on data with instance-feature representation. Graphs, however, do not have features available for learning, and the feature space of graph data is likely infinite and needs to be carefully explored in order to favor classes with a higher cost. In this paper, we propose CogBoost, a fast cost-sensitive graph classification algorithm, which aims to minimize the misclassification costs (instead of the errors) and achieve fast learning speed for large scale graph datasets. To minimize the misclassification costs, CogBoost iteratively selects the most discriminative subgraph by considering costs of different classes, and then solves a linear programming problem in each iteration by using a Bayes decision rule based optimal loss function. In addition, a cutting plane algorithm is derived to speed up the solving of linear programs for fast learning on large graph datasets. Experiments and comparisons on real-world large graph datasets demonstrate the effectiveness and the efficiency of our algorithm.
    IEEE Transactions on Knowledge and Data Engineering 01/2015; DOI:10.1109/TKDE.2015.2391115 · 1.82 Impact Factor
  • Meng Fang, Jie Yin, Xingquan Zhu, Chengqi Zhang
    IEEE Transactions on Knowledge and Data Engineering 01/2015; DOI:10.1109/TKDE.2015.2413789 · 1.82 Impact Factor
  • Boyu Li, Ting Guo, Xingquan Zhu, Zhanshan Li
    ABSTRACT: Model-based diagnosis in discrete event systems (DESs) is a major research topic in failure diagnosis, where diagnosability plays an important role in the construction of the diagnosis engine. To improve the solution efficiency for diagnosability, this paper proposes novel techniques to solve the problems of testing and optimizing for diagnosability. We propose a new concept, reverse twin plant, which is generated backwards from the final states of the DESs so there is no need to generate a complete copy of the DES model to determine the diagnosability. Such a design makes our testing algorithm much faster than existing methods. An efficient optimizing algorithm, which makes a non-diagnosable system diagnosable, is also proposed in the paper by expanding the minimal observable space with operation on just a part of the DES model. Examples and theoretical studies demonstrate the performance of the proposed designs.
    Engineering Applications of Artificial Intelligence 11/2014; DOI:10.1016/j.engappai.2014.10.007 · 1.96 Impact Factor
  • Jia Wu, Xingquan Zhu, Chengqi Zhang, P.S. Yu
    ABSTRACT: This paper formulates a multi-graph learning task. In our problem setting, a bag contains a number of graphs and a class label. A bag is labeled positive if at least one graph in the bag is positive, and negative otherwise. In addition, the genuine label of each graph in a positive bag is unknown, and all graphs in a negative bag are negative. The aim of multi-graph learning is to build a learning model from a number of labeled training bags to predict previously unseen test bags with maximum accuracy. This problem setting is essentially different from existing multi-instance learning (MIL), where instances in MIL share well-defined feature values, but no features are available to represent graphs in a multi-graph bag. To solve the problem, we propose a Multi-Graph Feature based Learning (gMGFL) algorithm that explores and selects a set of discriminative subgraphs as features to transfer each bag into a single instance, with the bag label being propagated to the transferred instance. As a result, the multi-graph bags form a labeled training instance set, so generic learning algorithms, such as decision trees, can be used to derive learning models for multi-graph classification. Experiments and comparisons on real-world multi-graph tasks demonstrate the algorithm performance.
    IEEE Transactions on Knowledge and Data Engineering 10/2014; 26(10):2382-2396. DOI:10.1109/TKDE.2013.2297923 · 1.82 Impact Factor
  • Hamzah Al Najada, Xingquan Zhu
    ABSTRACT: The Internet plays an essential role in modern information systems. Applications such as e-commerce websites are becoming widely available for people to purchase different types of products online. During such an online shopping process, users often rely on online reviews from previous customers to make the final decision. Because online reviews play an essential role in the selling of online products (or services), some vendors (or customers) provide fake/spam reviews to mislead customers. Any false reviews of the products may result in unfair market competition and financial loss for the customers or vendors. In this research, we aim to distinguish between spam and non-spam reviews by using supervised classification methods. When training a classifier to identify spam vs. non-spam reviews, a challenging issue is that spam reviews are only a very small portion of the online reviews. This naturally leads to a data imbalance issue for training classifiers for spam review detection, where learning methods that do not emphasize minority samples (i.e., spam reviews) may perform poorly in detecting spam reviews (although the overall accuracy of the algorithm might be relatively high). To tackle the challenge, we employ a bagging-based approach to build a number of balanced datasets, through which we can train a set of spam classifiers and use their ensemble to detect review spam. Experiments and comparisons demonstrate that our method, iSRD, outperforms baseline methods for review spam detection.
    IEEE IRI 2014, San Francisco, California, USA; 08/2014
  • Bin Li, Xingquan Zhu, Ruijiang Li, Chengqi Zhang
    ABSTRACT: Cross-domain collaborative filtering (CF) aims to share common rating knowledge across multiple related CF domains to boost the CF performance. In this paper, we view CF domains as a 2-D site-time coordinate system, on which multiple related domains, such as similar recommender sites or successive time-slices, can share group-level rating patterns. We propose a unified framework for cross-domain CF over the site-time coordinate system by sharing group-level rating patterns and imposing user/item dependence across domains. A generative model, say ratings over site-time (ROST), which can generate and predict ratings for multiple related CF domains, is developed as the basic model for the framework. We further introduce cross-domain user/item dependence into ROST and extend it to two real-world cross-domain CF scenarios: 1) ROST (sites) for alleviating rating sparsity in the target domain, where multiple similar sites are viewed as related CF domains and some items in the target domain depend on their correspondences in the related ones; and 2) ROST (time) for modeling user-interest drift over time, where a series of time-slices are viewed as related CF domains and a user at current time-slice depends on herself in the previous time-slice. All these ROST models are instances of the proposed unified framework. The experimental results show that ROST (sites) can effectively alleviate the sparsity problem to improve rating prediction performance and ROST (time) can clearly track and visualize user-interest drift over time.
    08/2014; 45(5). DOI:10.1109/TCYB.2014.2343982
  • Jia Wu, Shirui Pan, Xingquan Zhu, Zhihua Cai
    ABSTRACT: In this paper, we formulate a novel graph-based learning problem, multi-graph classification (MGC), which aims to learn a classifier from a set of labeled bags, each containing a number of graphs. A bag is labeled positive if at least one graph in the bag is positive, and negative otherwise. Such a multi-graph representation can be used for many real-world applications, such as webpage classification, where a webpage can be regarded as a bag with texts and images inside the webpage being represented as graphs. This problem is a generalization of multi-instance learning (MIL) but with vital differences, mainly because instances in MIL share a common feature space whereas no feature is available to represent graphs in a multi-graph bag. To solve the problem, we propose a boosting based multi-graph classification framework (bMGC). Given a set of labeled multi-graph bags, bMGC employs dynamic weight adjustment at both bag- and graph-levels to select one subgraph in each iteration as a weak classifier. In each iteration, bag and graph weights are adjusted such that an incorrectly classified bag will receive a higher weight because its predicted bag label conflicts with the genuine label, whereas an incorrectly classified graph will receive a lower weight value if the graph is in a positive bag (or a higher weight if the graph is in a negative bag). Accordingly, bMGC is able to differentiate graphs in positive and negative bags to derive effective classifiers to form a boosting model for MGC. Experiments and comparisons on real-world multi-graph learning tasks demonstrate the algorithm performance.
    07/2014; 45(3). DOI:10.1109/TCYB.2014.2327111
  • Meng Fang, Xingquan Zhu
    ABSTRACT: Traditional active learning assumes that the labeler is capable of providing ground truth label for each queried instance. In reality, a labeler might not have sufficient knowledge to label a queried instance but can only guess the label with his/her best knowledge. As a result, the label provided by the labeler, who is regarded to have uncertain labeling knowledge, might be incorrect. In this paper, we formulate this problem as a new “uncertain labeling knowledge” based active learning paradigm, and our key is to characterize the knowledge set of each labeler for active learning. By taking each unlabeled instance’s information and its likelihood of belonging to the uncertain knowledge set as a whole, we define an objective function to ensure that each queried instance is the most informative one for labeling and the labeler should also have sufficient knowledge to label the instance. To ensure label quality, we propose to use diversity density to characterize a labeler’s uncertain knowledge and further employ an error-reduction-based mechanism to either accept or decline a labeler’s label on uncertain instances. Experiments demonstrate the effectiveness of the proposed algorithm for real-world active learning tasks with uncertain labeling knowledge.
    Pattern Recognition Letters 07/2014; 43:98–108. DOI:10.1016/j.patrec.2013.10.011 · 1.06 Impact Factor
  • Yifan Fu, Bin Li, Xingquan Zhu, Chengqi Zhang
    ABSTRACT: Traditional active learning methods require the labeler to provide a class label for each queried instance. The labelers are normally highly skilled domain experts to ensure the correctness of the provided labels, which in turn results in expensive labeling cost. To reduce labeling cost, an alternative solution is to allow nonexpert labelers to carry out the labeling task without explicitly telling the class label of each queried instance. In this paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked “whether a pair of instances belong to the same class”, namely, a pairwise label homogeneity. Under such circumstances, our active learning goal is twofold: (1) decide which pair of instances should be selected for query, and (2) how to make use of the pairwise homogeneity information to improve the active learner. To achieve the goal, we propose a “Pairwise Query on Max-flow Paths” strategy to query pairwise label homogeneity from a nonexpert labeler, whose query results are further used to dynamically update a Min-cut model (to differentiate instances in different classes). In addition, a “Confidence-based Data Selection” measure is used to evaluate data utility based on the Min-cut model’s prediction results. The selected instances, with inferred class labels, are included into the labeled set to form a closed-loop active learning process. Experimental results and comparisons with state-of-the-art methods demonstrate that our new active learning paradigm can result in good performance with nonexpert labelers.
    IEEE Transactions on Knowledge and Data Engineering 04/2014; 26(4):808-822. DOI:10.1109/TKDE.2013.165 · 1.82 Impact Factor
  • Meng Fang, Jie Yin, Xingquan Zhu
    ABSTRACT: This paper addresses the problem of transferring useful knowledge from a source network to predict node labels in a newly formed target network. While existing transfer learning research has primarily focused on vector-based data, in which the instances are assumed to be independent and identically distributed, how to effectively transfer knowledge across different information networks has not been well studied, mainly because networks may have their distinct node features and link relationships between nodes. In this paper, we propose a new transfer learning algorithm that attempts to transfer common latent structure features across the source and target networks. The proposed algorithm discovers these latent features by constructing label propagation matrices in the source and target networks, and mapping them into a shared latent feature space. The latent features capture common structure patterns shared by two networks, and serve as domain-independent features to be transferred between networks. Together with domain-dependent node features, we thereafter propose an iterative classification algorithm that leverages label correlations to predict node labels in the target network. Experiments on real-world networks demonstrate that our proposed algorithm can successfully achieve knowledge transfer between networks to help improve the accuracy of classifying nodes in the target network.
    03/2014; DOI:10.1109/ICDM.2013.116
  • Guohua Liang, Xingquan Zhu, Chengqi Zhang
    ABSTRACT: Many real world applications involve highly imbalanced class distribution. Research into learning from imbalanced class distribution is considered to be one of ten challenging problems in data mining research, and it has increasingly captured the attention of both academia and industry. In this work, we study the effects of different levels of imbalanced class distribution on bagging predictors by using under-sampling techniques. Despite the popularity of bagging in many real-world applications, some questions have not been clearly answered in the existing research, such as the effect of varying the levels of class distribution on different bagging predictors, e.g., whether bagging is superior to single learners when the levels of class distribution change. Most classification learning algorithms are designed to maximize the overall accuracy rate and assume that training instances are uniformly distributed; however, the overall accuracy does not represent correct prediction on the minority class, which is the class of interest to users. The overall accuracy metric is therefore ineffective for evaluating the performance of classifiers in extremely imbalanced data. This study investigates the effect of varying levels of class distribution on different bagging predictors based on the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) as a performance metric, using an under-sampling technique on 14 data-sets with imbalanced class distributions. Our experimental results indicate that Decision Table (DTable) and RepTree are the learning algorithms with the best bagging AUC performance. The AUC performances of bagging predictors are statistically superior to single learners, with the exception of Support Vector Machines (SVM) and Decision Stump (DStump).
    02/2014; 5(1):63-71. DOI:10.1007/s13042-012-0125-5
  • Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding
    ABSTRACT: Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.
    IEEE Transactions on Knowledge and Data Engineering 01/2014; 26(1):97-107. DOI:10.1109/TKDE.2013.109 · 1.82 Impact Factor
  • Jia Wu, Xingquan Zhu, Chengqi Zhang, Zhihua Cai
    ABSTRACT: Multi-instance learning concerns building learning models from a number of labeled instance bags, where each bag consists of instances with unknown labels. A bag is labeled positive if one or more instances inside the bag are positive, and negative otherwise. All existing multi-instance learning algorithms are only applicable to the setting where instances in each bag are represented by a set of well-defined feature values. In this paper, we advance the problem to a multi-instance multi-graph setting, where a bag contains a number of instances and graphs in pairs, and the learning objective is to derive classification models from labeled bags, containing both instances and graphs, to predict previously unseen bags with maximum accuracy. To achieve the goal, the main challenge is to properly represent graphs inside each bag and further take advantage of complementary information between instance and graph pairs for learning. In the paper, we propose a Dual Embedding Multi-Instance Multi-Graph Learning (DE-MIMG) algorithm, which employs a dual embedding learning approach to (1) embed instance distributions into the informative subgraph discovery process, and (2) embed discovered subgraphs into the instance feature selection process. The dual embedding process results in an optimal representation for each bag to provide combined instance and graph information for learning. Experiments and comparisons on real-world multi-instance multi-graph learning tasks demonstrate the algorithm performance.
    2013 IEEE International Conference on Data Mining (ICDM); 12/2013
  • Hanning Yuan, Meng Fang, Xingquan Zhu
    ABSTRACT: In this paper, we propose a Hierarchical Sampling-based Multi-Instance ensemble LEarning (HSMILE) method. Because of the nature of multi-instance learning, where a positive bag contains at least one positive instance whereas samples (instance and sample are interchangeable terms in this paper) in a negative bag are all negative, simply applying bootstrap sampling to individual bags may severely damage a positive bag, because a sampled positive bag may not contain any positive sample at all. To solve the problem, we propose to calculate probable positive sample distributions in each positive bag and use the distributions to preserve at least one positive instance in a sampled bag. The hierarchical sampling involves inter- and intra-bag sampling to adequately perturb bootstrap sample sets for multi-instance ensemble learning. Theoretical analysis and experiments confirm that HSMILE outperforms existing multi-instance ensemble learning methods.
    IEEE Transactions on Knowledge and Data Engineering 12/2013; 25(12):2900-2905. DOI:10.1109/TKDE.2012.245 · 1.82 Impact Factor
  • ABSTRACT: Influence maximization, defined as finding a small subset of nodes that maximizes the spread of influence in social networks, is NP-hard under both the Linear Threshold (LT) and Independent Cascade (IC) models, for which a line of greedy/heuristic algorithms has been proposed. The simple greedy algorithm [14] achieves an approximation ratio of 1-1/e. The advanced CELF algorithm [16], by exploiting the submodular property of the spread function, runs 700 times faster than the simple greedy algorithm on average. However, CELF is still inefficient [4], as the first iteration requires N spread estimations (N is the number of nodes in the network), which is computationally expensive, especially for large networks. To this end, in this paper we derive an upper bound function for the spread function. The bound can be used to reduce the number of Monte-Carlo simulation calls in greedy algorithms, especially in the first iteration of initialization. Based on the upper bound, we propose an efficient Upper Bound based Lazy Forward algorithm (UBLF for short), by incorporating the bound into the CELF algorithm. We test and compare our algorithm with prior algorithms on real-world data sets. Experimental results demonstrate that UBLF, compared with CELF, reduces the number of Monte-Carlo simulations by more than 95% and achieves a speedup of at least 2-5 times when the seed set is small.
    2013 IEEE International Conference on Data Mining (ICDM); 12/2013
  • ABSTRACT: Network centrality score is an important measure to assess the importance and the major roles that each node plays in a network. In addition, the centrality score is also vitally important in assessing the overall structure and connectivity of a network. In a narrow sense, nearly all network mining algorithms, such as social network community detection, link prediction, etc., involve certain types of centrality scores to some extent. Despite its importance, very few studies have empirically analyzed the robustness of these measures in different network environments. Existing works know very little about how network centrality scores behave at macro- (i.e. network) and micro- (i.e. individual node) levels. At the network level, what are the inherent connections between network topology structures and centrality scores? Will a sparse network be more (or less) robust in its centrality scores if any change is introduced to the network? At the individual node level, what types of nodes (high or low node degree) are more sensitive in their centrality scores when changes are imposed on the network? And which centrality score is more reliable in revealing the genuine network structures? In this paper, we empirically analyze the robustness of three types of centrality scores: Betweenness centrality score, Closeness centrality score, and Eigen-vector centrality score for various types of networks. We systematically introduce biased and unbiased changes to the networks, by adding and removing different percentages of edges and nodes, through which we can compare and analyze the robustness and sensitivity of each centrality score measurement. Our empirical studies draw important findings to help understand the behaviors of centrality scores in different social networks.
    Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence; 11/2013

Publication Stats

2k Citations
98.60 Total Impact Points

Institutions

  • 2006–2015
    • Florida Atlantic University
      • Department of Computer and Electrical Engineering and Computer Science
      Boca Raton, Florida, United States
  • 2009–2014
    • University of Technology Sydney
      • Centre for Quantum Computation and Intelligent Systems (QCIS)
      • Faculty of Engineering and Information Technology
      Sydney, New South Wales, Australia
  • 2013
    • University of Illinois at Chicago
      Chicago, Illinois, United States
  • 2003–2009
    • University of Vermont
      • Department of Computer Science
      Burlington, Vermont, United States
  • 2008
    • Hefei University of Technology
      • Department of Computer Science & Technology
      Hefei, Anhui, China
  • 2007
    • Chinese Academy of Sciences
      • Research Center for Cyber Economy and Knowledge Management
      Beijing, China
  • 2001–2004
    • University of North Carolina at Charlotte
      • Department of Computer Science
      Charlotte, NC, United States
  • 2001–2003
    • Purdue University
      • Department of Computer Science
      West Lafayette, IN, United States
  • 2000–2002
    • Fudan University
      • School of Computer Science
      Shanghai, China