ABSTRACT: With the advent of the big data era, efficiently and effectively querying useful information on the Web, the largest heterogeneous data source in the world, is becoming increasingly challenging. Page ranking is an essential component of search engines because it determines the presentation order of the tens of millions of pages returned for a single query; it therefore plays a significant role in the search quality and user experience of information retrieval. When measuring the authority of a web page, most methods focus on the quantity and quality of the neighborhood pages that point to it via inbound hyperlinks. However, these methods ignore the diversity of such neighborhood pages, which we believe is an important metric for objectively evaluating web page authority. In contrast to true authority pages, which usually attract a large number of inbound hyperlinks from a wide variety of sources, fake authorities that boost their rank with techniques such as link farms find it prohibitively expensive to attain a comparable diversity of inbound hyperlinks. We propose a probabilistic counting-based method to quantitatively and efficiently compute the diversity of inbound hyperlinks. We then propose a novel link-based ranking algorithm, named Drank, to rank pages by simultaneously analyzing the quantity, quality and diversity of their inbound hyperlinks. Validations on both synthetic and real-world data show that Drank outperforms other state-of-the-art methods in terms of both finding high-quality pages and suppressing web spam.
Knowledge-Based Systems 01/2015; 77. DOI:10.1016/j.knosys.2014.12.028 · 2.95 Impact Factor
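The probabilistic counting step can be illustrated with a Flajolet-Martin style distinct-count sketch (a standard technique used here as a stand-in; the paper's own estimator may differ). Diversity is approximated by the number of distinct inbound-link sources, so duplicated links from a link farm add nothing to the estimate:

```python
import hashlib

def _rho(x: int) -> int:
    """Position of the least-significant 1-bit (1-indexed)."""
    return (x & -x).bit_length()

def fm_distinct_estimate(items, num_hashes=32):
    """Flajolet-Martin style estimate of the number of distinct items.

    Averages the max rho over several salted hashes; 0.77351 is the
    standard FM correction factor.
    """
    total = 0.0
    for salt in range(num_hashes):
        max_rho = 0
        for item in items:
            digest = hashlib.sha1(f"{salt}:{item}".encode()).digest()
            x = int.from_bytes(digest[:8], "big") or 1
            max_rho = max(max_rho, _rho(x))
        total += max_rho
    return (2 ** (total / num_hashes)) / 0.77351

# Diversity of inbound links: count distinct source domains, not raw links.
inbound_sources = [f"domain{i}.example" for i in range(500)] * 4  # duplicated links
estimate = fm_distinct_estimate(inbound_sources)
```

Because the sketch depends only on which hash values occur, repeating a link any number of times leaves the estimate unchanged, which is exactly the property that makes link-farm inflation ineffective against a diversity measure.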
ABSTRACT: Feature selection plays an important role in machine-vision-based online detection of foreign fibers in cotton because it improves detection accuracy and speed. Feature sets of foreign fibers in cotton are multi-character feature sets: a high-quality feature set consists of three classes of features, namely color, texture and shape features. Such multi-character feature sets naturally carry a space constraint, which leads to a smaller feature space than a general feature set with the same number of features. However, existing algorithms do not consider this characteristic and treat multi-character feature sets as general feature sets. This paper proposes an improved ant colony optimization for feature selection, whose objective is to find (near-)optimal subsets within multi-character feature sets. In the proposed algorithm, a group constraint is adopted to limit the subset construction process and the probability transition, reducing the effect of invalid subsets and improving convergence efficiency. As a result, the algorithm can effectively find high-quality subsets in the feature space of multi-character feature sets. The proposed algorithm is tested on datasets of foreign fibers in cotton and compared with other methods. The experimental results show that it finds high-quality subsets of smaller size with high classification accuracy, which is very important for improving the performance of online detection systems for foreign fibers in cotton.
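The group constraint can be sketched as follows: each ant draws a fixed quota of features from every character group (color, texture, shape), so subsets that ignore a character class are never constructed. The group sizes, quotas and uniform pheromone values are illustrative assumptions:

```python
import random

# Hypothetical multi-character feature space: indices grouped by character.
GROUPS = {
    "color":   list(range(0, 8)),
    "texture": list(range(8, 20)),
    "shape":   list(range(20, 28)),
}

def construct_subset(pheromone, quota_per_group=2, rng=None):
    """One ant builds a subset, constrained to draw from every group.

    The group constraint keeps the ant inside the valid region of the
    multi-character feature space, so no invalid subsets are generated.
    """
    rng = rng or random.Random()
    subset = []
    for features in GROUPS.values():
        candidates = list(features)
        for _ in range(quota_per_group):
            weights = [pheromone[f] for f in candidates]  # probability transition
            pick = rng.choices(candidates, weights=weights, k=1)[0]
            subset.append(pick)
            candidates.remove(pick)  # no feature is selected twice
    return subset

pheromone = [1.0] * 28  # uniform pheromone for illustration
subset = construct_subset(pheromone, rng=random.Random(0))
```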
ABSTRACT: Proper parameter settings of the support vector machine (SVM) and feature selection are of great importance to its efficiency and accuracy. In this paper, we propose a parallel time-variant particle swarm optimization (TVPSO) algorithm to simultaneously perform parameter optimization and feature selection for SVM, termed PTVPSO-SVM. It is implemented in a parallel environment using Parallel Virtual Machine (PVM). In the proposed method, a weighted function is adopted to design the objective function of PSO, which simultaneously takes into account the average classification accuracy rate (ACC) of SVM, the number of support vectors (SVs) and the number of selected features. Furthermore, mutation operators are introduced to overcome premature convergence of the PSO algorithm, and an improved binary PSO is employed to enhance performance on the feature selection task. The performance of the proposed method is compared with that of other methods on a comprehensive set of 30 benchmark data sets. The empirical results demonstrate that the proposed method not only obtains more appropriate model parameters, a discriminative feature subset and smaller sets of SVs, but also significantly reduces the computational time while giving high predictive accuracy.
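The weighted objective can be sketched directly; the weights and the normalizing totals below are illustrative assumptions, not the paper's values:

```python
def pso_fitness(acc, n_svs, n_features, total_svs, total_features,
                w_acc=0.8, w_sv=0.1, w_feat=0.1):
    """Weighted PSO objective balancing accuracy against model complexity.

    acc is in [0, 1]; fewer support vectors and fewer selected features
    are rewarded. Weights are illustrative and sum to 1.
    """
    return (w_acc * acc
            + w_sv * (1.0 - n_svs / total_svs)
            + w_feat * (1.0 - n_features / total_features))

# Two candidates with equal accuracy: the sparser one scores higher.
sparse = pso_fitness(acc=0.90, n_svs=50, n_features=5, total_svs=200, total_features=30)
dense = pso_fitness(acc=0.90, n_svs=50, n_features=10, total_svs=200, total_features=30)
```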
ABSTRACT: Given a network and a group of target nodes, the task of proximity alignment is to find a sequence of nodes that are the most relevant to the targets in terms of the linkage structure of the network. Proximity alignment has important applications in many areas, such as online recommendation in e-commerce and infectious disease control in public healthcare. Despite great efforts made to design various metrics of similarity and centrality in terms of network structure, to the best of our knowledge no studies in the literature address proximity alignment by explicitly and adequately exploring the intrinsic connections between macroscopic community structure and microscopic node proximities. Yet the influence of community structure on proximity alignment is indispensable, not only because communities are ubiquitous in real-world networks but also because they characterize node proximity in a more natural way. In this work, a novel proximity alignment method called the PAA is proposed to address this problem. The PAA first decomposes the given network into communities based on its global structure and then computes node proximities based on the local structure of communities. In this way, the solution of the PAA is expected to be more reasonable, in the sense that both global and local relevance among nodes are sufficiently considered during proximity aligning. To handle large-scale networks, the PAA is implemented via a proposed online-offline schema, in which expensive computations such as community detection are done offline so that online queries can be answered quickly by calculating node proximities efficiently from indexed communities. The efficacy and the applications of the PAA have been validated and demonstrated. Our work shows that the PAA outperforms existing methods and enables us to explore real-world networks from a novel perspective.
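The online-offline schema can be sketched as follows; the community index and the proximity scoring rule are both illustrative stand-ins for the PAA's actual components:

```python
from collections import defaultdict

# Offline phase (expensive): communities are assumed to be precomputed by
# some community detection method and indexed by node.
COMMUNITIES = [{"a", "b", "c"}, {"d", "e"}, {"f", "g", "h", "i"}]
NODE2COMM = {n: i for i, comm in enumerate(COMMUNITIES) for n in comm}

def align_proximity(targets, edges):
    """Online phase (cheap): rank non-target nodes that share a community
    with a target, scored by their direct links to the targets.

    This is a toy scoring rule standing in for the PAA's proximity measure.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    target_comms = {NODE2COMM[t] for t in targets}
    scores = {}
    for comm_id in target_comms:          # only the targets' communities are scanned
        for node in COMMUNITIES[comm_id]:
            if node not in targets:
                scores[node] = len(neighbors[node] & set(targets))
    return sorted(scores, key=scores.get, reverse=True)

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("f", "g")]
ranking = align_proximity({"a"}, edges)
```

The design point the schema illustrates: the per-query work touches only the communities containing targets, not the whole network, which is what makes the online phase fast once the offline index exists.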
ABSTRACT: Discovery of communities in complex networks is a fundamental data analysis task in various domains. Generative models are a promising class of techniques for identifying modular properties of networks and have been actively discussed recently. However, most of them cannot preserve the degree sequence of networks, which distorts the community detection results. Rather than using a blockmodel, as most current works do, we generalize a configuration model, namely the null model of modularity, to solve this problem. By decomposing and combining sub-graphs according to soft community memberships, our model gains the ability to describe community structures, something the original model does not have. It also retains the property of the original model that the expected degree sequence is fixed to match that of the observed network. We combine the community property and degree sequence preservation into a single unified model, which gives better community results than other models. We learn the model using nonnegative matrix factorization and determine the number of communities by consensus clustering. We test this approach on both synthetic benchmarks and real-world networks, and compare it with two similar methods. The experimental results demonstrate the superior performance of our method over competing methods in detecting both disjoint and overlapping communities.
Journal of Statistical Mechanics Theory and Experiment 09/2013; 2013(09):P09013. DOI:10.1088/1742-5468/2013/09/P09013 · 2.40 Impact Factor
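The degree-preserving property of the configuration null model can be checked directly: with the expected number of edges between nodes i and j taken as k_i k_j / 2m, each node's expected degree equals its observed degree:

```python
# Under the configuration null model, the expected number of edges between
# nodes i and j is k_i * k_j / (2m). Summing over j recovers node i's
# degree, which is the degree-preserving property the model builds on.
degrees = [3, 2, 2, 1, 4]   # toy degree sequence
two_m = sum(degrees)        # 2m = sum of all degrees

expected_degree = [
    sum(k_i * k_j for k_j in degrees) / two_m  # includes the j = i term,
    for k_i in degrees                         # as in the standard null model
]
```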
ABSTRACT: Network communities are groups of nodes that are densely connected internally but sparsely connected to the rest of the network; network community mining algorithms aim at efficiently and effectively discovering all such communities from a given network. Many related methods have been proposed and applied to different areas, including social network analysis, gene network analysis and web clustering engines. Most of the existing methods for mining communities are centralized. In this paper, we present a multi-agent based decentralized algorithm, in which a group of autonomous agents work together to mine a network through a proposed self-aggregation and self-organization mechanism. Thanks to its decentralized nature, our method is potentially suitable for dealing with distributed networks, whose global structures are hard to obtain due to their geographical distributions, decentralized controls or huge sizes. The effectiveness of our method has been tested against different benchmark networks.
ABSTRACT: In this paper, we propose a multi-layer ant-based algorithm, MABA, which detects communities in networks by locally optimizing modularity using individual ants. The basic version of MABA, namely SABA, combines a self-avoiding label propagation technique with a simulated annealing strategy for ant diffusion in networks. Once communities are found by SABA, the method can be reapplied to a higher-level network in which each obtained community is regarded as a new vertex; repeating this process iteratively yields MABA. Thanks to the intrinsic multi-level nature of our algorithm, it possesses the potential to unfold multi-scale hierarchical structures, and it also mitigates the resolution limit of modularity. The proposed MABA has been evaluated on both computer-generated benchmarks and widely used real-world networks, and compared with a set of competitive algorithms. Experimental results demonstrate that MABA is both effective and efficient (running in near-linear time with respect to the size of the network) for discovering communities.
Advances in Complex Systems 03/2013; 15(08). DOI:10.1142/S0219525912500361 · 0.97 Impact Factor
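The SABA layer can be sketched as label propagation with a simulated-annealing acceptance rule; the diffusion step, cooling schedule and toy graph below are illustrative assumptions rather than the paper's exact procedure:

```python
import math
import random
from collections import Counter

def sa_label_propagation(adj, temperature=1.0, cooling=0.9, sweeps=60, seed=0):
    """Label propagation with a simulated-annealing acceptance rule.

    Each node starts in its own community. A node proposes the label of a
    random neighbour (the 'ant' diffuses along an edge); improving moves
    are always accepted, worsening ones with Boltzmann probability
    exp(delta / T). A toy stand-in for SABA.
    """
    rng = random.Random(seed)
    labels = {n: n for n in adj}
    t = temperature
    for _ in range(sweeps):
        for node in adj:
            counts = Counter(labels[nb] for nb in adj[node])
            proposal = labels[rng.choice(adj[node])]          # diffuse along an edge
            delta = counts[proposal] - counts[labels[node]]   # change in local agreement
            if delta > 0 or (delta < 0 and rng.random() < math.exp(delta / max(t, 1e-9))):
                labels[node] = proposal
        t *= cooling  # anneal: late sweeps become purely greedy
    return labels

# Two triangles joined by one edge: each triangle settles into one community.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = sa_label_propagation(adj)
```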
ABSTRACT: To further improve the performance of current genetic algorithms for discovering communities, a local-search-based genetic algorithm, GALS, is proposed here. The core of GALS is a local-search-based mutation technique. To overcome the drawbacks of traditional mutation methods, the paper develops the concept of the marginal gene and deduces the local monotonicity of the modularity function Q from each node's local view. Based on these two elements, a new mutation method combined with a local search strategy is presented. GALS has been evaluated on both synthetic benchmarks and several real networks, and compared with presently competing algorithms. Experimental results show that GALS is highly effective and efficient for discovering communities.
International Journal of Computational Intelligence Systems 03/2013; 6(2). DOI:10.1080/18756891.2013.773175 · 0.45 Impact Factor
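The modularity function Q that the local search optimizes is, for an undirected graph, Q = (1/2m) * sum over i,j of (A_ij - k_i*k_j/2m) * delta(c_i, c_j). A direct, O(n^2) sketch on a toy graph (the partition labels are illustrative):

```python
def modularity(adj, labels):
    """Newman-Girvan modularity Q of a labelled undirected graph.

    adj maps node -> list of neighbours; labels maps node -> community.
    """
    two_m = sum(len(nbs) for nbs in adj.values())  # 2m = sum of degrees
    q = 0.0
    for i in adj:
        for j in adj:
            a_ij = 1 if j in adj[i] else 0
            q += (a_ij - len(adj[i]) * len(adj[j]) / two_m) * (labels[i] == labels[j])
    return q / two_m

# Two triangles joined by one edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
good = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}  # the natural split
flat = {n: "x" for n in adj}                             # everything in one community
```

Putting the whole network in one community always yields Q = 0, which is why a positive Q certifies structure beyond the degree-preserving null model.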
ABSTRACT: The real-world continuous double auction (CDA) market is a dynamic environment, yet most existing agent bidding strategies are designed for static markets. A new bidding strategy is necessary for more practical simulations and applications. In this paper, we present a novel agent-based computing approach called the GDX Plus (GDXP) model. In the proposed model, bids are decided according to historical market events combined with forecasts of market trends. The GDXP model employs a dynamic adjustment mechanism that adapts the bidding strategy to shocks in a dynamic environment. Experimental comparisons between GDXP and other typical models, in both static and dynamic CDA markets, demonstrate the performance of the GDXP model.
Web Intelligence and Agent Systems 01/2013; 11(1):55-65. DOI:10.3233/WIA-130262
ABSTRACT: Community structure is ubiquitous in real-world networks, and community detection is of fundamental importance in many applications. Although considerable efforts have been made to address the task, achieving a good trade-off between effectiveness and efficiency, especially for large-scale networks, remains challenging. This paper explores the nature of community structure from a probabilistic perspective and introduces a novel community detection algorithm named PMC, which stands for probabilistically mining communities, to meet this objective. In PMC, community detection is modeled as a constrained quadratic optimization problem that can be efficiently solved by a random walk based heuristic. The performance of PMC has been rigorously validated through comparisons with six representative methods on both synthetic and real-world networks of different scales. Moreover, two applications of analyzing real-world networks by means of PMC are demonstrated.
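The random-walk heuristic can be sketched with a few lazy walk steps from a seed node: probability mass that stays trapped around the seed marks its community. This is an illustrative toy, not PMC's actual optimizer:

```python
def random_walk_scores(adj, seed, steps=5):
    """Probability mass after a few lazy random-walk steps from a seed node.

    At each step half the mass stays put and half spreads uniformly over
    neighbours; mass trapped near the seed identifies its community.
    """
    p = {n: 0.0 for n in adj}
    p[seed] = 1.0
    for _ in range(steps):
        nxt = {n: 0.5 * p[n] for n in adj}      # lazy: keep half in place
        for node, nbs in adj.items():
            share = 0.5 * p[node] / len(nbs)    # spread the other half
            for nb in nbs:
                nxt[nb] += share
        p = nxt
    return p

# Two triangles joined by one edge; the seed's triangle keeps most mass.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = random_walk_scores(adj, seed=0)
```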
ABSTRACT: Malaria transmission can be affected by multiple or even hidden factors, making it difficult to predict, in a timely and accurate manner, the impact of elimination and eradication programs and the potential resurgence and spread that may continue to emerge. One current approach is to develop and deploy surveillance systems that attempt to identify such factors as early as possible, enabling policy makers to modify and implement strategies for further preventing transmission. Most surveillance data are temporal and spatial in nature. From an interdisciplinary point of view, it is interesting to ask the following important and challenging question: based on the available temporal and spatial surveillance data, how can we build a more effective surveillance mechanism for monitoring and early detection of the relative prevalence and transmission patterns of malaria? Existing clustering-based surveillance software systems do not infer the underlying transmission networks of malaria, yet such networks can be quite informative and insightful, as they characterize how malaria spreads from one place to another. They can in turn allow public health policy makers and researchers to uncover hidden and interacting factors, such as environment, genetics and ecology, and to discover or predict malaria transmission patterns and trends. The network perspective further extends present approaches to modelling malaria transmission based on a set of chosen factors. In this article, we survey related work on transmission network inference, discuss how such an approach can be utilized to develop an effective computational means for inferring malaria transmission networks from partial surveillance data, and outline the methodological steps and issues involved in its formulation and validation.
ABSTRACT: Discovery of communities in complex networks is a fundamental data analysis problem with applications in various domains. Most of the existing approaches have focused on discovering communities of nodes, while recent studies have shown great advantages and utilities of the knowledge of communities of links in networks. From this new perspective, we propose a link dynamics based algorithm, called UELC, for identifying link communities of networks. In UELC, the stochastic process of a link-node-link random walk is employed to unfold an embedded bipartition structure of links in a network. The local mixing properties of the Markov chain underlying the random walk are then utilized to extract two emerging link communities. Further, the random walk and the bipartitioning processes are wrapped in an iterative subdivision strategy to recursively identify link partitions that segregate the network links into multiple subdivisions. We evaluate the performance of the new method on synthetic benchmarks and demonstrate its utility on real-world networks. Our experimental results show that our method is highly effective for discovering link communities in complex networks. As a comparison, we also extend UELC to extracting communities of nodes, and show that it is effective for node community identification.
Journal of Statistical Mechanics Theory and Experiment 10/2012; 2012(10):P10015. DOI:10.1088/1742-5468/2012/10/P10015 · 2.40 Impact Factor
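One step of the link-node-link walk can be sketched directly: hop from the current link to one of its endpoints, then to a random link incident to that endpoint. The naive O(|E|) incidence scan is for illustration only:

```python
import random

def link_node_link_step(link, edges, rng):
    """One step of the link-node-link walk: hop to a random endpoint,
    then to a random link incident to that endpoint."""
    node = rng.choice(link)                        # link -> node
    incident = [e for e in edges if node in e]     # node -> incident links
    return rng.choice(incident)                    # node -> link

# Two triangles joined by the bridge (2, 3); walk 10 steps over the links.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
rng = random.Random(1)
walk = [(0, 1)]
for _ in range(10):
    walk.append(link_node_link_step(walk[-1], edges, rng))
```

Every consecutive pair of links in the walk shares a node by construction; it is the slow mixing of this chain across the bridge that exposes the link bipartition.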
ABSTRACT: Network communities are groups of vertices within which connecting links are dense but between which they are sparse. A network community mining problem (NCMP for short) is the problem of finding all such communities in a given network. A wide variety of applications can be formulated as NCMPs, ranging from social and/or biological network analysis to web mining and searching. Many algorithms addressing NCMPs have been developed, and most fall into the category of either optimization-based or heuristic methods. Distinct from existing studies, the work presented in this paper explores the notion of network communities and their properties based on the dynamics of a naturally introduced stochastic model. A relationship between the hierarchical community structure of a network and the local mixing properties of this stochastic model is established using large-deviation theory, so that topological information about the community structures hidden in networks can be inferred from their spectral signatures. Based on this relationship, the work proposes a general framework for characterizing, analyzing and mining network communities. Utilizing the two basic properties of metastability, i.e., being locally uniform and temporarily fixed, an efficient implementation of the framework, called the LM algorithm, has been developed that scalably mines communities hidden in large-scale networks. The effectiveness and efficiency of the LM algorithm have been theoretically analyzed as well as experimentally validated.
IEEE Transactions on Knowledge and Data Engineering 03/2012; 24(2):326-337. DOI:10.1109/TKDE.2010.233 · 2.07 Impact Factor
ABSTRACT: The main task of a recommender system is to accurately and proactively provide users with information or services they are potentially interested in. Collaborative filtering is one of the most widely adopted recommendation methods, but it suffers from sparse rating data, which severely degrades the quality of recommendations. To address this issue, this article proposes a novel method named FTRA (Fusing Trust and Ratings), which improves collaborative filtering by elaborately integrating two sparse sources of information: the conventional rating data given by users and the social trust network among the same users. The performance of FTRA is rigorously validated by comparing it with six representative methods on a real-world dataset. The experimental results show that FTRA outperforms all competitors in terms of both precision and recall. More importantly, our work suggests that augmenting sparse rating data with trust networks significantly improves the quality of conventional collaborative filtering, and that quality could be further improved by designing more effective integration schemes.
Computational Intelligence and Security (CIS), 2012 Eighth International Conference on; 01/2012
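A minimal trust-rating fusion rule, standing in for the FTRA's integration scheme (the blending weight alpha and the toy data are assumptions):

```python
def predict_rating(user, item, ratings, trust, alpha=0.5):
    """Blend two sparse signals: ratings of the target item by (a) users the
    target user trusts and (b) all other raters as a fallback average.

    A toy fusion rule; the paper's integration scheme is more elaborate.
    """
    raters = {u: r[item] for u, r in ratings.items() if item in r and u != user}
    if not raters:
        return None                         # nothing to recommend from
    trusted = [r for u, r in raters.items() if u in trust.get(user, set())]
    overall = sum(raters.values()) / len(raters)
    if not trusted:
        return overall                      # fall back to plain averaging
    return alpha * (sum(trusted) / len(trusted)) + (1 - alpha) * overall

# Carol has never rated "book", but she trusts Alice, whose rating of 5
# pulls the prediction above the plain average of 3.
ratings = {
    "alice": {"book": 5},
    "bob":   {"book": 1},
    "carol": {"film": 4},
}
trust = {"carol": {"alice"}}
prediction = predict_rating("carol", "book", ratings, trust)
```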
ABSTRACT: In this paper, we present an enhanced fuzzy k-nearest neighbor (FKNN) classifier based computer-aided diagnostic (CAD) system for thyroid disease. The neighborhood size k and the fuzzy strength parameter m of the FKNN classifier are adaptively specified by particle swarm optimization (PSO). Adaptive control parameters, including time-varying acceleration coefficients (TVAC) and a time-varying inertia weight (TVIW), are employed to efficiently balance the local and global search ability of the PSO algorithm. In addition, we validate the effectiveness of principal component analysis (PCA) in constructing a more discriminative subspace for classification. The effectiveness of the resulting CAD system, termed PCA-PSO-FKNN, has been rigorously evaluated on the thyroid disease dataset commonly used by researchers applying machine learning methods to thyroid disease diagnosis. Compared to existing methods in previous studies, the proposed system achieves the highest classification accuracy reported so far via 10-fold cross-validation (CV), with a mean accuracy of 98.82% and a maximum accuracy of 99.09%. Promisingly, the proposed CAD system may serve as a powerful tool for diagnosing thyroid disease.
Journal of Medical Systems 12/2011; 36(5):3243-54. DOI:10.1007/s10916-011-9815-x · 2.21 Impact Factor
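The FKNN decision rule being tuned can be sketched as distance-weighted fuzzy memberships; the neighbor set and class labels below are illustrative:

```python
def fknn_memberships(x, neighbors, m=2.0):
    """Fuzzy k-NN class memberships for a query point x.

    neighbors: list of (point, class_label). Each neighbour votes with
    weight 1 / d^(2/(m-1)), so the fuzzy strength m controls how sharply
    distance is discounted (the parameter tuned by PSO in the paper).
    """
    weights = {}
    total = 0.0
    for point, label in neighbors:
        d = sum((a - b) ** 2 for a, b in zip(x, point)) ** 0.5
        w = 1.0 / max(d, 1e-12) ** (2.0 / (m - 1.0))  # guard against d == 0
        weights[label] = weights.get(label, 0.0) + w
        total += w
    return {label: w / total for label, w in weights.items()}

# Two nearby "normal" neighbours outweigh one distant "hyper" neighbour.
neighbors = [((0.0, 0.0), "normal"), ((0.1, 0.0), "normal"), ((1.0, 1.0), "hyper")]
memberships = fknn_memberships((0.05, 0.0), neighbors)
```

Unlike crisp k-NN, the output is a membership vector summing to 1 rather than a single label, which is what allows the PSO-tuned m to trade off confidence against smoothness.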
ABSTRACT: Bankruptcy prediction is one of the most important issues in financial decision-making, and constructing effective corporate bankruptcy prediction models in time is essential for companies and banks to avert bankruptcy. This study proposes a novel bankruptcy prediction model based on an adaptive fuzzy k-nearest neighbor (FKNN) method, where the neighborhood size k and the fuzzy strength parameter m are adaptively specified by continuous particle swarm optimization (PSO). In addition to optimizing the FKNN parameters, PSO is also utilized to choose the most discriminative subset of features for prediction. Adaptive control parameters, including time-varying acceleration coefficients (TVAC) and a time-varying inertia weight (TVIW), are employed to efficiently balance the local and global search ability of the PSO algorithm. Moreover, both the continuous and binary PSO are implemented in parallel on a multi-core platform. The proposed bankruptcy prediction model, named PTVPSO-FKNN, is compared with five other state-of-the-art classifiers on two real-life cases. The obtained results clearly confirm the superiority of the proposed model in terms of classification accuracy, Type I error, Type II error and area under the receiver operating characteristic curve (AUC). The proposed model also identifies the most discriminative financial ratios and, owing to its parallel implementation, greatly reduces computational time. Promisingly, PTVPSO-FKNN may serve as a powerful early warning system for bankruptcy prediction.
Knowledge-Based Systems 12/2011; 24(8):1348-1359. DOI:10.1016/j.knosys.2011.06.008 · 2.95 Impact Factor