Xiangfeng Luo

Shanghai University, Shanghai, Shanghai Shi, China

Publications (89) · 32.85 Total impact

  • Source
    ABSTRACT: Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two optimized nonnegative matrices appropriate for the intended application. The method has been widely used for unsupervised learning tasks, including recommender systems (a rating matrix of users by items) and document clustering (a weighting matrix of papers by keywords). However, traditional NMF methods typically assume the number of latent factors (i.e., the dimensionality of the loading matrices) to be fixed, which makes them inflexible for many applications. In this paper, we propose a nonparametric NMF framework that mitigates this issue by using dependent Indian Buffet Processes (dIBP). In a nutshell, we apply a correlation function to the generation of the two stick weights associated with each pair of columns of the loading matrices, while still maintaining their respective marginal distributions as specified by the IBP. As a consequence, the generation of the two loading matrices becomes column-wise (indirectly) correlated. Under this framework, two classes of correlation function are proposed: (1) using the bivariate beta distribution and (2) using a copula function. Both allow our work to be adapted to various applications by flexibly choosing appropriate parameter settings. Compared with other state-of-the-art approaches in this area, such as the Gaussian Process (GP)-based dIBP, our work is much more flexible in allowing the two corresponding binary matrix columns to have greater variation in their non-zero entries. Our experiments on real-world and synthetic datasets show that the three proposed models perform well on the document clustering task compared with standard NMF, without predefining the dimensionality of the factor matrices, and that the bivariate beta distribution-based and copula-based models are more flexible than the GP-based model.
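    As context for the entry above, here is a minimal sketch of the fixed-rank NMF baseline that the nonparametric framework generalizes. The dIBP machinery itself is not shown; the rank k, iteration count, and epsilon are illustrative assumptions.

```python
import numpy as np

def nmf(V, k=10, iters=200, eps=1e-9):
    """Fixed-rank NMF via Lee-Seung multiplicative updates: V ~ W @ H.

    Baseline sketch only; the paper's point is inferring k itself
    through dependent Indian Buffet Processes rather than fixing it.
    """
    m, n = V.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H, stays nonnegative
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W, stays nonnegative
    return W, H

# Toy rating matrix: 6 users by 5 items.
V = np.random.default_rng(1).random((6, 5))
W, H = nmf(V, k=2)
print(np.linalg.norm(V - W @ H))  # reconstruction error
```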
  • Source
    Junyu Xuan · Jie Lu · Xiangfeng Luo · Guangquan Zhang
    ABSTRACT: Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two optimized nonnegative matrices and has been widely used for unsupervised learning tasks such as product recommendation based on a rating matrix. However, standard NMF overlooks the networks that exist between nodes of the same kind, e.g., the social network between users. This leads to comparatively low recommendation accuracy, because such networks also reflect the nature of the nodes, such as the preferences of users in a social network. Moreover, social networks, as complex networks, exhibit many different structures. Each structure is a composition of links between nodes and reflects the nature of the nodes, so retaining different network structures leads to differences in recommendation performance. To investigate the impact of these network structures on the factorization, this paper proposes four multi-level network factorization algorithms based on standard NMF, which integrate the vertical network (e.g., the rating matrix) with the structures of a horizontal network (e.g., a user social network). These algorithms are carefully designed, with corresponding convergence proofs, to retain four desired network structures. Experiments on synthetic data show that the proposed algorithms preserve the desired network structures as designed. Experiments on real-world data show that considering the horizontal networks improves the accuracy of document clustering and recommendation over standard NMF, and that the various structures perform differently on these two tasks. These results can be used directly in document clustering and recommendation systems.
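    The paper's four algorithms each retain a specific horizontal-network structure; as a generic illustration only, the sketch below shows the standard graph-regularized NMF pattern in which a horizontal adjacency matrix A enters the updates through its graph Laplacian. The penalty weight lam and the update rules follow the common graph-regularized recipe, not the paper's four variants.

```python
import numpy as np

def graph_nmf(X, A, k=5, lam=0.1, iters=200, eps=1e-9):
    """Sketch: X ~ U @ V.T with a Tr(V.T @ L @ V) smoothness penalty,
    where L = D - A is the Laplacian of a horizontal network A linking
    the n columns of X (e.g., a user social network)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    U, V = rng.random((m, k)), rng.random((n, k))
    D = np.diag(A.sum(axis=1))
    for _ in range(iters):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # The lam * A @ V term pulls linked columns' factors together.
        V *= (X.T @ U + lam * (A @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```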
  • Source
    ABSTRACT: Traditional relational topic models provide a way to discover the hidden topics in a document network. Many theoretical and practical tasks, such as dimensionality reduction, document clustering, and link prediction, benefit from this revealed knowledge. However, existing relational topic models are based on the assumption that the number of hidden topics is known in advance, which is impractical in many real-world applications. To relax this assumption, we propose a nonparametric relational topic model in this paper. Instead of using fixed-dimensional probability distributions in its generative model, we use stochastic processes. Specifically, a gamma process is assigned to each document to represent its topic interests. Although this provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of a typical document network, i.e., that two closer documents tend to have more similar topics. Furthermore, we require the topics to be shared by all the documents. To resolve these challenges, we use a subsampling strategy to assign each document a different gamma process derived from the global gamma process, and the documents' subsampling probabilities are governed by a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and their number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the capability of learning the hidden topics and, more importantly, the number of topics.
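    As a rough toy of the generative idea above, not the paper's exact model or inference, the sketch below truncates the global gamma process to K atoms and lets each document keep atoms with a probability smoothed toward its network neighbors', a crude stand-in for the Markov Random Field constraint. All sizes and the smoothing scheme are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_docs = 50, 4                               # truncation level, documents
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # toy document network (chain)

# Global gamma process, truncated to K atoms: shared topic weights r_k.
r = rng.gamma(1.0 / K, 1.0, size=K)

# Per-document subsampling probabilities, smoothed toward neighbors'
# (a crude stand-in for the MRF constraint on the network).
p = rng.uniform(0.2, 0.8, size=n_docs)
for _ in range(10):
    p = 0.5 * p + 0.5 * np.array([p[adj[d]].mean() for d in range(n_docs)])

# Each document keeps a subset of the shared atoms -> its topic interests.
theta = np.stack([(rng.random(K) < p[d]) * r for d in range(n_docs)])
print(theta.shape)  # (4, 50): per-document topic intensities, mostly sparse
```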
  • Source
    ABSTRACT: Incorporating the side information of a text corpus, i.e., authors, time stamps, and emotional tags, into traditional text mining models has gained significant interest in information retrieval, statistical natural language processing, and machine learning. One branch of this work is the so-called Author Topic Model (ATM), which incorporates authors' interests as side information into the classical topic model. However, the existing ATM needs to predefine the number of topics, which is difficult and inappropriate in many real-world settings. In this paper, we propose an Infinite Author Topic (IAT) model to resolve this issue. Instead of assigning a discrete probability to a fixed number of topics, we use a stochastic process to determine the number of topics from the data itself. Specifically, we extend a gamma-negative binomial process to three levels in order to capture the author-document-keyword hierarchical structure. Furthermore, each document is assigned a mixed gamma process that accounts for the multiple authors' contributions to the document. An efficient Gibbs sampling inference algorithm, with every conditional distribution in closed form, is developed for the IAT model. Experiments on several real-world datasets show the capability of our IAT model to learn the hidden topics, the authors' interests in these topics, and the number of topics simultaneously.
  • Xiao Wei · Xiangfeng Luo · Qing Li · Jun Zhang · Zheng Xu
    ABSTRACT: Online comments have become a popular and efficient way for sellers to acquire feedback from customers and improve their service quality. However, several key issues must be solved before hotel service quality can be evaluated and improved automatically from online comments, such as how to use less trustworthy online comments, how to discover quality defects from online comments, and how to recommend more feasible or economical evaluation indexes for improving service quality based on online comments. To solve these problems, this paper first improves fuzzy comprehensive evaluation (FCE) by incorporating a trustworthiness degree into it and proposes an automatic hotel service quality assessment method using the improved FCE, which can automatically derive a more trustworthy evaluation from a large number of less trustworthy online comments. Then, the causal relations among evaluation indexes are mined from online comments to build a fuzzy cognitive map of hotel service quality, which is useful for uncovering the problematic areas of hotel service quality and recommending more economical solutions for improving it. Finally, both case studies and experiments demonstrate that the proposed methods are effective in evaluating and improving hotel service quality using online comments.
    IEEE Transactions on Fuzzy Systems 02/2015; 23(1):72-84. DOI:10.1109/TFUZZ.2015.2390226 · 6.31 Impact Factor
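    To make the trust-weighted fuzzy comprehensive evaluation step above concrete, here is a toy sketch in which each comment carries a trust degree that weights its contribution to the fuzzy membership matrix. The grade set, index weights, and trust scores are all invented for illustration, not taken from the paper.

```python
import numpy as np

grades = ["poor", "fair", "good", "excellent"]
indexes = ["cleanliness", "service", "location"]
w = np.array([0.4, 0.4, 0.2])        # index weights (illustrative)

# Each comment: (trust degree in [0, 1], chosen grade per index).
comments = [
    (0.9, [2, 3, 3]),
    (0.3, [0, 1, 2]),                # low-trust comment contributes little
    (0.7, [2, 2, 3]),
]

# Trust-weighted fuzzy membership matrix R: indexes x grades.
R = np.zeros((len(indexes), len(grades)))
for trust, votes in comments:
    for i, g in enumerate(votes):
        R[i, g] += trust
R /= R.sum(axis=1, keepdims=True)    # normalize each index row

b = w @ R                            # fuzzy comprehensive evaluation vector
print(dict(zip(grades, b.round(3))), "->", grades[int(b.argmax())])
```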
  • Junyu Xuan · Jie Lu · Guangquan Zhang · Xiangfeng Luo
    ABSTRACT: Graph mining has been a popular research area because of its numerous application scenarios. Much unstructured and structured data can be represented as graphs, such as documents, chemical molecular structures, and images. However, an issue with current research on graphs is that it cannot adequately discover the topics hidden in graph-structured data, which can benefit both unsupervised and supervised learning on graphs. Although topic models have proved very successful in discovering latent topics, standard topic models cannot be directly applied to graph-structured data due to the "bag-of-words" assumption. In this paper, an innovative graph topic model (GTM) is proposed to address this issue, which uses Bernoulli distributions to model the edges between nodes in a graph. It can therefore make the edges in a graph contribute to latent topic discovery and further improve the accuracy of supervised and unsupervised learning on graphs. Experimental results on two different types of graph datasets show that the proposed GTM outperforms latent Dirichlet allocation on classification when the topics unveiled by the two models are used to represent graphs.
    IEEE Transactions on Cybernetics 01/2015; DOI:10.1109/TCYB.2014.2386282 · 3.47 Impact Factor
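    As a toy illustration of the edge-modeling idea above, the sketch below scores a graph's observed edges with Bernoulli probabilities that depend on the topic assignments of their endpoints. The topic-pair parameter matrix eta and the assignments are invented values, not the paper's learned quantities or inference procedure.

```python
import numpy as np

# eta[s, t]: Bernoulli probability of an edge between a node on topic s
# and a node on topic t (symmetric, illustrative values).
eta = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.7, 0.2],
                [0.1, 0.2, 0.9]])

z = np.array([0, 0, 1, 1, 2])             # toy topic assignment per node
edges = [(0, 1), (2, 3), (3, 4), (0, 4)]  # observed edges of the graph

# Log-likelihood of the observed edges under the Bernoulli edge model;
# assignments that make linked nodes topically compatible score higher.
ll = sum(np.log(eta[z[i], z[j]]) for i, j in edges)
print(ll)
```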
  • Xiangfeng Luo · Jun Zhang · Qing Li · Xiao Wei · Lei Lu
    ABSTRACT: This paper advocates a novel approach to recommending texts at various levels of difficulty based on a proposed measure, the algebraic complexity of texts (ACT). Unlike traditional complexity measures that mainly focus on surface features such as the number of syllables per word, characters per word, or words per sentence, ACT draws on the perspective of human concept learning, which can reflect the complex semantic relations inside texts. To cope with the high cost of measuring ACT, the Degree-2 Hypothesis of ACT is proposed to reduce the measurement from unrestricted dimensions to three dimensions. Based on the principle of the "mental anchor," an extension of ACT and its general edition [denoted extension of text algebraic complexity (EACT) and general extension of text algebraic complexity (GEACT)] are developed, which take the complexities of keywords and association rules into account. Finally, using scores given by humans as a benchmark, we compare the proposed methods with linguistic models. The experimental results show the order GEACT > EACT > ACT > linguistic models, meaning that GEACT performs best while the linguistic models perform worst. Additionally, GEACT with lower convex functions is best at measuring the algebraic complexity of text understanding, which may also indicate that the human complexity curve tends to be lower convex rather than linear.
    IEEE Transactions on Human-Machine Systems 10/2014; 44(5):638-649. DOI:10.1109/THMS.2014.2329874 · 1.98 Impact Factor
  • Yang Liu · Xiangfeng Luo · Junyu Xuan
    ABSTRACT: Online hot event discovery has become a flourishing frontier in which online document streams are monitored so that documents are assigned to previously detected events or reveal newly occurring ones. However, hot events naturally evolve, and their topically related words are likely to evolve with them, which makes event discovery a challenging task for traditional mining approaches. Combining word association and semantic community, the Association Link Network (ALN) organizes loosely distributed associated resources. This paper presents a novel ALN-based online hot event discovery approach that proceeds in three stages. In the first stage, we extract significant features to represent the content of each document in the online document stream. In the second stage, we classify the online document stream into topically related detected events, accounting for event evolution in the form of ALN. In the third stage, we create an ALN-based event detection algorithm to discover newly occurring hot events in a timely manner. The online datasets used in our empirical studies were acquired from Baidu News and span 1315 hot events and 236,300 documents. Experimental results demonstrate the approach's hot event discovery ability, with high accuracy, good scalability, and short runtime. Copyright © 2014 John Wiley & Sons, Ltd.
    Concurrency and Computation Practice and Experience 09/2014; DOI:10.1002/cpe.3374 · 0.78 Impact Factor
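    The assign-or-create step described above is, in spirit, the classic single-pass strategy from Topic Detection and Tracking. Below is a minimal sketch of that step only; the ALN construction and evolution handling are not shown, and the similarity threshold and centroid-drift rate are invented.

```python
import numpy as np

def assign_or_create(doc_vec, events, threshold=0.5):
    """Single-pass step: attach a document to its most similar event or
    open a new one. `events` maps event id -> centroid vector."""
    best_id, best_sim = None, -1.0
    for eid, centroid in events.items():
        sim = doc_vec @ centroid / (
            np.linalg.norm(doc_vec) * np.linalg.norm(centroid) + 1e-12)
        if sim > best_sim:
            best_id, best_sim = eid, sim
    if best_sim >= threshold:
        # Fold the document in; the centroid drifts as the event evolves.
        events[best_id] = 0.9 * events[best_id] + 0.1 * doc_vec
        return best_id
    new_id = max(events, default=-1) + 1
    events[new_id] = doc_vec.copy()
    return new_id
```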
  • Chuanping Hu · Zheng Xu · Yunhuai Liu · Lin Mei · Lan Chen · Xiangfeng Luo
    ABSTRACT: Recent research shows that multimedia resources in the wild are growing at a staggering rate. This rapid increase has created an urgent need for intelligent methods to organize and process them. In this paper, the semantic link network model is used to organize multimedia resources, and a complete model for generating association relations between multimedia resources using the semantic link network is proposed. The definitions, modules, and mechanisms of the semantic link network are used in the proposed method. The integration of the semantic link network with multimedia resources offers a new way of organizing them according to their semantics. The tags and surrounding texts of multimedia resources are used to measure their semantic association, and the hierarchical semantics of multimedia resources are defined by their annotated tags and surrounding texts, which are treated differently in the proposed framework. The modules of the semantic link network model are implemented to measure association relations. A real dataset of 100,000 Flickr images with social tags is used in our experiments. Two evaluations, clustering and retrieval, show that the proposed method measures the semantic relatedness between Flickr images accurately and robustly.
    IEEE Transactions on Emerging Topics in Computing 09/2014; 2(3):376-387. DOI:10.1109/TETC.2014.2316525
  • Jun Zhang · Xiangfeng Luo · Lei Lu · Weidong Liu
    ABSTRACT: The acquisition of deep textual semantics is a key issue for significantly improving the performance of e-learning, web search, web knowledge services, and the like. Though many models have been developed to acquire textual semantics, acquiring deep textual semantics remains a challenge. Herein, an acquisition model for deep textual semantics is developed to enhance the capability of text understanding, comprising two parts: 1) how to obtain and organize the domain knowledge extracted from a text set and 2) how to activate that domain knowledge to obtain deep textual semantics. The activation process draws on the Gough reading model, the Landscape model, and the cognitive process of memory; the Gough model is the principal human reading model that enables readers to acquire deep semantics during text reading. A generalized semantic field is proposed to store the domain knowledge in the form of Long-Term Memory (LTM). A specialized semantic field, acquired through the interaction between a text fragment and the domain knowledge, is introduced to describe how textual semantics change. Through their mutual action, deep textual semantics can be obtained, enhancing the capability of text understanding; the machine can therefore understand text more precisely and correctly than models that obtain only surface textual semantics.
    International Journal of Cognitive Informatics and Natural Intelligence 08/2014; 6(2):82-103. DOI:10.4018/jcini.2012040105
  • Xinzhi Wang · Xiangfeng Luo · Huiming Liu
    ABSTRACT: Web events, whose data constitute one kind of big data, have attracted considerable interest in recent years. However, most existing related works fail to measure the veracity of web events. In this research, we propose an approach to measuring the veracity of a web event via its uncertainty. First, the proposed approach mines from the web event's data several event features that may influence the measurement of uncertainty. Second, a computational model is introduced to simulate how these features influence the evolution of the web event. Third, matrix operations confirm that the result of the proposed iterative algorithm coincides with the computational model. Finally, experiments based on the above analysis show that the proposed uncertainty measuring algorithm is efficient and measures the veracity of web events from big data with high accuracy.
    Journal of Systems and Software 07/2014; 102. DOI:10.1016/j.jss.2014.07.023 · 1.25 Impact Factor
  • Zheng Xu · Xiangfeng Luo · Shunxiang Zhang · Xiao Wei · Lin Mei · Chuanping Hu
    ABSTRACT: In this paper, we study the problem of mining temporal semantic relations between entities. The goal is to mine and annotate a semantic relation with temporal, concise, and structured information that can reveal the explicit, implicit, and diverse semantic relations between entities. Such temporal semantic annotations can help users learn and understand unfamiliar or newly emerged semantic relations between entities. The proposed temporal semantic annotation structure integrates features from IEEE and Renlifang. We propose a general method to generate the temporal semantic annotation of a semantic relation between entities by constructing its connection entities, lexical-syntactic patterns, context sentences, context graph, and context communities. Empirical experiments on two different datasets, a LinkedIn dataset and a movie star dataset, show that the proposed method is effective and accurate. Unlike manually generated annotation repositories such as Wikipedia and LinkedIn, the proposed method automatically mines the semantic relation between entities and needs no prior knowledge such as an ontology or a hierarchical knowledge base. The method can be applied to a range of web mining tasks, which demonstrates the usefulness of the mined temporal semantic relations.
    Future Generation Computer Systems 07/2014; 37:468–477. DOI:10.1016/j.future.2013.09.027 · 2.64 Impact Factor
  • Jun Zhang · Qing Li · Xiangfeng Luo · Xiao Wei
    ABSTRACT: One of the most fundamental tasks in providing better Web services is the discovery of inter-word relations. However, the state of the art either acquires specific relations (e.g., causality) at the cost of much human effort, or is incapable of specifying relations in detail when no human effort is needed. In this paper, we propose a novel mechanism based on linguistics and cognitive psychology to automatically learn and specify association relations between words. The proposed mechanism, termed ALSAR, includes two major processes: the first learns association relations from the perspective of verb valency grammar in linguistics, and the second further labels/specifies the association relations with the help of related verbs. ALSAR thus provides semantic descriptors that make inter-word relations more explicit without any human labeling. Furthermore, ALSAR incurs very low complexity, and experimental evaluations on Chinese news articles crawled from Baidu News demonstrate its good performance.
    Web-Age Information Management, 06/2014: pages 578-589
  • Zheng Xu · Xiao Wei · Xiangfeng Luo · Yunhuai Liu · Lin Mei · Chuanping Hu · Lan Chen
    ABSTRACT: Recent years have witnessed explosive growth in the volume, velocity, and variety of data available on the Internet. Data originating from many types of sources, including mobile devices, sensors, individual archives, social networks, the Internet of Things, enterprises, cameras, software logs, and health records, pose one of the most challenging research issues of the big data era. In this paper, Knowle, an online news management system built upon the semantic link network model, is introduced. Knowle is a news-event-centric data management system whose core elements are news events on the Web, linked by their semantic relations. Knowle is a hierarchical data system with three layers: the bottom layer (concepts), the middle layer (resources), and the top layer (events). The basic building blocks of the Knowle system are described: news collection, resource representation, semantic relation mining, and semantic linking of news events. Knowle does not require data providers to follow semantic standards such as RDF or OWL; it is a semantics-rich, self-organized network that reflects various semantic relations among concepts, news, and events. Moreover, in the case study, Knowle is used for organizing and mining health news, which shows its potential to form the basis for designing and developing a big-data-analytics-based innovation framework in the health domain.
    Future Generation Computer Systems 04/2014; 43. DOI:10.1016/j.future.2014.04.002 · 2.64 Impact Factor
  • Shunxiang Zhang · Xiangfeng Luo · Junyu Xuan · Xue Chen · Weimin Xu
    ABSTRACT: The Association Link Network (ALN) is a kind of Semantic Link Network built by mining the association relations among multimedia Web resources to effectively support intelligent Web applications such as Web-based learning and semantic search. This paper explores the Small-World properties of ALN to provide theoretical support for association learning (i.e., the simple idea of "learning from Web resources"). First, a filtering algorithm for ALN is proposed to generate filtered states of the network, in order to observe its Small-World properties at a given network size and filtering parameter. A comparison of the Small-World properties of ALN and a random graph shows that ALN exhibits a prominent Small-World characteristic. Then, we investigate the evolution of the Small-World properties over time at several incremental network sizes. The average path length of ALN scales with network size, while its clustering coefficient is independent of network size. We also find that ALN has a smaller average path length and a higher clustering coefficient than the WWW at the same network size and average degree. Finally, based on the Small-World characteristic of ALN, we present an Association Learning Model (ALM), which can efficiently support association learning of Web resources in breadth or depth.
    World Wide Web 03/2014; DOI:10.1007/s11280-012-0171-7 · 1.62 Impact Factor
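    The two statistics compared above, average path length and clustering coefficient, can be checked on any graph snapshot. Below is a brief sketch using networkx, with a Watts-Strogatz graph standing in for ALN and an equal-sized random graph as the baseline; the sizes and parameters are illustrative only.

```python
import networkx as nx

def small_world_stats(G):
    """Average shortest path length and average clustering coefficient,
    computed on the largest connected component if G is disconnected."""
    if not nx.is_connected(G):
        G = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_shortest_path_length(G), nx.average_clustering(G)

n, k = 1000, 10
aln_like = nx.watts_strogatz_graph(n, k, 0.1, seed=0)  # stand-in for ALN
random_g = nx.gnm_random_graph(n, aln_like.number_of_edges(), seed=0)

print("ALN-like:", small_world_stats(aln_like))
print("random  :", small_world_stats(random_g))
# Small-World signature: comparable path length, much higher clustering.
```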
  • Source
    Zheng Xu · Xiangfeng Luo · Yunhuai Liu · Lin Mei · Chuanping Hu
    ABSTRACT: Measuring relatedness between multimedia items such as images and videos plays an important role in computer vision and underlies many multimedia applications, including clustering, searching, recommendation, and annotation. Recently, with the explosion of social media, users can upload media data and annotate content with descriptive tags. In this paper, we aim to measure the semantic relatedness of Flickr images. First, four information-theoretic functions are used to measure the semantic relatedness of tags. Second, an integration of tag pairs based on a bipartite graph is proposed to remove noise and redundancy. Third, tag order information is added to the relatedness measure, which emphasizes tags in high positions. A dataset of 1000 Flickr images is used to evaluate the proposed method. Two data mining tasks, clustering and searching, demonstrate the method's effectiveness and robustness. Moreover, applications such as searching and faceted exploration are introduced, showing that the proposed method has broad prospects for web-based tasks.
    The Scientific World Journal 02/2014; 2014(4):758089. DOI:10.1155/2014/758089 · 1.73 Impact Factor
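    As one example of an information-theoretic relatedness function of the kind used above, the sketch below computes pointwise mutual information (PMI) between tags from their co-occurrence counts over an image collection. The tag sets are invented, and PMI is offered only as a representative function, not necessarily one of the paper's four.

```python
import math
from collections import Counter
from itertools import combinations

# Toy tag sets for a few images (invented, Flickr-style annotations).
images = [
    {"beach", "sunset", "sea"},
    {"beach", "sea", "surf"},
    {"city", "night", "lights"},
    {"sunset", "sea", "lights"},
]

N = len(images)
tag_cnt = Counter(t for tags in images for t in tags)
pair_cnt = Counter(frozenset(p) for tags in images
                   for p in combinations(sorted(tags), 2))

def pmi(a, b):
    """Pointwise mutual information of two tags over the collection."""
    co = pair_cnt[frozenset((a, b))]
    if co == 0:
        return float("-inf")      # never co-occur: unrelated under PMI
    return math.log((co / N) / ((tag_cnt[a] / N) * (tag_cnt[b] / N)))

print(pmi("beach", "sea"), pmi("beach", "lights"))
```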
  • Zheng Xu · Xiangfeng Luo · Lin Mei · Chuanping Hu
    ABSTRACT: Association relations between concepts are a class of simple but powerful regularities in binary data, which play important roles in enterprises and organizations with huge amounts of data. However, although large numbers of association relations can easily be mined from databases, existing objective and subjective methods scarcely take semantics into consideration, and it was recognized early in the knowledge discovery literature that most mined relations are of no interest to the user. In this paper, the semantic discrimination capability (SDC) of association relations is first measured based on the discrimination value model. Formulas for SDC integrating both statistical and graph features are proposed under five different strategies. The high correlation coefficient of the proposed measures against discrimination value shows that the proposed SDC measure is accurate. Moreover, an application of SDC to document clustering shows that SDC has broad prospects for data-related tasks such as document clustering. Copyright © 2013 John Wiley & Sons, Ltd.
    Concurrency and Computation Practice and Experience 02/2014; 26(2). DOI:10.1002/cpe.2999 · 0.78 Impact Factor
  • Xiangfeng Luo · Jun Zhang · Feiyue Ye · Peng Wang · Chuanliang Cai
    ABSTRACT: How to build a text knowledge representation model that carries rich knowledge, has flexible reasoning ability, and can be automatically constructed with low computational complexity is a fundamental challenge for reasoning-based knowledge services, especially with the rapid growth of web resources. However, current text knowledge representation models either lose much knowledge [e.g., the vector space model (VSM)], require highly complex computation [e.g., latent Dirichlet allocation (LDA)], or cannot be constructed automatically [e.g., the Web Ontology Language (OWL)]. In this paper, a novel text knowledge representation model, the power series representation (PSR) model, which keeps computation in the knowledge construction process simple, is proposed to reconcile carrying rich knowledge with automatic construction. First, a concept algebra of human concept learning is developed to represent text knowledge in the form of power series. Then, the degree-2 power series hypothesis is introduced to simplify the proposed PSR model, which can be constructed automatically with lower computational complexity and carries more knowledge than VSM and LDA. After that, reasoning operations based on the degree-2 power series hypothesis are developed, which provide more flexible reasoning than OWL and LDA. Furthermore, experiments and comparisons with current knowledge representation models show that our model has better characteristics than the others for representing text knowledge. Finally, a demo indicates that the PSR model has good prospects in web semantic search.
    IEEE Transactions on Systems, Man, and Cybernetics: Systems 01/2014; 44(1):86-102. DOI:10.1109/TSMCC.2012.2231674 · 1.70 Impact Factor
  • Junyu Xuan · Xiangfeng Luo · Jie Lu
    ABSTRACT: On the web, numerous websites publish pages covering the events occurring in society. Web events data satisfy the well-accepted attributes of big data: Volume, Velocity, Variety, and Value. As one source of value in web events data, website preferences can help the followers of web events, e.g., people or organizations, select the proper websites for following the aspects of web events they are interested in. However, the large volume, fast evolution, and multi-source, unstructured nature of the data together make mining website preferences very challenging. In this paper, website preference is first formally defined. Then, according to the hierarchical attribute of web events data, we propose a hierarchical network model to organize the big data of a web event from different organizations, areas, and nations at a given time stamp. With this hierarchical network structure in hand, two strategies are proposed to mine website preferences from web events data. The first, straightforward strategy utilizes the communities of the keyword-level network and the mapping relations between websites and keywords to unveil the value within them. Taking the whole hierarchical network structure into consideration, the second strategy applies an iterative algorithm to refine the keyword communities of the first strategy. Finally, an evaluation criterion for website preferences is designed to compare the performance of the two strategies. Experimental results show that a proper combination of horizontal relations (each level's network) with vertical relations (the mappings between the three levels' networks) extracts more value from web events data and improves the efficiency of website preference mining.
    Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering; 12/2013
  • Zheng Xu · Xiangfeng Luo · Xiao Wei · Lin Mei
    ABSTRACT: Online popular events, constructed from news stories using the techniques of Topic Detection and Tracking (TDT), make it convenient for users to see what is going on through the Internet. Recently, the web has become an important provider and poster of event information thanks to its real-time, open, and dynamic features. However, detecting events is difficult given the huge scale and dynamics of the Internet. In this paper, we define the novel problem of investigating impact factors for event detection. We give definitions of five impact factors: the number of increased web pages, the number of increased keywords, the number of communities, the average clustering coefficient, and the average similarity of web pages. These five impact factors capture both statistical and content information about an event. Empirical experiments on real datasets, including Google Zeitgeist and Google Trends, show that the number of web pages and the average clustering coefficient can be used to detect events. Strategies integrating these two factors are also employed, and evaluations on a real dataset show that the proposed function integrating the number of web pages and the average clustering coefficient can detect events efficiently and correctly.
    Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering; 12/2013
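    A toy sketch of combining the two factors the entry above finds most useful, page-volume growth and the average clustering coefficient of a keyword co-occurrence graph; the integrating function, thresholds, and data here are invented, and the paper's actual combination may differ.

```python
import networkx as nx

def event_score(page_counts, cooccurrence_edges, growth_thresh=1.5):
    """Toy detector: fire when web page volume spikes AND the keyword
    co-occurrence graph is tightly clustered."""
    growth = page_counts[-1] / max(page_counts[-2], 1)  # page-volume growth
    G = nx.Graph()
    G.add_edges_from(cooccurrence_edges)                # keyword graph
    acc = nx.average_clustering(G) if G.number_of_nodes() else 0.0
    return growth * acc, (growth > growth_thresh and acc > 0.3)

edges = [("quake", "rescue"), ("quake", "magnitude"),
         ("rescue", "magnitude"), ("quake", "epicenter")]
print(event_score([120, 480], edges))   # -> (score, detected?)
```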