ABSTRACT: Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two
optimized nonnegative matrices appropriate for the intended application. The
method has been widely used for unsupervised learning tasks, including
recommender systems (a rating matrix of users by items) and document clustering
(a weighting matrix of papers by keywords). However, traditional NMF methods
typically assume the number of latent factors (i.e., the dimensionality of the
loading matrices) to be fixed, which makes them inflexible for many
applications. In this paper, we propose a nonparametric NMF framework that
mitigates this issue by using dependent Indian Buffet Processes (dIBP). In a
nutshell, we apply a correlation function to the generation of the two stick
weights associated with each pair of columns of the loading matrices, while
still maintaining their respective marginal distributions as specified by the
IBP. As a consequence, the two loading matrices are generated in a column-wise
(indirectly) correlated manner. Under this framework, two classes of
correlation function are proposed: (1) using the bivariate beta distribution
and (2) using a copula function. Both methods allow our work to be adapted to
various applications by flexibly choosing appropriate parameter settings.
Compared with other state-of-the-art approaches in this area, such as the
Gaussian Process (GP)-based dIBP, our work is much more flexible in allowing
the two corresponding binary matrix columns to have greater variation in their
non-zero entries. Our experiments on real-world and synthetic datasets show
that the three proposed models perform well on the document clustering task
compared with standard NMF, without predefining the dimension of the factor
matrices, and that the bivariate beta distribution-based and copula-based
models are more flexible than the GP-based model.
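The stick weights mentioned above come from the standard stick-breaking construction of the IBP. As background, here is a minimal pure-Python sketch of that construction (function names and parameters are illustrative, not the paper's implementation): each stick proportion v_k ~ Beta(alpha, 1), the k-th feature probability is pi_k = v_1 · v_2 · … · v_k, and a binary feature matrix is then sampled entry-wise.

```python
import random

def ibp_stick_weights(alpha, num_features, seed=0):
    """Stick-breaking construction of IBP feature probabilities:
    v_k ~ Beta(alpha, 1) and pi_k = v_1 * v_2 * ... * v_k,
    so the pi_k decrease strictly in k."""
    rng = random.Random(seed)
    weights, stick = [], 1.0
    for _ in range(num_features):
        stick *= rng.betavariate(alpha, 1.0)  # break off a fraction of the stick
        weights.append(stick)
    return weights

def sample_binary_matrix(weights, num_rows, seed=1):
    """Each entry z[n][k] = 1 with probability pi_k, independently."""
    rng = random.Random(seed)
    return [[1 if rng.random() < w else 0 for w in weights]
            for _ in range(num_rows)]
```

The dIBP of the paper correlates two such weight sequences column-wise while preserving these IBP marginals; this sketch shows only the single-IBP marginal construction.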
ABSTRACT: Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two
optimized nonnegative matrices and has been widely used for unsupervised
learning tasks such as product recommendation based on a rating matrix.
However, standard NMF overlooks the networks that exist between nodes of the
same nature, e.g., the social network between users. This leads to
comparatively low recommendation accuracy, because these networks also reflect
the nature of the nodes, such as the preferences of users in a social network.
Moreover, social networks, as complex networks, have many different structures.
Each structure is a composition of links between nodes and reflects the nature
of the nodes, so retaining different network structures leads to differences in
recommendation performance. To investigate the impact of these network
structures on the factorization, this paper proposes four multi-level network
factorization algorithms based on standard NMF, which integrate the vertical
network (e.g., the rating matrix) with the structures of the horizontal network
(e.g., the user social network). These algorithms are carefully designed, with
corresponding convergence proofs, to retain four desired network structures.
Experiments on synthetic data show that the proposed algorithms preserve the
desired network structures as designed. Experiments on real-world data show
that considering the horizontal networks improves the accuracy of document
clustering and recommendation over standard NMF, and that the various
structures differ in performance on these two tasks. These results can be
directly applied to document clustering and recommendation.
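Both abstracts above build on standard NMF as their baseline. As background, a minimal pure-Python sketch of the classical Lee-Seung multiplicative updates for minimizing the Frobenius objective ||X − WH||² (illustrative only; the papers' actual algorithms extend this baseline with nonparametric priors and network constraints):

```python
import random

def matmul(A, B):
    """Naive matrix product for lists of lists."""
    m, p = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(X, r, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates; keeps W, H nonnegative throughout."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        WtX = matmul(transpose(W), X)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtX[i][j] / (WtWH[i][j] + eps) for j in range(m)]
             for i in range(r)]
        # W <- W * (X H^T) / (W H H^T)
        XHt = matmul(X, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * XHt[i][j] / (WHHt[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return W, H

def frob_err(X, W, H):
    """Frobenius norm of the residual X - WH."""
    WH = matmul(W, H)
    return sum((X[i][j] - WH[i][j]) ** 2
               for i in range(len(X)) for j in range(len(X[0]))) ** 0.5
```

The multiplicative form guarantees the objective is non-increasing and that nonnegativity is preserved, which is why it is the usual starting point for the extensions described above.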
ABSTRACT: Traditional relational topic models provide a way to discover the hidden
topics in a document network. Many theoretical and practical tasks, such as
dimensionality reduction, document clustering, and link prediction, benefit
from this revealed knowledge. However, existing relational topic models are
based on the assumption that the number of hidden topics is known in advance,
which is impractical in many real-world applications. To relax this assumption,
we propose a nonparametric relational topic model in this paper. Instead of
using fixed-dimensional probability distributions in its generative model, we
use stochastic processes. Specifically, a gamma process is assigned to each
document to represent that document's topic interest. Although this method
provides an elegant solution, it brings additional challenges when
mathematically modeling the inherent network structure of a typical document
network, i.e., two spatially closer documents tend to have more similar topics.
Furthermore, we require that the topics be shared by all the documents. To
resolve these challenges, we use a subsampling strategy to assign each document
a different gamma process derived from the global gamma process, and the
subsampling probabilities of documents are given a Markov Random Field
constraint that inherits the document network structure. Through the designed
posterior inference algorithm, we can discover the hidden topics and their
number simultaneously. Experimental results on both synthetic and real-world
network datasets demonstrate the model's ability to learn the hidden topics
and, more importantly, the number of topics.
ABSTRACT: Incorporating the side information of a text corpus, e.g., authors, time
stamps, and emotional tags, into traditional text mining models has gained
significant interest in information retrieval, statistical natural language
processing, and machine learning. One branch of this work is the so-called
Author Topic Model (ATM), which incorporates authors' interests as side
information into the classical topic model. However, the existing ATM needs to
predefine the number of topics, which is difficult and inappropriate in many
real-world settings. In this paper, we propose an Infinite Author Topic (IAT)
model to resolve this issue. Instead of assigning a discrete probability to a
fixed number of topics, we use a stochastic process to determine the number of
topics from the data itself. Specifically, we extend a gamma-negative binomial
process to three levels in order to capture the author-document-keyword
hierarchical structure. Furthermore, each document is assigned a mixed gamma
process that accounts for multiple authors' contributions to the document. An
efficient Gibbs sampling inference algorithm, with every conditional
distribution in closed form, is developed for the IAT model. Experiments on
several real-world datasets show the ability of the IAT model to learn the
hidden topics, the authors' interests in these topics, and the number of topics
simultaneously.
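The gamma-negative binomial process mentioned above builds on the gamma-Poisson mixture representation of the negative binomial distribution. As background, a minimal sketch of that basic mixture (illustrative only; it is not the IAT model's three-level extension, and the function name is hypothetical): draw a per-document rate from a gamma distribution, then draw a count from a Poisson with that rate.

```python
import math
import random

def gamma_poisson_counts(shape, scale, n_docs, seed=0):
    """Negative-binomial-distributed counts via the gamma-Poisson mixture:
    rate ~ Gamma(shape, scale), count ~ Poisson(rate)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_docs):
        rate = rng.gammavariate(shape, scale)
        # Poisson sampling by Knuth's inversion method (fine for small rates)
        threshold = math.exp(-rate)
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                break
            k += 1
        counts.append(k)
    return counts
```

Marginalizing the gamma rate yields a negative binomial, which is what makes closed-form Gibbs conditionals tractable in models of this family.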
ABSTRACT: Online comments have become a popular and efficient way for sellers to
acquire feedback from customers and improve their service quality. However,
several key issues must be solved to evaluate and improve hotel service quality
automatically from online comments: how to use less trustworthy online
comments, how to discover quality defects from online comments, and how to
recommend more feasible or economical evaluation indexes for improving service
quality based on online comments. To solve these problems, this paper first
improves fuzzy comprehensive evaluation (FCE) by incorporating a
trustworthiness degree and proposes an automatic hotel service quality
assessment method using the improved FCE, which can automatically derive a more
trustworthy evaluation from a large number of less trustworthy online comments.
Then, the causal relations among evaluation indexes are mined from online
comments to build a fuzzy cognitive map of hotel service quality, which helps
to unfold the problematic areas of hotel service quality and to recommend more
economical solutions for improving it. Finally, both case studies and
experiments are conducted to demonstrate that the proposed methods are
effective in evaluating and improving hotel service quality using online
comments.
IEEE Transactions on Fuzzy Systems 02/2015; 23(1):72-84. DOI:10.1109/TFUZZ.2015.2390226 · 6.31 Impact Factor
ABSTRACT: Graph mining has been a popular research area because of its numerous application scenarios. Much unstructured and structured data can be represented as graphs, such as documents, chemical molecular structures, and images. However, an issue with current research on graphs is that it cannot adequately discover the topics hidden in graph-structured data, which can benefit both unsupervised and supervised learning on graphs. Although topic models have proved very successful in discovering latent topics, standard topic models cannot be directly applied to graph-structured data due to the "bag-of-words" assumption. In this paper, an innovative graph topic model (GTM) is proposed to address this issue; it uses Bernoulli distributions to model the edges between nodes in a graph. It can therefore make the edges in a graph contribute to latent topic discovery and further improve the accuracy of supervised and unsupervised learning on graphs. Experimental results on two different types of graph datasets show that the proposed GTM outperforms latent Dirichlet allocation on classification when the topics unveiled by the two models are used to represent graphs.
IEEE Transactions on Cybernetics 01/2015; DOI:10.1109/TCYB.2014.2386282 · 3.47 Impact Factor
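The GTM above models graph edges with Bernoulli distributions. As a hedged illustration of that general idea (not the paper's generative model or inference; the names and the per-topic-pair parameterization are assumptions for the sketch), here is how an adjacency matrix can be scored given node topic assignments and edge probabilities:

```python
import math

def edge_loglik(adj, z, p_edge):
    """Log-likelihood of an undirected adjacency matrix under a Bernoulli
    edge model: edge (i, j) is present with probability p_edge[z[i]][z[j]],
    where z[i] is node i's topic assignment."""
    ll = 0.0
    n = len(adj)
    for i in range(n):
        for j in range(i + 1, n):  # each undirected pair counted once
            p = p_edge[z[i]][z[j]]
            ll += math.log(p) if adj[i][j] else math.log(1.0 - p)
    return ll
```

Under such a score, topic assignments consistent with the graph's link structure receive a higher likelihood, which is the mechanism by which edges can inform topic discovery.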
ABSTRACT: This paper advocates a novel approach to recommending texts at various levels of difficulty based on a proposed measure, the algebraic complexity of texts (ACT). Unlike traditional complexity measures, which mainly focus on surface features such as the number of syllables per word, characters per word, or words per sentence, ACT draws on the perspective of human concept learning, which can reflect the complex semantic relations inside texts. To cope with the high cost of measuring ACT, the Degree-2 Hypothesis of ACT is proposed to reduce the measurement from unrestricted dimensions to three dimensions. Based on the principle of the "mental anchor," an extension of ACT and its general edition (denoted the extension of text algebraic complexity (EACT) and the general extension of text algebraic complexity (GEACT)) are developed, which take the complexities of keywords and association rules into account. Finally, using scores given by humans as a benchmark, we compare the proposed methods with linguistic models. The experimental results show the ordering GEACT > EACT > ACT > linguistic models: GEACT performs best, while the linguistic models perform worst. Additionally, GEACT with lower convex functions is best at measuring the algebraic complexity of text understanding, which may indicate that the human complexity curve tends to resemble a lower convex function rather than a linear function.
IEEE Transactions on Human-Machine Systems 10/2014; 44(5):638-649. DOI:10.1109/THMS.2014.2329874 · 1.98 Impact Factor
ABSTRACT: Recent research shows that multimedia resources in the wild are growing at a staggering rate. This rapid increase in the number of multimedia resources has brought an urgent need for intelligent methods to organize and process them. In this paper, the semantic link network model is used to organize multimedia resources. A complete model for generating association relations between multimedia resources using the semantic link network is proposed. The definitions, modules, and mechanisms of the semantic link network are used in the proposed method. The integration of the semantic link network and multimedia resources provides a new prospect for organizing them by their semantics. The tags and surrounding texts of multimedia resources are used to measure their semantic association. The hierarchical semantics of multimedia resources are defined by their annotated tags and surrounding texts, whose semantics are treated differently in the proposed framework. The modules of the semantic link network model are implemented to measure association relations. A real dataset of 100 thousand images with social tags from Flickr is used in our experiments. Two evaluations, clustering and retrieval, show that the proposed method can measure the semantic relatedness between Flickr images accurately and robustly.
IEEE Transactions on Emerging Topics in Computing 09/2014; 2(3):376-387. DOI:10.1109/TETC.2014.2316525
ABSTRACT: The acquisition of deep textual semantics is a key issue that can significantly improve the performance of e-learning, web search, web knowledge services, etc. Although many models have been developed to acquire textual semantics, acquiring deep textual semantics remains a challenging issue. Here, an acquisition model of deep textual semantics is developed to enhance the capability of text understanding. It includes two parts: 1) how to obtain and organize the domain knowledge extracted from a text set, and 2) how to activate that domain knowledge to obtain deep textual semantics. The activation process involves the Gough reading model, the Landscape model, and the cognitive process of memory. The Gough model is the main human reading model that enables the authors to acquire deep semantics during the text reading process. A generalized semantic field is proposed to store the domain knowledge in the form of Long Term Memory (LTM). A specialized semantic field, acquired through the interaction between a text fragment and the domain knowledge, is introduced to describe how textual semantics change. Through their mutual actions, the authors can obtain the deep textual semantics that enhance the capability of text understanding; the machine can therefore understand text more precisely and correctly than models that obtain only surface textual semantics.
International Journal of Cognitive Informatics and Natural Intelligence 08/2014; 6(2):82-103. DOI:10.4018/jcini.2012040105
ABSTRACT: Web events, whose data constitute one kind of big data, have attracted considerable interest in recent years. However, most existing work fails to measure the veracity of web events. In this research, we propose an approach to measure the veracity of a web event via its uncertainty. First, the proposed approach mines several event features from the web event's data that may influence the measurement of uncertainty. Second, a computational model is introduced to simulate how these features influence the evolution of the web event. Third, matrix operations confirm that the result of the proposed iterative algorithm coincides with the computational model. Finally, experiments based on the above analysis show that the proposed uncertainty measuring algorithm is efficient and measures the veracity of web events from big data with high accuracy.
Journal of Systems and Software 07/2014; 102. DOI:10.1016/j.jss.2014.07.023 · 1.25 Impact Factor
ABSTRACT: In this paper, we study the problem of mining temporal semantic relations between entities. The goal is to mine and annotate a semantic relation with temporal, concise, and structured information, which can reveal the explicit, implicit, and diverse semantic relations between entities. Temporal semantic annotations can help users learn and understand unfamiliar or newly emerged semantic relations between entities. The proposed temporal semantic annotation structure integrates features from IEEE and Renlifang. We propose a general method to generate the temporal semantic annotation of a semantic relation between entities by constructing its connection entities, lexical syntactic patterns, context sentences, context graph, and context communities. Empirical experiments on two different datasets, a LinkedIn dataset and a movie star dataset, show that the proposed method is effective and accurate. Unlike manually generated annotation repositories such as Wikipedia and LinkedIn, the proposed method automatically mines the semantic relations between entities and does not need any prior knowledge such as an ontology or a hierarchical knowledge base. The proposed method can be applied in several settings, which demonstrates the usefulness of the proposed temporal semantic relations for many web mining tasks.
ABSTRACT: One of the most fundamental tasks in providing better Web services is the discovery of inter-word relations. However, the state of the art either acquires specific relations (e.g., causality) at the cost of much human effort, or is incapable of specifying relations in detail when no human effort is needed. In this paper, we propose a novel mechanism based on linguistics and cognitive psychology to automatically learn and specify association relations between words. The proposed mechanism, termed ALSAR, includes two major processes: the first learns association relations from the perspective of verb valency grammar in linguistics, and the second further labels/specifies the association relations with the help of related verbs. ALSAR is thus able to provide semantic descriptors that make inter-word relations more explicit without any human labeling. Furthermore, ALSAR incurs very low complexity, and experimental evaluations on Chinese news articles crawled from Baidu News demonstrate its good performance.
Web-Age Information Management, 06/2014: pages 578-589;
ABSTRACT: An explosive growth in the volume, velocity, and variety of the data available on the Internet has been witnessed recently. These data, originating from multiple types of sources including mobile devices, sensors, individual archives, social networks, the Internet of Things, enterprises, cameras, software logs, and health records, have led to one of the most challenging research issues of the big data era. In this paper, Knowle, an online news management system built upon the semantic link network model, is introduced. Knowle is a news-event-centric data management system. Its core elements are news events on the Web, which are linked by their semantic relations. Knowle is a hierarchical data system with three layers: the bottom layer (concepts), the middle layer (resources), and the top layer (events). The basic building blocks of the Knowle system, namely news collection, resource representation, semantic relation mining, and semantic linking of news events, are described. Knowle does not require data providers to follow semantic standards such as RDF or OWL; it is a semantics-rich, self-organized network. It reflects various semantic relations of concepts, news, and events. Moreover, in the case study, Knowle is used for organizing and mining health news, which shows its potential to form the basis for designing and developing a big-data-analytics-based innovation framework in the health domain.
ABSTRACT: The Association Link Network (ALN) is a kind of Semantic Link Network built by mining the association relations among multimedia Web resources to effectively support intelligent Web applications such as Web-based learning and semantic search. This paper explores the small-world properties of the ALN to provide theoretical support for association learning (i.e., the simple idea of "learning from Web resources"). First, a filtering algorithm for the ALN is proposed to generate its filtered status, in order to observe the small-world properties of the ALN at a given network size and filtering parameter. A comparison of the small-world properties of the ALN and a random graph shows that the ALN exhibits a prominent small-world characteristic. Then, we investigate the evolution of the small-world properties over time at several incremental network sizes. The average path length of the ALN scales with the network size, while its clustering coefficient is independent of the network size. We also find that the ALN has a smaller average path length and a higher clustering coefficient than the WWW at the same network size and average degree. Finally, based on the small-world characteristic of the ALN, we present an Association Learning Model (ALM), which can efficiently support association learning of Web resources in breadth or depth for learners.
World Wide Web 03/2014; DOI:10.1007/s11280-012-0171-7 · 1.62 Impact Factor
ABSTRACT: Relatedness measurement between multimedia items such as images and videos plays an important role in computer vision and is a basis for many multimedia applications, including clustering, searching, recommendation, and annotation. Recently, with the explosion of social media, users can upload media data and annotate content with descriptive tags. In this paper, we aim to measure the semantic relatedness of Flickr images. First, four information-theoretic functions are used to measure the semantic relatedness of tags. Second, an integration of tag pairs based on a bipartite graph is proposed to remove noise and redundancy. Third, the order information of tags is added to the relatedness measure, which emphasizes tags in high positions. A dataset of 1000 images from Flickr is used to evaluate the proposed method. Two data mining tasks, clustering and searching, performed with the proposed method show its effectiveness and robustness. Moreover, applications such as searching and faceted exploration built on the proposed method show that it has broad prospects for web-based tasks.
The Scientific World Journal 02/2014; 2014(4):758089. DOI:10.1155/2014/758089 · 1.73 Impact Factor
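The abstract above applies information-theoretic functions to tag relatedness. One common function of this kind is pointwise mutual information (PMI); the following is a minimal sketch of PMI over per-image tag lists (illustrative only; the paper uses four such functions, not necessarily this exact form, and the function name is an assumption):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_relatedness(tag_lists, t1, t2):
    """Pointwise mutual information between two tags, estimated from a
    collection of per-image tag lists: log p(t1, t2) / (p(t1) p(t2))."""
    n = len(tag_lists)
    single, pair = Counter(), Counter()
    for tags in tag_lists:
        s = set(tags)  # ignore repeated tags within one image
        single.update(s)
        pair.update(frozenset(p) for p in combinations(sorted(s), 2))
    p1 = single[t1] / n
    p2 = single[t2] / n
    p12 = pair[frozenset((t1, t2))] / n
    if p12 == 0:
        return float("-inf")  # never co-occur
    return math.log(p12 / (p1 * p2))
```

Positive values indicate tags that co-occur more often than independence would predict; tags that never co-occur score negative infinity.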
ABSTRACT: How to build a text knowledge representation model that carries rich knowledge, has flexible reasoning ability, and can be automatically constructed with low computational complexity is a fundamental challenge for reasoning-based knowledge services, especially with the rapid growth of web resources. Current text knowledge representation models either lose much knowledge (e.g., the vector space model (VSM)) or are computationally complex (e.g., latent Dirichlet allocation (LDA)); some of them cannot even be constructed automatically (e.g., the Web Ontology Language (OWL)). In this paper, a novel text knowledge representation model, the power series representation (PSR) model, which has low computational complexity in the knowledge construction process, is proposed to resolve the tension between carrying rich knowledge and automatic construction. First, a concept algebra of human concept learning is developed to represent text knowledge in the form of power series. Then, the degree-2 power series hypothesis is introduced to simplify the proposed PSR model, which can be automatically constructed with lower computational complexity and carries more knowledge than the VSM and LDA. After that, reasoning operations based on the degree-2 power series hypothesis are developed, which provide more flexible reasoning than OWL and LDA. Furthermore, experiments and comparisons with current knowledge representation models show that our model has better characteristics for representing text knowledge. Finally, a demo indicates that the PSR model has good prospects in web semantic search.
IEEE Transactions on Systems, Man, and Cybernetics: Systems 01/2014; 44(1):86-102. DOI:10.1109/TSMCC.2012.2231674 · 1.70 Impact Factor
ABSTRACT: On the web, numerous websites publish web pages covering the events occurring in society. Web event data satisfy the well-accepted attributes of big data: Volume, Velocity, Variety, and Value. As one great value of web event data, website preferences can help the followers of web events, e.g., people or organizations, select the proper websites for following the aspects of web events that interest them. However, the big volume, fast evolution, and multi-source, unstructured nature of the data together make mining website preferences very challenging. In this paper, website preference is first formally defined. Then, according to the hierarchical attribute of web event data, we propose a hierarchical network model to organize the big data of a web event from different organizations, areas, and nations at a given time stamp. With this hierarchical network structure in hand, two strategies are proposed to mine website preferences from web event data. The first, straightforward strategy uses the communities of the keyword-level network and the mapping relations between websites and keywords to unveil the value in them. Taking the whole hierarchical network structure into consideration, the second strategy proposes an iterative algorithm to refine the keyword communities of the first strategy. Finally, an evaluation criterion for website preferences is designed to compare the performance of the two strategies. Experimental results show that a proper combination of horizontal relations (each level's network) with vertical relations (the mapping relations between the three level networks) can extract more value from web event data and thus improve the efficiency of website preference mining.
Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering; 12/2013
ABSTRACT: Online popular events, which are constructed from news stories using the techniques of Topic Detection and Tracking (TDT), make it convenient for users to see what is going on through the Internet. Recently, the web has become an important provider and poster of event information due to its real-time, open, and dynamic features. However, it is difficult to detect events because of the huge scale and dynamics of the Internet. In this paper, we define the novel problem of investigating impact factors for event detection. We define five impact factors: the number of increased web pages, the number of increased keywords, the number of communities, the average clustering coefficient, and the average similarity of web pages. These five impact factors contain statistical and content information about an event. Empirical experiments on real datasets, including Google Zeitgeist and Google Trends, show that the number of web pages and the average clustering coefficient can be used to detect events. Strategies integrating the number of web pages and the average clustering coefficient are also employed. Evaluations on a real dataset show that the proposed function integrating these two factors can detect events efficiently and correctly.
Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering; 12/2013
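The average clustering coefficient used as an impact factor in the abstract above is a standard graph statistic. A minimal sketch of computing it (illustrative of the statistic itself, not the paper's detection pipeline):

```python
def avg_clustering(adj):
    """Average local clustering coefficient of an undirected graph given as
    a dict mapping each node to the set of its neighbours. A node's local
    coefficient is the fraction of its neighbour pairs that are linked."""
    coeffs = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # no neighbour pairs to close
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)
```

Intuitively, pages about a coherent event link densely among themselves, so a rising average clustering coefficient is one plausible signal of an emerging event, which matches the role it plays in the experiments above.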