Article

Fuzzy Transductive Support Vector Machines for Hypertext Classification


Abstract

A method to assign fuzzy labels to unlabeled hypertext documents based on hyperlink structure information is first proposed. Then, the construction of the fuzzy transductive support vector machines is described, and an algorithm to train the fuzzy transductive support vector machines is presented. While the transductive support vector machines treat all test examples equally, the fuzzy transductive support vector machines treat test examples discriminatively according to their fuzzy labels, which yields a more reliable decision function. Experimental results on the WebKB corpus show that, by fusing the plain text information and the hyperlink structure information, much better classification performance can be achieved.
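As a rough illustration only (the paper's exact formulation is not reproduced here), a fuzzy-weighted TSVM objective can be sketched by attaching a membership weight s_j in (0, 1], derived from the hyperlink structure, to the slack variable of each test example; setting every s_j = 1 recovers the standard TSVM in which all test examples are treated equally:

```latex
\min_{\mathbf{w},\,b,\,y_1^{*},\dots,y_m^{*}}\;
\frac{1}{2}\lVert\mathbf{w}\rVert^{2}
+ C\sum_{i=1}^{n}\xi_i
+ C^{*}\sum_{j=1}^{m} s_j\,\xi_j^{*}
\quad\text{subject to}\quad
y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\ge 1-\xi_i,\;\;
y_j^{*}(\mathbf{w}\cdot\mathbf{x}_j^{*} + b)\ge 1-\xi_j^{*},\;\;
\xi_i,\ \xi_j^{*}\ge 0 .
```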


... In reality, competitive analysis is used in many problems, especially in network optimization. In Hong Liu's paper,11 the author proposed a method to assign fuzzy labels to unlabeled hypertext documents and presented an algorithm to train the fuzzy transductive support vector machine. More results concerning applications of fuzzy theory can be found in Refs. ...
Article
Full-text available
Based on some results of fuzzy network computation and competitive analysis, the Online Fuzzy Most Reliable Path Problem (OFRP), one of the most important problems in network optimization under uncertainty, was originally proposed by our team. In this paper, the preliminaries on fuzzy theory, the most reliable path, and competitive analysis are given first. Following that, the mathematical model of the OFRP, in which two kinds of uncertainty, online and fuzzy, are considered at the same time, is established. Then some online fuzzy algorithms are developed to address the OFRP, and rigorous competitive analyses are given in detail. Finally, some possible research directions for the OFRP are discussed and conclusions are drawn.
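For context, the most reliable path itself can be computed offline with a Dijkstra-style search that maximizes the product of edge reliabilities. The sketch below assumes reliabilities in (0, 1] stored in a plain adjacency dictionary and ignores the online and fuzzy aspects that define the OFRP:

```python
import heapq

def most_reliable_path(graph, source, target):
    """Offline most-reliable-path sketch: a Dijkstra variant that maximizes
    the product of edge reliabilities in (0, 1].  The online/fuzzy aspects
    of OFRP are not modeled; 'graph' maps node -> [(neighbor, reliability)]."""
    best = {source: 1.0}                 # best known path reliability to each node
    heap = [(-1.0, source, [source])]    # max-heap via negated reliability
    while heap:
        neg_rel, node, path = heapq.heappop(heap)
        rel = -neg_rel
        if node == target:
            return rel, path
        if rel < best.get(node, 0.0):    # stale heap entry
            continue
        for nbr, edge_rel in graph.get(node, []):
            cand = rel * edge_rel
            if cand > best.get(nbr, 0.0):
                best[nbr] = cand
                heapq.heappush(heap, (-cand, nbr, path + [nbr]))
    return 0.0, []                       # target unreachable

# Example with hypothetical reliabilities:
# g = {'a': [('b', 0.9), ('c', 0.5)], 'b': [('c', 0.8)], 'c': []}
# most_reliable_path(g, 'a', 'c')  ->  (0.72, ['a', 'b', 'c'])
```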
... SVM also offers a sound theoretical framework for statistical learning in the presence of small samples, and has been successfully applied to many practical problems. [6][7][8][9][10] However, SLT is constructed on a probability measure over real-valued random samples. It therefore becomes difficult to apply the theory to statistical learning problems based on non-real-valued random samples on non-probability measure spaces, as encountered in the real world. ...
Article
Full-text available
In order to deal with the learning problems of random set samples encountered in the real world, a new support vector machine based on random set samples is constructed using random set theory and convex quadratic programming. Experimental results show that the new support vector machine is feasible and effective.
Article
Transductive support vector machine (TSVM) is a well-known algorithm that realizes transductive learning in the field of support vector classification. This paper constructs a bi-fuzzy progressive transductive support vector machine (BFPTSVM) algorithm by combining the proposed notion of bi-fuzzy memberships for the temporarily labeled samples that appear in the progressive learning process with a sample-pruning strategy, which reduces the computational complexity and memory requirements of the algorithm. Simulation experiments show that the BFPTSVM algorithm achieves better classification performance and converges more rapidly and stably than the other learning algorithms.
Conference Paper
This paper develops a fast and accurate algorithm for training transductive SVM classifiers, which utilizes the classification information of unlabeled data in a progressive way. To further improve generalization accuracy, we employ three important criteria to enhance the algorithm: confidence evaluation, suppression of labeled data, and stopping with stabilization. Experimental results on several real-world datasets confirm the effectiveness of these criteria and show that the new algorithm can reach accuracy comparable to several state-of-the-art approaches for training transductive SVMs in much less training time.
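A heavily simplified sketch of such a progressive transductive loop is given below, assuming dense feature matrices, +1/-1 labels, and scikit-learn's LinearSVC as a stand-in for the paper's solver; only the confidence-evaluation and stopping-with-stabilization criteria are approximated:

```python
import numpy as np
from sklearn.svm import LinearSVC

def progressive_tsvm(X_l, y_l, X_u, batch=10, max_rounds=50):
    """Simplified progressive transductive learning: each round, the current
    SVM labels only the unlabeled examples it is most confident about
    (confidence evaluation), those are added to the training set, and the
    process stops when the decision values stabilize or nothing is left.
    The suppression-of-labeled-data criterion is not modeled here."""
    X_train, y_train = np.array(X_l, dtype=float), np.array(y_l, dtype=float)
    remaining = np.arange(len(X_u))
    prev_scores = None
    for _ in range(max_rounds):
        clf = LinearSVC().fit(X_train, y_train)
        scores = clf.decision_function(X_u)
        if prev_scores is not None and np.allclose(scores, prev_scores, atol=1e-3):
            break                                    # stopping with stabilization
        prev_scores = scores
        if len(remaining) == 0:
            break
        conf_order = remaining[np.argsort(-np.abs(scores[remaining]))]
        picked = conf_order[:batch]                  # most confident unlabeled examples
        X_train = np.vstack([X_train, X_u[picked]])
        y_train = np.concatenate([y_train, np.sign(scores[picked])])
        remaining = np.setdiff1d(remaining, picked)
    return np.sign(clf.decision_function(X_u))
```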
Conference Paper
While the transductive support vector machine (TSVM) utilizes the information carried by unlabeled samples for classification and achieves better classification performance than the support vector machine (SVM), the number of positive samples must be specified before training and cannot be changed during the training phase. In this paper, a sequential minimal transductive support vector machine (SMTSVM) is discussed to overcome this deficiency in the TSVM. It solves the problem of estimating the penalty value after changing a temporary label by introducing a sequential minimal approach. The experimental results show that SMTSVM is very promising.
Article
The transductive support vector machine (TSVM) is the transductive inference counterpart of the support vector machine. The TSVM utilizes the information carried by unlabeled samples for classification and achieves better classification performance than the regular support vector machine (SVM). As effective as the TSVM is, it still has an obvious deficiency: the number of positive samples must be specified before training and is not changed during the training phase. This deficiency is caused by the pair-wise exchanging criterion used in the TSVM. In this paper, we propose a new transductive training algorithm that substitutes an individually judging and changing criterion for the pair-wise exchanging criterion. Experimental results show that the new method removes the need to specify the number of positive samples beforehand and improves the adaptability of the TSVM.
Article
Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms has recently been proposed for solving S3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3VM algorithms are studied together under a common experimental setting.
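For reference, the non-convex S3VM objective that these algorithms attack is commonly written with a hinge loss on the L labeled examples and a symmetric "hat" loss on the U unlabeled examples (generic notation, not any one paper's):

```latex
\min_{\mathbf{w},\,b}\;
\frac{1}{2}\lVert\mathbf{w}\rVert^{2}
+ C \sum_{i=1}^{L} \max\bigl(0,\; 1 - y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr)
+ C^{*} \sum_{j=1}^{U} \max\bigl(0,\; 1 - \lvert \mathbf{w}\cdot\mathbf{x}_j + b \rvert\bigr)
```

The last term pushes unlabeled points away from the margin band regardless of which side they fall on, and this is the source of the non-convexity.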
Article
Full-text available
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
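A minimal sketch of the machine-learning pipeline the survey covers, namely document representation, classifier construction, and classifier evaluation, might look as follows; the toy corpus, labels, and choice of scikit-learn components are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; real experiments use benchmark corpora such as Reuters.
docs   = ["faculty member home page", "course syllabus and lecture notes",
          "research project funded by nsf", "course schedule and homework"]
labels = ["faculty", "course", "project", "course"]

vectorizer = TfidfVectorizer()            # document representation (TF-IDF vectors)
X = vectorizer.fit_transform(docs)

clf = LinearSVC()                         # classifier construction (linear SVM)
clf.fit(X, labels)

# Classifier evaluation would normally use a held-out test set with measures such
# as precision, recall, or F1; here we simply classify one new document.
print(clf.predict(vectorizer.transform(["homework for the database course"])))
```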
Book
Contents: setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Article
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
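A rough sketch of this EM procedure, using scikit-learn's multinomial naive Bayes and hard labels in place of the full class posterior, could look like the following; the weighting factor for unlabeled documents is exposed as a parameter, and the multiple-mixture-components extension is omitted:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(labeled_docs, labels, unlabeled_docs, unlabeled_weight=0.5, n_iter=5):
    """Sketch of EM + naive Bayes for semi-supervised text classification.
    unlabeled_weight plays the role of the weighting factor that modulates the
    contribution of unlabeled data.  For brevity the E-step uses hard argmax
    labels instead of the full class posterior."""
    vec = CountVectorizer()
    X = vec.fit_transform(list(labeled_docs) + list(unlabeled_docs))  # shared vocabulary
    Xl, Xu = X[:len(labeled_docs)], X[len(labeled_docs):]
    y = np.asarray(labels)

    clf = MultinomialNB().fit(Xl, y)                  # train on labeled documents only
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled documents
        yu = clf.classes_[clf.predict_proba(Xu).argmax(axis=1)]
        # M-step: retrain on all documents, down-weighting the unlabeled ones
        w = np.concatenate([np.ones(Xl.shape[0]),
                            np.full(Xu.shape[0], unlabeled_weight)])
        clf = MultinomialNB().fit(vstack([Xl, Xu]),
                                  np.concatenate([y, yu]),
                                  sample_weight=w)
    return clf
```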
Chapter
We present a new approach to learning hypertext classifiers that combines a statistical text-learning method with a relational rule learner. This approach is well suited to learning in hypertext domains because its statistical component allows it to characterize text in terms of word frequencies, whereas its relational component is able to describe how neighboring documents are related to each other by hyperlinks that connect them. We evaluate our approach by applying it to tasks that involve learning definitions for (i) classes of pages; (ii) particular relations that exist between pairs of pages, and (iii) locating a particular class of information in the internal structure of pages. Our experiments demonstrate that this new approach is able to learn more accurate classifiers than either of its constituent methods alone.
Article
Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related Web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our results show that the identification of hypertext regularities in the data and the selection of appropriate representations for hypertext in particular domains are crucial, but seldom obvious, in real-world problems. We find that adding the words in the linked neighborhood to the page having those links (both in-links and out-links) was helpful for all our classifiers on one dataset, but more harmful than helpful for two out of the three classifiers on the remaining datasets. We also observed that extracting meta data from related Web sites was extremely useful for improving classification accuracy in some of those domains. Finally, the relative performance of the classifiers being tested provided insights into their strengths and limitations for solving classification problems involving diverse and often noisy Web pages.
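One of the representations examined here, appending the words of a page's linked neighborhood (in-links and out-links) to the page's own text, can be sketched as follows; the dictionary-of-pages and set-of-links data structures are illustrative, not the paper's actual format:

```python
def augment_with_neighbors(pages, links):
    """Append the words of each page's linked neighborhood (both in-links and
    out-links) to the page's own text.  'pages' maps url -> text and 'links'
    is an iterable of (src, dst) pairs; both are hypothetical structures."""
    neighbors = {url: set() for url in pages}
    for src, dst in links:
        if src in neighbors and dst in pages:
            neighbors[src].add(dst)      # dst is an out-link neighbor of src
        if dst in neighbors and src in pages:
            neighbors[dst].add(src)      # src is an in-link neighbor of dst
    return {url: text + " " + " ".join(pages[n] for n in neighbors[url])
            for url, text in pages.items()}
```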
Conference Paper
The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs: an ontology defining the classes and relations of interest, and a set of training data consisting of labeled regions of hypertext representing instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This paper describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system.
Article
We describe a method for improving the classification of short text strings using a combination of labeled training data plus a secondary corpus of unlabeled but related longer documents. We show that such unlabeled background knowledge can greatly decrease error rates, particularly if the number of examples or the size of the strings in the training set is small. This is particularly useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular problem on the World Wide Web. Our approach views the task as one of information integration using WHIRL, a tool that combines database functionalities with techniques from the information-retrieval literature.
Article
Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the training set alone. For example, we have previously shown how FOIL, a relational learner, can learn to classify Web pages by discovering training set regularities in the words occurring on target pages, and on other pages related by hyperlinks. Here we show how the classification accuracy of FOIL on this task can be improved by discovering additional regularities on the test set pages that must be classified. Our approach can be seen as an extension to Kleinberg's Hubs and Authorities algorithm that analyzes hyperlink relations among Web pages. We present evidence that this new algorithm ...
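For reference, the Hubs and Authorities (HITS) iteration that this work builds on can be sketched as a simple power iteration over the link matrix; the dense adjacency-matrix representation below is an assumption for illustration:

```python
import numpy as np

def hits(adjacency, n_iter=50):
    """Kleinberg's Hubs and Authorities iteration.  'adjacency' is a dense 0/1
    numpy array with adjacency[i, j] = 1 when page i links to page j."""
    n = adjacency.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(n_iter):
        auths = adjacency.T @ hubs       # authoritative pages are pointed to by good hubs
        hubs = adjacency @ auths         # good hubs point to authoritative pages
        auths /= np.linalg.norm(auths) or 1.0
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs, auths
```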
Article
We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be sufficient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment a much smaller set of labeled examples. Specifically, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm's predictions on new unlabeled examples are used to enlarge the training set of the other.
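A compact sketch of the co-training loop described here is shown below; the two-view split, the classifier interface (scikit-learn-style predict_proba), and the growth rate k are all illustrative assumptions:

```python
import numpy as np

def co_train(view1, view2, clf1, clf2, labeled_idx, labels, unlabeled_idx,
             rounds=10, k=1):
    """Co-training sketch: two classifiers are trained on two separate views
    (e.g. page words vs. anchor-text words), and in each round every classifier
    labels its most confident unlabeled example(s) to enlarge the shared
    training pool.  Views are dense feature matrices indexed by example."""
    pool = dict(zip(labeled_idx, labels))            # example index -> label
    unlabeled = list(unlabeled_idx)
    for _ in range(rounds):
        idx = np.array(sorted(pool))
        y = np.array([pool[i] for i in idx])
        clf1.fit(view1[idx], y)
        clf2.fit(view2[idx], y)
        for clf, view in ((clf1, view1), (clf2, view2)):
            if not unlabeled:
                return clf1, clf2
            probs = clf.predict_proba(view[unlabeled])
            top = np.argsort(probs.max(axis=1))[-k:]           # most confident examples
            for t in top:
                pool[unlabeled[t]] = clf.classes_[probs[t].argmax()]
            unlabeled = [u for j, u in enumerate(unlabeled) if j not in set(top)]
    return clf1, clf2
```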
Article
This paper introduces Transductive Support Vector Machines (TSVMs) for text classification. While regular Support Vector Machines (SVMs) try to induce a general decision function for a learning task, Transductive Support Vector Machines take into account a particular test set and try to minimize misclassifications of just those particular examples. The paper presents an analysis of why TSVMs are well suited for text classification. These theoretical findings are supported by experiments on three test collections. The experiments show substantial improvements over inductive methods, especially for small training sets, cutting the number of labeled training examples down to a twentieth on some tasks. This work also proposes an algorithm for training TSVMs efficiently, handling 10,000 examples and more.
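A rough, simplified sketch of this TSVM training scheme, namely initializing test labels with an inductive SVM and then alternating retraining with pair-wise label switching while ramping up the weight of the test examples, is given below; LinearSVC with per-sample weights stands in for the paper's specialized solver, and labels are assumed to be +1/-1:

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_label_switching(X_l, y_l, X_u, num_pos, C=1.0, Cstar=0.1, Cstar_max=1.0):
    """TSVM training sketch: the num_pos highest-scoring test examples are
    initialized to +1 and the rest to -1; pairs of test labels that jointly
    violate the margin are swapped between retrainings while the weight Cstar
    of the test examples is gradually increased."""
    clf = LinearSVC(C=C).fit(X_l, y_l)
    order = np.argsort(clf.decision_function(X_u))
    yu = np.full(len(X_u), -1.0)
    yu[order[len(order) - num_pos:]] = 1.0            # initial labeling of test examples
    X = np.vstack([X_l, X_u])
    while Cstar <= Cstar_max:
        w = np.concatenate([np.ones(len(X_l)), np.full(len(X_u), Cstar)])
        for _ in range(100):                          # inner label-switching loop
            clf = LinearSVC(C=C).fit(X, np.concatenate([y_l, yu]), sample_weight=w)
            slack = np.maximum(0.0, 1.0 - yu * clf.decision_function(X_u))
            pos = np.where((yu > 0) & (slack > 0))[0]
            neg = np.where((yu < 0) & (slack > 0))[0]
            if len(pos) == 0 or len(neg) == 0:
                break
            m, l = pos[np.argmax(slack[pos])], neg[np.argmax(slack[neg])]
            if slack[m] + slack[l] <= 2.0:            # pair-wise exchange criterion
                break
            yu[m], yu[l] = -1.0, 1.0                  # swap the two test labels
        Cstar *= 2.0
    return yu
```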