LINKED DATA AND PAGERANK BASED
CLASSIFICATION
Michal Nykl, Karel Ježek
Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Pilsen
Univerzitni 22, 306 14 Pilsen, Czech Republic
Martin Dostal, Dalibor Fiala
NTIS, Faculty of Applied Sciences, University of West Bohemia, Pilsen
Univerzitni 8, 306 14 Pilsen, Czech Republic
This is a preprint version; the original version is:
NYKL, Michal, JEŽEK, Karel, DOSTAL, Martin and FIALA, Dalibor. Linked Data and PageRank based classification.
In: IADIS International Conference Theory and Practice in Modern Computing 2013 (part of MCCSIS 2013).
Praha: IADIS Press, 2013, pp. 61-64. ISBN: 978-972-8939-94-6.
ABSTRACT
In this article, we present a new approach to classification based on Linked Data and PageRank. Our research
is focused on classification methods that are enhanced by semantic information. The semantic information can be
obtained from an ontology or from Linked Data; DBpedia was used as the source of Linked Data in our case. The feature
selection method is semantically based, so the features can be recognized by non-professional users because they are
in a human-readable and understandable form. PageRank is used during the feature selection and generation phase to expand
basic features into more general representatives. This means that feature selection and processing is based on network
relations obtained from Linked Data. The resulting features can be used by standard classification algorithms. We present
promising preliminary results that show the easy applicability of this approach to different datasets.
KEYWORDS
Linked Data, PageRank, classification, feature selection
1. INTRODUCTION
Document classification is an important part of document management systems and other text processing
services. Today's methods are usually statistically oriented, which means that a large amount of data is required for
the training phase of the classification algorithms. The preparation of sufficient classification training sets and
proper feature selection methods is a challenging task even for domain specialists. The common solution to
this problem is therefore based on relatively comprehensive corpora that contain many documents divided into
different classification classes. Statistical methods then try to discover relations between terms
and classification classes during the training phase.
Our approach recognizes interesting keywords and expands them using semantic information obtained
from Linked Data. For example, the feature MySQL can be expanded into the more general feature databases
even though the latter word does not explicitly occur in the document content. The classification phase can process these
parent concepts and use them for the correct pairing of documents and classification classes.
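The mapping of features to Linked Data nodes and the choice of relations to follow are described in Section 3. As an illustration only, the following Python sketch shows one possible way to obtain such parent concepts from the public DBpedia SPARQL endpoint via the dct:subject and skos:broader predicates. The endpoint, the predicates, and the SPARQLWrapper library are illustrative assumptions; in our experiments a local copy of the Linked Data is used instead.

# Illustrative sketch: expanding the feature "MySQL" into more general
# DBpedia concepts via the assumed dct:subject / skos:broader predicates.
from SPARQLWrapper import SPARQLWrapper, JSON

def expand_feature(resource_uri):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dct:  <http://purl.org/dc/terms/>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT DISTINCT ?broader WHERE {{
            <{resource_uri}> dct:subject ?category .
            ?category skos:broader ?broader .
        }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    # Return the URIs of the more general (parent) concepts.
    return [row["broader"]["value"] for row in results["results"]["bindings"]]

print(expand_feature("http://dbpedia.org/resource/MySQL"))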
Next, we explain the basic principles of Linked Data and introduce the variant of PageRank that
will be used in the following sections. Related work is discussed in Section 2 and our approach to feature selection is
described in Section 3. A preliminary evaluation of our method was performed with the 20 Newsgroups dataset;
the results are presented in Section 4.
1.1 Linked Data and PageRank
The concept of Linked Data was first introduced by Tim Berners-Lee (Berners-Lee, 2006). He formulated
four rules for machine readable content on the Web:
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information using the standards (RDF*, SPARQL).
4. Include links to other URIs so that they can discover more things.
More specific is the idea of Linked Open Data, which is based on the presumption of freely published
data without restrictions on usage or additional fees.
The PageRank algorithm was developed in 1998 by Page and Brin (Brin, 1998) as an approach to ranking web
pages by exploring the hyperlink structure of the Web. The importance of each web page
depends on the number and the PageRank values of all web pages that link to it. This approach, also known
as the "random surfer" model, has been studied and adapted for citation analysis (Ma, 2008). Our modified
version of PageRank (1) corresponds to its matrix definition (Langville, 2006), where P_x(a) is the value of node
a in iteration x, d is a damping factor usually set to 0.85, V is the set of all nodes in the graph, U is the set of
nodes with a link to node a, D is the set of all dangling nodes, and w_ij is the weight of the link from node i to node j.

P_x(a) = \frac{1-d}{|V|} + d \left( \sum_{i \in U} \frac{w_{ia}}{\sum_{j \in V} w_{ij}} P_{x-1}(i) + \sum_{i \in D} \frac{P_{x-1}(i)}{|V|} \right)    (1)
2. PREVIOUS WORK
Document classification can be defined as the content-based assignment of one or more predefined categories
(classification classes) to documents. We can distinguish two phases in document classification,
the learning phase and the classification phase. In the learning phase, the user defines categories by giving training
documents for each of them. The quality improves with an increasing number of training documents.
This is a weak point of document classification because a solid training collection is required.
Many supervised learning techniques have been used for document classification. These
include Naive Bayes, k-nearest neighbours, vector space approaches such as Rocchio, support vector machines,
boosting (Schapire, 1999), rule learning algorithms (Cohen, 1996), maximum entropy, and latent semantic
analysis.
DBpedia was used as the source of the Linked Data presented in this article. We use a local copy of the Linked
Data stored in our relational database for performance reasons, but a SPARQL endpoint could also be used.
DBpedia is a semantically enriched Wikipedia that has previously been employed successfully for computing the
semantic relatedness of documents. WikiRelate! (Strube, 2006) combines path-based measures, information-content-based
measures, and text-overlap-based measures. Explicit Semantic Analysis (Gabrilovich, 2007)
uses machine learning techniques to explicitly represent the meaning of a text as a weighted vector
of Wikipedia-based concepts.
Another approach to document classification (Wang, 2005) proposed a term graph model as an improved
version of the vector space model. The aim of this model is to represent the content of a document together with
the relationships between keywords. The model makes it possible to define similarity functions and a PageRank-style
algorithm. Vectors of PageRank scores were created for each document, and rank correlation and
term distance were used as similarity measures to assign documents to classification classes. An alternative
approach to document classification uses hypernyms and other directly related concepts (Bloehdorn, 2004;
Ramakrishnan, 2003). The next step in document classification can be characterized as feature expansion with
additional semantic information from an ontology (De Melo, 2007). This approach exploits external knowledge
to map terms to regions of concepts; a graph traversal algorithm is used to explore related concepts.
3. FEATURE SELECTION
Feature selection is the most important part of our approach. The method and its connection with Linked
Data and PageRank consist of the following steps:
1. Basic features are selected from the documents on the basis of TF-IDF (other methods, e.g. χ², can be
used too).
2. The features of each document are mapped to Linked Data nodes (identified by URIs).
The mapping is based on full or partial agreement between a feature (term) and the name of the node
in the corresponding language. This forms the first version of the graph.
3. A one-step expansion of the graph by Linked Data is executed. The graph expansion is illustrated in
Fig. 1, where the original (basic) nodes are marked as I and the expanded nodes are marked as II.
Weights are assigned to all links; the weighting variants are described below (Fig. 2).
4. The PageRank algorithm is applied to this graph.
5. If at least one of the most recently added nodes has a higher PageRank score than any of the older nodes,
we continue with step 3; otherwise, the nodes with the highest scores represent the selected features.
A sketch of this expansion loop is given after Fig. 1.
Figure 1. Graph expansion with PageRank
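The following Python sketch illustrates steps 3 to 5 of this procedure. The functions get_neighbours and edge_weight are hypothetical placeholders for the Linked Data lookup and for one of the weighting variants of Fig. 2, pagerank refers to the sketch of equation (1) in Section 1.1, and the number of selected features top_k is likewise an assumed parameter; this is an illustration of the loop, not our actual implementation.

def select_features(basic_nodes, get_neighbours, edge_weight, top_k=10):
    """Sketch of the iterative graph expansion and feature selection.
    basic_nodes:    Linked Data node URIs obtained in steps 1 and 2.
    get_neighbours: placeholder returning URIs one step away in Linked Data.
    edge_weight:    placeholder implementing a weighting variant of Fig. 2."""
    # Variants b) and c) of Fig. 2 would additionally add a self-citation
    # edge graph[v][v] = edge_weight(v, v, 0) for every basic node.
    graph = {v: {} for v in basic_nodes}        # node -> {target: weight}
    distance = {v: 0 for v in basic_nodes}      # path distance from the basic nodes
    frontier, older = set(basic_nodes), set()
    while True:
        older |= set(graph)                     # nodes known before this expansion
        new_nodes = set()
        for v in frontier:                      # step 3: one-step expansion
            for u in get_neighbours(v):
                if u not in graph:
                    graph[u] = {}
                    distance[u] = distance[v] + 1
                    new_nodes.add(u)
                graph[v][u] = edge_weight(v, u, distance[u])
        scores = pagerank(graph)                # step 4
        best_new = max((scores[u] for u in new_nodes), default=0.0)
        if new_nodes and best_new > max(scores[v] for v in older):
            frontier = new_nodes                # step 5: a new node won, expand again
        else:                                   # step 5: stop, keep the best-scoring nodes
            return sorted(graph, key=scores.get, reverse=True)[:top_k]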
We have investigated three options for the initialization of the PageRank algorithm (see Fig. 2), where the
steps of node expansion in the graph are marked with I, II and III:
a) All edges are assigned the same weight equal to 1.
b) The basic nodes are advantaged with a self-citation edge with an increased weight.
c) The basic nodes are advantaged with a self-citation edge, and the edges to the new nodes are
penalized based on the squared path distance from the basic node.
Our evaluation of these three possibilities shows that variant c) achieves the best results due to its
effective limitation of the expansion of the basic nodes. The other variants require an explicit limiting criterion for
the graph expansion. In the next section, this variant is used for the evaluation of our feature selection approach.
One hypothetical reading of these weighting variants is sketched after Fig. 2.
Figure 2. Initialization of the PageRank
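The exact edge weights are not fixed above; as one hypothetical reading of variants a) to c), the following functions match the edge_weight parameter of the previous sketch, with SELF_WEIGHT an assumed constant for the advantaging self-citation edge.

SELF_WEIGHT = 2.0                               # assumed weight of the self-citation edge

def weight_a(src, dst, distance):
    return 1.0                                  # a) every edge has weight 1

def weight_b(src, dst, distance):
    if src == dst and distance == 0:            # b) self-citation edge of a basic node
        return SELF_WEIGHT
    return 1.0

def weight_c(src, dst, distance):
    if src == dst and distance == 0:            # c) self-citation edge of a basic node...
        return SELF_WEIGHT
    return 1.0 / max(distance, 1) ** 2          # ...and quadratic penalty by path distance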
4. RESULTS
For evaluation purposes, a subset of the 20 Newsgroups collection was used. The reason is that our source of
Linked Data was not sufficient to distinguish between similar categories in the 20 Newsgroups collection, such as
comp.os.ms-windows.misc and comp.windows.x. Another problem was overtraining, which occurs at
approximately 100 training documents per category.
In the future, our approach will be based entirely on path lengths in the graph of Linked Data nodes. For the
preliminary results and evaluation presented here, however, we decided to compare our feature selection method and a
standard statistical approach with the same vector space classification algorithm (Rocchio). The comparison
(see Fig. 3) is done with the macro-averaged F1 (β = 1) measure. The number of testing documents is
20% of the number of training documents.
Figure 3. F1 measure for the Rocchio classification algorithm
5. CONCLUSION
Our method for document classification with Linked Data is promising especially in tasks with inadequate
training sets or for the quick filtering of existing documents. In those cases, the training phase could be very
expensive and an inappropriate waste of the user's time. Our method allows categories to be defined
using only a single node from Linked Data, with automatic expansion on both sides: the category definition and the
feature selection. In the future, we would like to eliminate the overtraining problem and we would like to
create a solid method for document classification based directly on the graph analysis.
This work was supported by the grants GAČR P103/11/1489 and NTIS CZ.1.05/1.1.00/02.0090.
REFERENCES
Berners-Lee, T., 2006. URL: http://www.w3.org/DesignIssues/LinkedData.html, date: 2006-07-27, cited: 2013/01/12.
Bloehdorn, S. and Hotho, A., 2004. Boosting for Text Classification with Semantic Features. WebKDD'04, Seattle, USA.
Brin, S. and Page, L., 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and
ISDN Systems, Vol. 30, Issues 1-7, pp. 107-117.
Cohen, W. and Singer, Y., 1996. Context-sensitive learning methods for text categorization. ACM SIGIR '96.
De Melo, G. and Siersdorfer, S., 2007. Multilingual text classification using ontologies. 29th European Conference on IR Research (ECIR 2007), Rome, Italy.
Gabrilovich, E. and Markovitch, S., 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. IJCAI'07, Hyderabad, India.
Langville, A.N. and Meyer, C.D., 2006. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, USA.
Ma, N. et al., 2008. Bringing PageRank to the citation analysis. Information Processing & Management, Vol. 44, pp. 800-810.
Schapire R. and Singer Y., 1999. BoosTexter: A boosting-based system for text categorization. Machine Learning.
Strube, M. and Ponzetto, S.P., 2006. WikiRelate! Computing semantic relatedness using Wikipedia. AAAI’06, USA.
Wang, W. et al, 2005. Term Graph Model for Text Classification. ADMA’05. Wuhan, China, pp. 19-30.