Article

Evaluation of Folksonomy Induction Algorithms


Abstract

Algorithms for constructing hierarchical structures from user-generated metadata have caught the interest of the academic community in recent years. In social tagging systems, the output of these algorithms is usually referred to as folksonomies (from folk-generated taxonomies). Evaluation of folksonomies and folksonomy induction algorithms is a challenging issue complicated by the lack of golden standards, lack of comprehensive methods and tools as well as a lack of research and empirical/simulation studies applying these methods. In this article, we report results from a broad comparative study of state-of-the-art folksonomy induction algorithms that we have applied and evaluated in the context of five social tagging systems. In addition to adopting semantic evaluation techniques, we present and adopt a new technique that can be used to evaluate the usefulness of folksonomies for navigation. Our work sheds new light on the properties and characteristics of state-of-the-art folksonomy induction algorithms and introduces a new pragmatic approach to folksonomy evaluation, while at the same time identifying some important limitations and challenges of folksonomy evaluation. Our results show that folksonomy induction algorithms specifically developed to capture intuitions of social tagging systems outperform traditional hierarchical clustering techniques. To the best of our knowledge, this work represents the largest and most comprehensive evaluation study of state-of-the-art folksonomy induction algorithms to date.

... This prevents the use of tags to efficiently support semantic-based navigation, information retrieval, and recommendation. Thus many studies have attempted to explore methods to infer structured knowledge, such as relations, concept hierarchies, and lightweight ontologies, from tags [58,124,165]. Limitations of the current methods motivate learning more accurate and useful structured knowledge from social tagging data [48]. ...
... The result of personal free tagging of information and objects (anything with a URL) for one's own retrieval, which provides empirical material to elicit semantics and to learn structured knowledge. [58,124,135,164,165,184] ...
... Studies have thus identified folksonomies as a source for eliciting semantic relations and developing structured knowledge [58,165], or even as a lightweight ontology [124]. The rich metadata in folksonomies, however, are accompanied by some inherent issues, as with other user-generated social media data. ...
Thesis
Knowledge has long been a crucial element in Artificial Intelligence (AI), which can be traced back to knowledge-based systems, or expert systems, in the 1960s. Knowledge provides contexts to facilitate machine understanding and improves the explainability and performance of many semantic-based applications. The acquisition of knowledge is, however, a complex step, normally requiring much effort and time from domain experts. In machine learning, one key domain of AI, the learning and leveraging of structured knowledge, such as ontologies and knowledge graphs, have become popular in recent years with the advent of massive user-generated social media data. The main hypothesis in this thesis is therefore that a substantial amount of useful knowledge can be derived from user-generated social media data. A popular, common type of social media data is social tagging data, accumulated from users' tagging in social media platforms. Social tagging data exhibit unstructured characteristics, including noisiness, flatness, sparsity, and incompleteness, which hinder efficient knowledge discovery and usage. The aim of this thesis is thus to learn useful structured knowledge from social media data with regard to these unstructured characteristics. Several research questions have then been formulated related to the hypothesis and the research challenges. A knowledge-centred view has been considered throughout this thesis: knowledge bridges the gap between massive user-generated data and semantic-based applications. The study first reviews concepts related to structured knowledge, then focuses on two main parts, learning structured knowledge and leveraging structured knowledge from social tagging data. To learn structured knowledge, a machine learning system is proposed to predict subsumption relations from social tags.
The main idea is to learn to predict accurate relations with features generated with probabilistic topic modelling and founded on a formal set of assumptions on deriving subsumption relations. Tag concept hierarchies can then be organised to enrich existing Knowledge Bases (KBs), such as DBpedia and the ACM Computing Classification System. The study presents relation-level evaluation, ontology-level evaluation, and a novel Knowledge Base Enrichment based evaluation, and shows that the proposed approach can generate high quality and meaningful hierarchies to enrich existing KBs. To leverage structured knowledge of tags, the research focuses on the task of automated social annotation and proposes a knowledge-enhanced deep learning model. Semantic-based loss regularisation has been proposed to enhance the deep learning model with the similarity and subsumption relations between tags. In addition, a novel guided attention mechanism has been proposed to mimic the users' behaviour of reading the title before digesting the content for annotation. The integrated model, Joint Multi-label Attention Network (JMAN), significantly outperformed the state-of-the-art, popular baseline methods, with consistent performance gain of the semantic-based loss regularisers on several deep learning models, on four real-world datasets. With the careful treatment of the unstructured characteristics and with the novel probabilistic and neural network based approaches, useful knowledge can be learned from user-generated social media data and leveraged to support semantic-based applications. This validates the hypothesis of the research and addresses the research questions. Future studies are considered to explore methods to efficiently learn and leverage other various types of structured knowledge and to extend current approaches to other user-generated data.
... Research in this line also leverages co-occurrence features [40] and usually relies on specific contents from the tagged resources [49]. This work aims to address the two major issues in the existing research: first, inference is primarily done through analysis of the tag co-occurrence and largely overlooks the complex meanings of tags, which often leads to low prediction accuracy; second, existing evaluation studies regarding the quality of the knowledge discovered from large-scale datasets are not adequate, e.g., the study in [42] did not formally evaluate the enriched knowledge. Our study focuses on learning from academic social tagging data, i.e., tagging data for academic publications and resources. ...
... To the best of our knowledge, this is one of the largest and most systematic evaluation studies for relation learning from academic social data (cf. [42]); this is also the first study focusing on enriching large-scale KBs. The proposed method outperforms the state of the art in terms of F1 score and taxonomic similarity measures when evaluated against gold standard KBs, and is further validated through human evaluation of the KB enrichment. ...
... The study in [6] extended this approach with sense disambiguation and applied betweenness centrality on a tag-tag co-occurrence network. The work in [42] evaluated both methods proposed in [6,26] and validated the usefulness of graph centrality in creating taxonomies from tags. This class of methods heavily relies on co-occurrence information and may not derive accurate subsumption relations [21]. ...
Article
There has been considerable interest in transforming unstructured social tagging data into structured knowledge for semantic-based retrieval and recommendation. Research in this line mostly exploits data co-occurrence and often overlooks the complex and ambiguous meanings of tags. Furthermore, there have been few comprehensive evaluation studies regarding the quality of the discovered knowledge. We propose a supervised learning method to discover subsumption relations from tags. The key to this method is quantifying the probabilistic association among tags to better characterise their relations. We further develop an algorithm to organise tags into hierarchies based on the learned relations. Experiments were conducted using a large, publicly available dataset, Bibsonomy, and three popular, human-engineered or data-driven knowledge bases: DBpedia, Microsoft Concept Graph, and ACM Computing Classification System. We performed a comprehensive evaluation using different strategies: relation-level, ontology-level, and knowledge base enrichment based evaluation. The results clearly show that the proposed method can extract knowledge of better quality than the existing methods against the gold standard knowledge bases. The proposed approach can also enrich knowledge bases with new subsumption relations, having the potential to significantly reduce time and human effort for knowledge base maintenance and ontology evolution.
... One specific, challenging research topic in this line is concerned with learning structured knowledge by exploring and exploiting social tagging data or folksonomies. There have been numerous methods and techniques to induce semantics from the noisy and unstructured folksonomies, and to construct structured knowledge with rich semantics, which have been shown useful for many application areas, such as domain ontology enrichment [1], information retrieval and navigation [2][3][4], recommender systems [5][6], academic communication [7], e-learning [8], mood mining [9], geotagging [10], etc. ...
... In the evaluation study [3], divisive hierarchical K-means clustering is performed on five social tagging datasets, Bibsonomy, CiteULike, del.icio.us, Flickr and Last.fm. ...
... Different from many of the existing studies, our paper overviewed the core methods and techniques deriving concepts and their relations from social tagging data. For related review papers focusing on data preprocessing, evaluations, and formal steps to associate semantics to folksonomies, the readers are advised to refer to [15], [3], [18] respectively. ...
Conference Paper
For more than a decade, researchers have been proposing various methods and techniques to mine social tagging data and to learn structured knowledge. It is essential to conduct a comprehensive survey on the related work, which would benefit the research community by providing better understanding of the state-of-the-art and insights into the future research directions. The paper first defines the spectrum of Knowledge Organization Systems, from unstructured with less semantics to highly structured with richer semantics. It then reviews the related work by classifying the methods and techniques into two main categories, namely, learning term lists and learning relations. The methods and techniques, which originate from natural language processing, data mining, machine learning, social network analysis, and the Semantic Web, are discussed in detail under the two categories. We summarize the prominent issues with the current research and highlight future directions on learning constantly evolving knowledge from social media data.
... It has become a key part of most online portals, such as Delicious, Blogger, Flickr, Twitter and Facebook. In recent years, folksonomies have emerged as an alternative to traditional classifications for organizing information [3,4]. They benefit from the power of collective intelligence to offer an easier (in terms of time, effort and cognitive costs) approach to organizing web resources [5]. ...
... Plangprasopchok et al. [37] adapted affinity propagation proposed by Frey & Dueck [38] to build deeper and denser tag hierarchies from folksonomies. However, Strohmaier et al. [4] have proved that generality-based approaches to learning tag hierarchy, with degree centrality as generality measure and co-occurrence as similarity measure, e.g. [10] have a superior performance compared to probabilistic models, e.g. ...
... In addition, they are not appropriate for acquiring semantic relations in tag collections since these collections tend to be much more inconsistent than text collections [47]. Moreover, Strohmaier et al., in their study of tag hierarchy building algorithms, show that the approaches tailored towards collaborative tagging systems outperform the approaches based on traditional hierarchical clustering techniques [4]. ...
Conference Paper
Building taxonomies for Web content manually is costly and time-consuming. An alternative is to allow users to create folksonomies: collective social classifications. However, folksonomies have inconsistent structures and their use for searching and browsing is limited. Approaches have been proposed for acquiring implicit hierarchical structures from folksonomies, but these approaches suffer from the “generality-popularity” problem, in that they assume that popularity is a proxy for generality (that high level taxonomic terms will occur more often than low level ones). In this paper we test this assumption, and propose an improved approach (based on the Heymann-Benz algorithm) for tackling this problem by checking the direction of relations against a corpus of text. Our results show that popularity works as a proxy for generality in at most 77% of cases, but that this can be improved to 81% using our approach. This improvement will translate to higher quality tag hierarchy structures.
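The "popularity as a proxy for generality" assumption tested in the abstract above can be read as an agreement rate: the share of known (general, specific) tag pairs in which the general tag is also the more frequent one. The following is a minimal sketch of that reading only. The tag frequencies, the gold pairs, and the function name `popularity_agreement` are hypothetical, and the paper's own measurement procedure is not described in this excerpt.

```python
def popularity_agreement(freq, gold_pairs):
    """Share of (general, specific) pairs where the general tag is more frequent.

    freq: dict mapping tag -> usage count (hypothetical data).
    gold_pairs: list of (general, specific) pairs from some reference taxonomy.
    """
    agree = sum(1 for general, specific in gold_pairs
                if freq[general] > freq[specific])
    return agree / len(gold_pairs)


# Toy data: "animal" is more general than "dog" but used less often,
# so popularity points in the wrong direction for that pair.
freq = {"music": 120, "rock": 80, "punk": 30, "animal": 15, "dog": 40}
gold_pairs = [("music", "rock"), ("rock", "punk"), ("animal", "dog")]

rate = popularity_agreement(freq, gold_pairs)
print(f"popularity agrees with generality in {rate:.0%} of pairs")
```

On this toy input two of the three pairs agree, which illustrates why a direction check against an external source could correct some relations.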
... Tagging is a process that allows individuals to freely assign tags to a web object or resource, whereas folksonomy (a set of user, tag, resource triples) is the result of that process [2]. In recent years, folksonomies have emerged as an alternative to traditional classifications of organizing information [3]. However, they share the inconsistent structure problem that is inherited from uncontrolled vocabularies, which causes many problems like ambiguity, homonymy (same spelling but different meanings), and synonymy (terms have the same meaning) [4,5]. ...
... Plangprasopchok et al. [7] adapted affinity propagation introduced by Frey & Dueck [17] to construct deeper and denser tag hierarchies from folksonomies. Yet Strohmaier et al. [3] showed that generality-based approaches of tag hierarchy, with degree centrality as generality measure and co-occurrence as similarity measure, e.g. [8] show a superior performance compared to probabilistic models, e.g. ...
... Evaluating taxonomy construction is a major challenge since there is not an approved evaluation dataset [3], nor an agreed methodology in the literature [19]. However, this subsection proposes a broad evaluation process to evaluate two things: 1) The quality of tag hierarchies constructed from our tag pairs approach, compared to tag hierarchies constructed from flat tags (Evaluation metrics: 1, 3 and 4). ...
Conference Paper
Building taxonomies for web content is costly. An alternative is to allow users to create folksonomies, collective social classifications. However, folksonomies lack structure and their use for searching and browsing is limited. Current approaches for acquiring latent hierarchical structures from folksonomies have had limited success. We explore whether asking users for tag pairs, rather than individual tags, can increase the quality of derived tag hierarchies. We measure the usability cost, and in particular the cognitive effort required to create tag pairs rather than individual tags. Our results show that a hierarchy creation algorithm (Heymann-Benz) performs better when applied to tag pairs than when applied to individual tags, with little impact on usability. However, the resulting hierarchies lack richness, and could be seen as less expressive than those derived from individual tags. This indicates that expressivity, not usability, is the limiting factor for collective tagging approaches aimed at crowdsourcing taxonomies.
... In 2012, Strohmaier, Helic et al. compared different folksonomy induction algorithms through decentralized search [29]. They showed that, based on evaluation through navigation, clustering algorithms developed for social tagging systems performed better than standard hierarchical clustering algorithms. ...
... In the theory of network navigability, Jon Kleinberg showed that networks that are formed according to a background hierarchy (i.e., a tree) are efficiently navigable [15], provided the search agent has access to that background hierarchy during the search. This method, called Hierarchical Decentralized Search, has been successfully applied in previous research ([12], [29]). This paper extends this application by using ontologies as the background knowledge. ...
... The experiments presented in this paper were conducted on a decentralized search simulator. This simulator was an extension of previous work by Helic, Strohmaier et al. ([11], [29]) and implemented in C++ based on the Stanford Network Analysis Project framework [3]. It permitted the simulation of decentralized search on a given network and used a provided ...
Article
The need to examine the behavior of different user groups is a fundamental requirement when building information systems. In this paper, we present Ontology-based Decentralized Search (OBDS), a novel method to model the navigation behavior of users equipped with different types of background knowledge. Ontology-based Decentralized Search combines ontologies and decentralized search, an established method for navigation in social networks, to model navigation behavior in information networks. The method uses ontologies as an explicit representation of background knowledge to inform the navigation process and guide it towards navigation targets. By using different ontologies, users equipped with different types of background knowledge can be represented. We demonstrate our method using four biomedical ontologies and their associated Wikipedia articles. We compare our simulation results with base line approaches and with results obtained from a user study and find that our method produces click paths that have properties similar to those originating from human navigators. The results suggest that our method can be used to model human navigation behavior in systems that are based on information networks such as Wikipedia.
... By constructing tag hierarchies from the bipartite tag-resource network structures of a number of tagging systems and by using this background knowledge as input for our hierarchical decentralized search algorithm, we could show that tag hierarchies perform extremely well in searching social tagging systems. In subsequent work [21], we also demonstrated that the most semantically sound tag hierarchies are also those that perform well on navigational tasks. However, our previous experiments were based on intuitions about how humans navigate and we have not yet compared our simulations (based on decentralized search) with real human navigational paths. ...
... Hence, the purpose of this paper is to compare simulations based on hierarchical decentralized search with a large-scale corpus of human navigational paths and to reveal whether or not it is justified to simulate human navigational behavior in information networks with the hierarchical decentralized search procedure as introduced and used by us in previous work [11, 21, 22, 23]. To that end, we compared more than 150,000 click trails of users navigating the complete English Wikipedia with simulations. ...
... In previous work [21] we showed that our algorithm depends on the quality of the hierarchical knowledge extracted from the information network. As also shown, the best results are achieved with hierarchies created by graph-based clustering algorithms that are based on the tag network's tag co-occurrence graph. ...
Conference Paper
Decentralized search in networks is an activity that is often performed in online tasks. It refers to situations where a user has no global knowledge of a network’s topology, but only local knowledge. On Wikipedia for instance, humans typically have local knowledge of the links emanating from a given Wikipedia article, but no global knowledge of the entire Wikipedia graph. This makes the task of navigation to a target Wikipedia article from a given starting article an interesting problem for both humans and algorithms. As we know from previous studies, people can have very efficient decentralized search procedures that find shortest paths in many cases, using intuitions about a given network. These intuitions can be modeled as hierarchical background knowledge that people access to approximate a network’s topology. In this paper, we explore the differences and similarities between decentralized search that utilizes hierarchical background knowledge and actual human navigation in information networks. For that purpose we perform a large scale study on the Wikipedia information network with over 500,000 users and 1,500,000 click trails. As our results reveal, a decentralized search procedure based on hierarchies created directly from the link structure of the information network simulates human navigational behavior better than simulations based on hierarchies that are created from external knowledge.
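The idea of decentralized search guided by hierarchical background knowledge, as described above, can be illustrated with a small simulation. The sketch below is an assumption-laden simplification, not the simulator used in the cited work: the agent only sees the neighbours of its current node and greedily moves to the unvisited neighbour that is closest to the target in a background tree. The toy network, the toy hierarchy, the hop limit, and all function names are hypothetical. Success rate and average path length are reported because the excerpts above name them as the comparison measures.

```python
def tree_distance(parent, a, b):
    """Number of edges between a and b in a rooted tree given as child -> parent."""
    # Record every ancestor of a (including a) with its distance from a.
    steps_from_a = {}
    node, steps = a, 0
    while node is not None:
        steps_from_a[node] = steps
        node = parent.get(node)
        steps += 1
    # Walk up from b until reaching a node that is also an ancestor of a.
    node, steps = b, 0
    while node not in steps_from_a:
        node = parent.get(node)
        if node is None:
            raise ValueError("nodes are not in the same tree")
        steps += 1
    return steps_from_a[node] + steps


def greedy_search(graph, parent, start, target, max_hops=20):
    """Return the visited path, or None if the target is not reached."""
    path, visited, current = [start], {start}, start
    while current != target and len(path) - 1 < max_hops:
        # Local knowledge only: the links leaving the current node.
        candidates = [n for n in graph[current] if n not in visited]
        if not candidates:
            return None
        current = min(candidates, key=lambda n: tree_distance(parent, n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None


def evaluate(graph, parent, pairs, max_hops=20):
    """Success rate and average hop count over (start, target) pairs."""
    paths = [greedy_search(graph, parent, s, t, max_hops) for s, t in pairs]
    found = [p for p in paths if p is not None]
    success_rate = len(found) / len(pairs)
    avg_hops = sum(len(p) - 1 for p in found) / len(found) if found else float("nan")
    return success_rate, avg_hops


# Hypothetical background hierarchy (child -> parent); "music" is the root.
parent = {"rock": "music", "jazz": "music", "punk": "rock",
          "metal": "rock", "bebop": "jazz"}
# Hypothetical undirected network the agent navigates.
graph = {
    "punk": ["metal", "rock"],
    "metal": ["punk", "jazz"],
    "rock": ["punk", "music"],
    "music": ["rock", "jazz"],
    "jazz": ["music", "metal", "bebop"],
    "bebop": ["jazz"],
}

path = greedy_search(graph, parent, "punk", "bebop")
print(path)  # ['punk', 'rock', 'music', 'jazz', 'bebop']
print(evaluate(graph, parent, [("punk", "bebop"), ("bebop", "metal")]))
```

In this toy run the greedy agent takes four hops although a three-hop route exists via "metal", which shows how the quality of the background hierarchy shapes the paths found.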
... Effectively, tag clouds are a new "social" way to find and visualize information providing both: one-click access to information and a snapshot of the "aboutness" of a tagged collection. Not surprisingly, a large volume of research has been devoted to developing better approaches to construct and visualize tag clouds [5, 30, 18] as well as more advanced tag constructs such as clustered/classified tag clouds [23, 32, 2, 39, 16, 25] and tag hierarchies [10, 19, 34, 35]. The majority of research on tag clouds and hierarchies used an information- or network-theoretical approach to evaluate the quality of different tag constructs in terms of search and navigation while ignoring the user perspective. ...
... Their analysis showed that users could effectively deploy query recommendations to explore large sets of images annotated with tags. Other studies [19, 34] explored another advanced tag construct, tag hierarchy, for tag-based navigation. By utilizing a decentralized search framework [34], the authors found that there are significant differences among different approaches to tag hierarchy construction in terms of success rate and average path length. ...
... Since our primary goal in this paper is to explore whether the tag-based browsing constructs could provide any additional value to tag-based search, we apply the most popular interface layout, a tag cloud, as our basic tag interface and compare it to a traditional search box interface. ...
Conference Paper
The availability of social tags has greatly enhanced access to information. Tag clouds have emerged as a new "social" way to find and visualize information, providing both one-click access to information and a snapshot of the "aboutness" of a tagged collection. A range of research projects explored and compared different tag artifacts for information access ranging from regular tag clouds to tag hierarchies. At the same time, there is a lack of user studies that compare the effectiveness of different types of tag-based browsing interfaces from the user's point of view. This paper contributes to the research on tag-based information access by presenting a controlled user study that compared three types of tag-based interfaces on two recognized types of search tasks – lookup and exploratory search. Our results demonstrate that tag-based browsing interfaces significantly outperform traditional search interfaces in both performance and user satisfaction. At the same time, the differences between the two types of tag-based browsing interfaces explored in our study are not as clear.
... The standard metrics are taxonomic precision (TP), taxonomic recall (TR) and taxonomic F-measure (TF). The idea is to find the similarity between the proposed ontology L and the ground truth ontology G for each tag community C, and to generate a characteristic extract from each of them, ce(C, L) and ce(C, G), which is in line with [3,10]. TP, TR and TF can then be computed by averaging the tp, tr and tf of all the communities. ...
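The averaging described in the excerpt above can be sketched as follows. This is a minimal illustration under stated assumptions: each characteristic extract ce(C, L) and ce(C, G) is taken to be a plain set of (parent, child) relations, and the local tp, tr and tf are computed as set-overlap precision, recall and their harmonic mean. How the cited work actually builds the extracts is not given in the excerpt, and the function names and data are hypothetical.

```python
def local_scores(extract_learned, extract_gold):
    """tp, tr, tf for one community from its two characteristic extracts."""
    overlap = len(extract_learned & extract_gold)
    tp = overlap / len(extract_learned) if extract_learned else 0.0
    tr = overlap / len(extract_gold) if extract_gold else 0.0
    tf = 2 * tp * tr / (tp + tr) if tp + tr > 0 else 0.0
    return tp, tr, tf


def taxonomic_scores(extracts):
    """Average tp, tr, tf over communities.

    extracts: list of (ce(C, L), ce(C, G)) pairs, one pair per community C.
    """
    scores = [local_scores(learned, gold) for learned, gold in extracts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Two hypothetical communities; extracts are sets of (parent, child) relations.
extracts = [
    ({("python", "django"), ("python", "flask"), ("python", "java")},
     {("python", "django"), ("python", "flask")}),
    ({("sql", "mysql")},
     {("sql", "mysql"), ("sql", "postgresql")}),
]

TP, TR, TF = taxonomic_scores(extracts)
print(f"TP={TP:.3f} TR={TR:.3f} TF={TF:.3f}")  # TP=0.833 TR=0.750 TF=0.733
```

Averaging the per-community tf, rather than taking the harmonic mean of the averaged TP and TR, follows the wording of the excerpt; other formulations in the literature may differ.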
... Most of the previous work [3,10] ... Ground truth ontologies were constructed for five tag communities, in which there were three big and two small communities, each having around 200 and 60 relations, respectively. We compared the ontology of our method with ontologies of each KB. ...
Chapter
Community Question Answer (CQA) sites are a very popular means for knowledge transfer in the form of questions and answers. They rely on tags to connect the askers with the answerers. Since each CQA site contains information about a wide range of topics, it is difficult for users to navigate through the set of available tags and select the best ones for their question annotation. At present, CQA sites present the tags to the users using simple orderings, such as order by popularity and lexical order. This paper proposes a novel unsupervised method to mine different types of relationships between tags and then create a forest of ontologies to represent those relationships. Extracting the tag relationships will help users to understand the tags' meanings. Representing them in a forest of ontologies will help the users in better tag navigation, thereby providing the users a clear understanding of the tag usage for question annotation. Moreover, our method can also be combined with existing tag recommendation systems to improve them. We evaluate our tag relationship mining algorithms and tag ontology construction algorithm with the state-of-the-art baseline methods and the three popular knowledge bases, namely DBpedia, ConceptNet, and WebIsAGraph.
... tags are in the form of hashtags to produce alternative access points to tweets. These accumulated tags are commonly referred to as Folksonomies, which have been used for organising online resources [1], browsing [2], semantic-based search and recommendation [3], and learning knowledge structures [4]. It is also reported that tags have higher descriptive and discriminative power compared to other textual features, such as titles, descriptions and comments, for document classification [5]. Figure 1 displays an example of a published paper and its associated tags on Bibsonomy. ...
... To extract the subsumption relations for all tags in each of the datasets (except Zhihu), we grounded the tags to concepts in the external knowledge base, the Microsoft Concept Graph (MCG). MCG has around 1.8M concepts and instances, and 8.5M subsumption relations. ...
Article
Automated social text annotation is the task of suggesting a set of tags for shared documents on social media platforms. The automated annotation process can reduce users' cognitive overhead in tagging and improve tag management for better search, browsing, and recommendation of documents. It can be formulated as a multi-label classification problem. We propose a novel deep learning based method for this problem, and design an attention-based neural network with semantic-based regularisation, which can mimic users' reading and annotation behaviour to formulate better document representation, leveraging the semantic relations among labels. The network separately models the title and the content of each document and injects an explicit, title-guided attention mechanism into each sentence. To exploit the correlation among labels, we propose two semantic-based loss regularisers, i.e. similarity and subsumption, that enforce the output of the network to conform to label semantics. The model with the semantic-based loss regularisers is referred to as the Joint Multi-label Attention Network (JMAN). We conducted a comprehensive evaluation study and compared JMAN to the state-of-the-art baseline models, using four large, real-world social media datasets. In terms of F1, JMAN significantly outperformed Bi-GRU (Bidirectional Gated Recurrent Unit) relatively by around 12.8% to 78.6%, and the Hierarchical Attention Network (HAN) by around 3.9% to 23.8%. The JMAN model demonstrates advantages in convergence and training speed. Further improvement of performance was observed against LDA (Latent Dirichlet Allocation) and SVM (Support Vector Machine). When applying the semantic-based loss regularisers, performance of HAN and Bi-GRU in terms of F1 was also boosted. It is also found that dynamic update of the label semantic matrices (JMAN d) has the potential to further improve the performance of JMAN but at the cost of substantial memory, and warrants further study.
... However, there is no comprehensive research on analysis and comparison of these rules in a rigorous manner. The work in [16] compared four clustering and generality based techniques and found that generality-based methods in general outperform the clustering based ones. It primarily focused on evaluation techniques and did not analyse how the underlying rules and their assumptions affected the results. ...
... There might be issues with the reference-based evaluation, as it only measures the global similarity between the learned hierarchy and the reference hierarchy. However, it is possible that some branches of the learned hierarchy were very similar to the reference hierarchy [16]. Two excerpts (due to the limited space) of the learned hierarchies are illustrated in Fig. 2 with fuzzy set inclusion and probabilistic association rule. ...
Chapter
Automatic generation of hierarchies from social tags is a challenging task. We identified three rules, set inclusion, graph centrality and information-theoretic condition, from the literature and proposed two new rules, fuzzy set inclusion and probabilistic association, to induce hierarchical relations. We proposed a hierarchy generation algorithm, which can incorporate each rule with different data representations, i.e., resource and Probabilistic Topic Model based representations. The learned hierarchies were compared to some of the widely used reference concept hierarchies. We found that probabilistic association and set inclusion based rules helped produce better quality hierarchies according to the evaluation metrics.
... The key aim of our approach is to solve the popularity-generality problem caused by using clustering techniques. To tackle this problem, our proposed approach extended a promising generality-based algorithm, based on Strohmaier et al. (2012), by using lexicosyntactic patterns applied to a large text corpus, i.e. English Wikipedia. ...
... The algorithm we have developed and used in our approach (Table 3) is an extension of Benz's algorithm (Benz, Hotho and Stutzer, 2010), which itself is an extension of Heymann's algorithm (Heymann and Garcia-Molina, 2006). Based on a comprehensive study of tag hierarchy construction algorithms, Strohmaier et al. (2012) show that generality-based approaches of tag hierarchy (with degree centrality as generality measure and co-occurrence as similarity measure, e.g. Benz's algorithm) show a superior performance compared to other approaches. ...
Conference Paper
Full-text available
Content on the Web is huge and constantly growing, and building taxonomies for such content can help with navigation and organisation, but building taxonomies manually is costly and time-consuming. An alternative is to allow users to construct folksonomies: collective social classifications. Yet, folksonomies are inconsistent and their use for searching and browsing is limited. Approaches have been suggested for acquiring implicit hierarchical structures from folksonomies, but these approaches suffer from the ‘popularity-generality’ problem, in that popularity is assumed to be a proxy for generality, i.e. high-level taxonomic terms will occur more often than low-level ones. To tackle this problem, we propose in this paper an improved approach. It is based on the Heymann–Benz algorithm, and works by checking the taxonomic directions against a corpus of text. Our results show that popularity works as a proxy for generality in at most 90.91% of cases, but this can be improved to 95.45% using our approach, which should translate to higher-quality tag hierarchy structures.
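The generality-based idea behind the Heymann–Benz family of algorithms can be illustrated with a minimal sketch. It assumes degree centrality in the tag co-occurrence network as the generality measure and raw co-occurrence counts as the similarity measure, as described in the citing snippet above; the function names, the `min_sim` threshold and the toy data are hypothetical, and this is not the authors' implementation (in particular, it omits the corpus-based direction check).

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(posts):
    """Count how often each ordered pair of tags is assigned to the same post."""
    counts = defaultdict(int)
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def induce_hierarchy(posts, min_sim=1, root="ROOT"):
    """Return a child -> parent mapping for a generality-ordered tag tree."""
    counts = cooccurrence(posts)
    # Degree centrality: number of distinct tags a tag co-occurs with.
    degree = defaultdict(int)
    for a, _ in counts:
        degree[a] += 1
    all_tags = {t for tags in posts for t in tags}
    # Most general (highest-degree) tags first; ties broken alphabetically.
    ordered = sorted(all_tags, key=lambda t: (-degree[t], t))
    parent = {}
    placed = []
    for tag in ordered:
        # Attach to the most similar tag already in the tree, if similar enough
        # (min_sim is a hypothetical threshold); otherwise attach to the root.
        best = max(placed, key=lambda p: counts.get((tag, p), 0), default=None)
        if best is not None and counts.get((tag, best), 0) >= min_sim:
            parent[tag] = best
        else:
            parent[tag] = root
        placed.append(tag)
    return parent


posts = [
    ["programming", "python"],
    ["programming", "java"],
    ["programming", "python", "django"],
    ["python", "django"],
]
hierarchy = induce_hierarchy(posts)
```

In this toy example the most connected tag, `programming`, is placed first and every other tag ends up directly beneath it, which shows how the ordering treats popularity as a stand-in for generality.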
... The methods for evaluation of structured information from folksonomies can be conditionally divided in 3 groups: semantic evaluation and evaluation of navigation, as mentioned by Strohmaier et al. [33], and evaluation of recommendation. Semantic evaluation examines the truthfulness of the learned relations. ...
Thesis
Full-text available
Recommendation systems are software systems, most often based on machine learning algorithms, that are used to recommend products and online content to online users, with recommendations based on learning user profile and user preferences. Recommendation systems are becoming more common in our lives, and to a large extent are the driving force behind major online retail and broadcasting websites such as Amazon, YouTube and Netflix. In this master thesis, we explore recommendation systems for books. We base our work on the goodbooks-10k dataset from the website Goodreads.com, from which we mainly observe the customer tags data. Customer tags consist of a variety of unstructured information, such as the opinion and attitude of the user towards the book, and characteristics of the book such as series, author, genre, and more. The goal of our work is to extract structured information from tags in the form of a graph, which through graph-based embedding algorithms can be turned into vector representation and used to create a content-based recommendation system. To achieve this, we conduct experiments with several approaches for learning tag structure - learning tag relationships by using word embeddings, clustering, and mapping to an expert ontology. The experiments show an improvement in the recommendations when using a graph structure, compared to using the data in a raw format. The results obtained give us hope that developing and extending methods similar to the ones examined in this master thesis can lead to better recommendations and can be used in recommender systems for real-life tasks.
... In spite of these limitations, proposals in automatic methods can contribute to complement or even substitute manual taxonomies and make these resources more adaptable to different languages and purposes. This is even clearer specifically in fully statistically-based taxonomies emerging from corpus data in the line of Bullinaria & Levy (2007) or Strohmaier et al. (2012). Bordea et al. (2015) offer a recent description of state of the art in taxonomy induction and present the results of different teams that participated in the SemEval-2015 Task on Taxonomy Extraction Evaluation. ...
Conference Paper
Full-text available
In this paper we describe our work in progress in the automatic development of a taxonomy of Spanish nouns, we offer the Perl implementation we have so far, and we discuss the different problems that still need to be addressed. We designed a statistically-based taxonomy induction algorithm consisting of a combination of different strategies not involving explicit linguistic knowledge. Being all quantitative, the strategies we present are however of different nature. Some of them are based on the computation of distributional similarity coefficients which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and identify pairs of parent-child words or hypernym-hyponym relations. A decision making process is then applied to combine the results of the previous steps, and finally connect lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and also using Spanish Wordnet as a gold-standard. We estimate an average of 89.07% precision and 25.49% recall considering only the results which the algorithm presents with high degree of certainty, or 77.86% precision and 33.72% recall considering all results.
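As an illustration of the gold-standard evaluation mentioned above, precision and recall can be computed over sets of parent-child pairs. The abstract does not say how matches against Spanish WordNet are counted, so this sketch assumes exact matching of direct (hypernym, hyponym) pairs; the function name and the toy pairs are hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted (parent, child) pairs against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


predicted = {("animal", "dog"), ("animal", "cat"), ("plant", "dog")}
gold = {("animal", "dog"), ("animal", "cat"), ("animal", "bird"), ("plant", "tree")}
p, r = precision_recall(predicted, gold)
```

Restricting `predicted` to the pairs an algorithm is most certain about typically raises precision and lowers recall, which is the trade-off the two reported settings reflect.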
... A number of algorithms (e.g. Affinity Propagation) have been proposed to obtain folksonomies [72]. One of the issues of folksonomies is that users' ways of annotating are different; therefore, fuzziness approaches have been introduced in the tags. ...
Article
Full-text available
Promoting recommender systems in real-world applications requires deep investigations with emphasis on their next generation. This survey offers a comprehensive and systematic review on recommender system development lifecycles to enlighten researchers and practitioners. The paper conducts statistical research on published recommender systems indexed by Web of Science to get an overview of the state of the art. Based on the reviewed findings, we introduce taxonomies driven by the following five phases: initiation (architecture and data acquisition techniques), design (design types and techniques), development (implementation methods and algorithms), evaluation (metrics and measurement techniques) and application (domains of applications). A layered framework of recommender systems containing market strategy, data, recommender core, interaction, security and evaluation is proposed. Based on the framework, the existing advanced humanized techniques emerged from computational intelligence and some inspiring insights from computational economics and machine learning are provided for researchers to expand the novel aspects of recommender systems.
... Recently, scientific approaches tend to focus on building ontology based on folksonomy instead of free texts [3], [4]. Inducting ontology from folksonomy's tags is relevant for its free expenses in building hierarchies of concepts from different areas, furthermore, for its easy use by an ordinary unskilled user. ...
... Currently, dedicated staff are responsible for handling of metadata in digital libraries, which includes controlled vocabulary and established taxonomies (Bethard et al., 2009). A recent approach to metadata construction and maintenance, namely, "folksonomies", makes use of the social networking mechanism and user tagging (Strohmaier et al., 2012;Lau et al., 2015). ...
Article
Full-text available
Purpose – This study aims to discuss the metadata structure of an online legal information system (OLIS) developed to suit the Indian environment. The OLIS is accessible online at www.olisindia.in. It contains several types of legal information resources to help lawyers, research scholars, students and the common user. The open-access OLIS helps the users to get the required information expeditiously. Dublin Core (DC) metadata standard was selected to create records in the OLIS because of ease of use and high adoption rate. Design/methodology/approach – The OLIS was designed using the system analysis and design method after a needs assessment survey conducted in eight major legal organizations in Delhi. The OLIS, accessible at www.olisindia.in, was accessed to identify and validate the metadata elements with the DC metadata standard. Findings – This paper discusses in detail the metadata structures of the OLIS. The system contains 15 types of resources relating to judicial and legislative information. Each database has a different metadata framework so that information desired by the legal community can be retrieved with precision and quick recall. In addition, a number of functions, such as latest news, online help, Frequently Asked Questions, query submission, online discussion forum for help and video tutorials, have been integrated into the OLIS. Practical implications – The study guides law libraries and library professionals to follow metadata standards in building an open-access database and also provides a legal resources metadata framework that enables them to select suitable resources for their libraries. Originality/value – The study confirms that the metadata elements set for managing judicial and legislative information are different compared to other types of scholarly information. 
The study can help newly established law university libraries to build legal information systems to suit their environment and satisfy the information needs of the diverse law community. Keywords India, Metadata, Dublin Core, Information system, Judicial information, Legislative information Paper type Research paper
... Although the goal of these hierarchies is to support navigation, they are typically optimized for semantics. Previous work has shown that semantically optimal categories often possess desirable navigational properties, at least in the domain of social tagging systems (Strohmaier et al., 2012). However, the question of whether and how these results can be generalized to other information networks still remains an open one in our community. ...
... Some works tackle this problem by associating tags initially to WordNet Synsets, and then the Synset hierarchical structure is compared against ontologies [16]. In a comparative study realized by [31], it was proven that the algorithm of Heymann et al [7] outperforms all the algorithms introduced in the study. ...
Article
Full-text available
Web 2.0 is an evolution toward a more social, interactive and collaborative web, where the user is at the center of the service in terms of publications and reactions. This transforms the user from a consumer into a producer. Folksonomies are one of the technologies of Web 2.0 that permit users to annotate resources on the Web. This is done by allowing users to use any keyword or tag that they find relevant. Although folksonomies require a context-independent and inter-subjective definition of meaning, many researchers have proven the existence of an implicit semantics in these unstructured data. In this paper, we propose an improvement of our previous approach to extract ontological structures from folksonomies. The major contributions of this paper are a Normalized Co-occurrences in Distinct Users (NCDU) similarity measure, and a new algorithm to define context of tags and detect ambiguous ones. We compared our similarity measure to a widely used method for identifying similar tags based on the cosine measure. We also compared the new algorithm with the Fuzzy Clustering Algorithm (FCM) used in our original approach. The evaluation shows promising results and emphasizes the advantage of our approach.
... In Table 2 we have summarized the evaluation criteria to measure quality of folksonomies and mentioned as to which of our selected aspects cover which criteria. Criteria 1-7 are outlined by Farhan and Sanderson (2008) for e-government folksonomy and 8-10 are pointed out in Helic et al. (2011, 2012), Al-Khalifa and Davis (2006), Strohmaier et al. (2012a), Solskinnsbakk et al. (2012). Here we feel the need that research is required to have updated evaluation criteria and folksonomy management metrics (Folksonomy Management, 2014) against which the folksonomies must be evaluated to measure and ensure worthiness of the folksonomy with passage of time (Qualitatively and Quantitatively). ...
Article
Folksonomy gives liberty to its users to freely assign chosen keywords as tags, and this is the main reason behind its popularity. Apart from freedom, this system also reflects the collective intelligence of the crowd. However, this freedom and liberty can degrade quality of the folksonomy. It is required that quality of the folksonomy must remain consistently excellent and does not degrade with the passage of time. This is a survey paper, in which we present a brief survey of the research efforts intended to maintain a quality-protected folksonomy. We have organized our paper by looking at the problem from four aspects namely selection of quality tags, tag management features provided by folksonomy applications, folksonomy cleaning and interoperability of tags across platforms. We conclude our review with some of the interesting research topics, which need to be explored further. Our conclusion will be relevant and beneficial for engineers and designers who aim to design and maintain a quality-protected folksonomy.
... Folksonomies and crowdsourced taxonomies [10,17,20,21,31,32], is a very popular research area bringing forward the importance of user-provided tags and the need for tag-hierarchies for easier browsing and searching of web data. Some of these approaches automatically derive similarity measures between tags and use agglomerative clustering for producing tag-hierarchies, but much of previous work assumes a static tag space, despite its dynamicity. ...
Article
Full-text available
In this work, we present an approach for automatically identifying subsumption relations between web queries, a difficult (due to feature sparseness and ambiguity), but extremely useful task for many applications, ranging from user profiling and semantic enhancement of query logs, to traffic minimisation in distributed search environments (e.g., federations of digital libraries or cloud-based systems). We start by matching each query to the topics of a comprehensive web directory, and use these topics to apply query expansion in an iterative fashion. Subsequently, all expanded queries are mapped onto the DMOZ hierarchy, and the resulting subsumption relations are directly inferred from the directory structure once conflicts in the hierarchy are resolved. We evaluate our technique on real-world queries, and show that our approach is effective under all settings.
... The evaluation carried out by the authors automatically compares the folksonomy with the hierarchy of the Open Directory Project. The purpose of the work of [23] is to implement and evaluate three classes of algorithms for the induction of folksonomies. The three algorithms create hierarchical structures of tags. ...
Conference Paper
Full-text available
A dialogue system allows a human to interact with a computer, through the natural language. One of the main components of a dialogue system is the Conceptual Model. The Conceptual Model represents a domain and its specification is given by several forms of knowledge representation. We propose to represent it using folksonomies. We describe a method called FolksDialogue that performs the learning of folksonomies from task-oriented dialogues. In order to check whether the structures created by the method are genuine folksonomies, we performed an experiment to prove that they have the small-world phenomenon, which is a characteristic of folksonomies. The generated folksonomies can be useful in the interpretation of dialogue utterances, indicating whether the utterances belong or not to the domains that the folksonomies represent. The experiments show that the folksonomies learned can perform the interpretation of utterances with an accuracy of 69.20%.
... e generated by them through the union of all personal hierarchies, and which represents the folksonomy, was a tree of tags. The technique used for obtaining the folksonomy was based on relational clustering with similarity measures. The authors compared this folksonomy with the hierarchy of the Open Directory Project (Open Directory Project, 2014). Strohmaier et al. (Strohmaier et al., 2012) implemented and evaluated three classes of algorithms for the induction of folksonomies. The three algorithms created hierarchical structures of tags. The process of obtaining the folksonomies using the algorithms involves recursively applied methods of clustering and a hierarchical k-means algorithm. They also used tagging data from s ...
Article
Knowledge discovery is the process of discovering useful knowledge in a broad range of sources, such as relational databases, images, or texts. Dialogues are generated by interaction between people using natural language and can be used as a source of information. Once discovered, knowledge needs to be represented, and there are several approaches to this. In this paper, we propose a method to discover knowledge in task-oriented dialogues by representing these dialogues through folksonomies, using a novel quadripartite model. Folksonomies are knowledge structures composed of users, tags, and resources. Dialogues and folksonomies have a social dimension in common, which renders folksonomies suited to representing knowledge discovered from dialogues. The knowledge represented by folksonomies can be used to interpret new utterances in a dialogue and detect trends, e.g., by discovering topics addressed by people at different time intervals, in the dialogues used to learn the folksonomies. The main difference between our approach and past techniques is that we use the characteristics (the content) of each resource in the discovery process. Experiments involving a real-world task-oriented dialogue corpus showed that using our method, learned folksonomies can interpret utterances with an accuracy of 72.32%. Moreover, another experiment showed that it is possible to use our method to determine topics addressed by interlocutors in dialogues.
... Our approach ameliorates the previous one that was an extension of the work of Heymann et al (Heymann, 2006). In a comparative study realized by (Strohmaier, 2012), it was proven that Heymann's algorithm outperforms all the algorithms introduced in the study. This is why we have chosen it as the basis of our contribution. ...
Article
Full-text available
Folksonomies are one of the technologies of Web 2.0 that permit users to annotate resources on the Web. In this paper, the authors propose an integrated approach to extract ontological structures from unstructured and semi-structured resources. Their proposal overcomes limitations of existing approaches. It gives a formal, simple, and efficient solution to the tag clustering and disambiguation problem. Moreover, their approach doesn't need any ontology as an upper guide during the generation process. The generated ontology can be used to enhance various tasks such as ontology evolution and enrichment.
... In the literature one can find different interpretations of the meaning of Folksonomy ([9], [10]). In this paper we used the definition proposed by the creator of the term, Thomas Van der Wal [11] (and then reinforced by Adam Mathes [6]). ...
Conference Paper
Full-text available
This paper presents FOLKUS-SD, a module of CSCW-SD intended to build Folksonomies from source code. CSCW-SD is an architecture to integrate tools in a development environment through a Multi-Agent System. The Folksonomies are built using dynamic data collected during coding. The module represents the Folksonomies' entities as follows: the users are the developers, the tags are names of classes, methods and attributes, and the resources are source code files. The Folksonomies produced by FOLKUS-SD can be used in software documentation, for monitoring the status of a project, and to get quantitative data about the project and its participants.
... In (Tang et al, 2009) the authors use Folksonomies as a source of terminology and try to infer ontological structures from them using co-occurrence statistics. (Strohmaier, 2012) makes a comprehensive survey and evaluation of the state-of-the-art folksonomy induction algorithms and introduces a new pragmatic approach to folksonomy evaluation. Integration of Semantic web technologies and social web is studied in many other researches, as (Grobelnik et al., 2009), based on social networks, and (Tomašev, 2011), classifying images, discussed in social web sites. ...
Conference Paper
Full-text available
Usage of semantic web technologies is possible only if the needed knowledge is ontologically represented. Creating ontologies manually is a difficult, time- and labor-intensive process. Many approaches, methods and tools have been developed recently to automate the ontology building, evolution and evaluation process. The aim of this paper is to make a brief survey of ontology learning approaches and methods, paying particular attention to recent research, and to outline main trends in this area.
... Our approach is an extension of the work of Heymann et al [5], which was extended by Benz et al [8], where other similarity and generality measures were applied. In a comparative study realized by [32] it was proven that this algorithm outperforms all the algorithms introduced in the study. This is why we have chosen it as the basis of our contribution. ...
Conference Paper
Full-text available
Collaborative tagging systems have recently emerged as a powerful way to label and organize large collections of data. The informal social classification structure in these systems, also known as folksonomy, provides a convenient way to annotate resources by allowing users to use any keyword or tag that they find relevant. Although folksonomies and the respective tags often lack a context-independent and intersubjective definition of meaning, the assumption that the evolving structure of these digital records contains implicit evidences for the underlying semantics has been proven by successful approaches of making the emergent semantics explicit. In this paper we propose an approach for extracting ontological structures from folksonomies that exploits the power of fuzzy clustering using new similarity and generality measure. The fuzzy clustering process discovers ambiguous tags and disambiguates them all at once, and the new similarity measure gives more accurate results as it calculates co-occurrences based on distinct users and not only in the number of co-occurrences of two distinct words. The generated ontology can be used to enhance various tasks in the tagging systems, such as tag disambiguation, result visualization, and ontology evolution. Our experimental results on real world data sets show that our method can effectively learn the ontology structure from the folksonomies.
... In our own previous work [20, 21], we investigated the extent to which tag semantics are influenced by user motivation and usage practices. In [31] we investigated the quality of semantic relations in automatically constructed tag hierarchies. By measuring Taxonomic Recall and Precision [9] against a huge number of existing human created concept hierarchies we have shown that algorithms such as e.g. ...
Conference Paper
Full-text available
Although many social tagging systems share a common tri-partite graph structure, the collaborative processes that are generating these structures can differ significantly. For example, while resources on Delicious are usually tagged by all users who bookmark the web page cnn.com, photos on Flickr are usually tagged just by a single user who uploads the photo. In the literature, this distinction has been described as a distinction between broad vs. narrow folksonomies. This paper sets out to explore navigational differences between broad and narrow folksonomies in social hypertextual systems. We study both kinds of folksonomies on a dataset provided by Mendeley, a collaborative platform where users can annotate and organize scientific articles with tags. Our experiments suggest that broad folksonomies are more useful for navigation, and that the collaborative processes that are generating folksonomies matter qualitatively. Our findings are relevant for system designers and engineers aiming to improve the navigability of social tagging systems.
... Semantics of tags: On the one hand, researchers explored to what extent semantics emerge from folksonomies by investigating different algorithms for extracting tag networks and hierarchies from such systems (see e.g., [1], [3] or [13]). The work of [14] evaluated three state-of-the-art folksonomy induction algorithms in the context of five social tagging systems. Their results show that those algorithms specifically developed to capture intuitions of social tagging systems outperform traditional hierarchical clustering techniques. ...
Conference Paper
Full-text available
This paper sets out to explore whether data about the usage of hashtags on Twitter contains information about their semantics. Towards that end, we perform initial statistical hypothesis tests to quantify the association between usage patterns and semantics of hashtags. To assess the utility of pragmatic features (which describe how a hashtag is used over time) for semantic analysis of hashtags, we conduct various hashtag stream classification experiments and compare their utility with the utility of lexical features. Our results indicate that pragmatic features indeed contain valuable information for classifying hashtags into semantic categories. Although pragmatic features do not outperform lexical features in our experiments, we argue that pragmatic features are important and relevant for settings in which textual information might be sparse or absent (e.g., in social video streams).
... Our approach is an extension of the work of Heymann et al [5], which was extended by Benz et al [8], where other similarity and generality measures were applied. In a comparative study realized by [31] it was proven that this algorithm outperforms all the algorithms introduced in the study. This is why we have chosen it as the basis of our contribution. ...
... In previous work we have evaluated a series of existing tag hierarchy algorithms on a theoretical level, without taking user interface constraints into account [10]. As we have found that centrality-based algorithms outperform hierarchical clustering algorithms by a large margin (see [10,31] for more details), we select one of these algorithms to conduct further investigations of their usefulness under typical user interface constraints. It is reasonable to assume that other centrality-based tag hierarchy algorithms will behave similarly under our constraints. ...
Conference Paper
Today, a number of algorithms exist for constructing tag hierarchies from social tagging data. While these algorithms were designed with ontological goals in mind, we know very little about their properties from an information retrieval perspective, such as whether these tag hierarchies support efficient navigation in social tagging systems. The aim of this paper is to investigate the usefulness of such tag hierarchies (sometimes also called folksonomies - from folk-generated taxonomy) as directories that aid navigation in social tagging systems. To this end, we simulate navigation of directories as decentralized search on a network of tags using Kleinberg's model. In this model, a tag hierarchy can be applied as background knowledge for decentralized search. By constraining the visibility of nodes in the directories we aim to mimic typical constraints imposed by a practical user interface (UI), such as limiting the number of displayed subcategories or related categories. Our experiments on five different social tagging datasets show that existing tag hierarchy algorithms can support navigation in theory, but our results also demonstrate that they face tremendous challenges when user interface (UI) restrictions are taken into account. Based on this observation, we introduce a new algorithm that constructs efficiently navigable directories on our datasets. The results are relevant for engineers and scientists aiming to improve navigability of social tagging systems.
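The navigation simulation described above can be sketched as a greedy search in which the tag hierarchy serves as background knowledge: at each step the searcher moves to the neighbouring tag that is closest to the target in the hierarchy. This is a simplified, assumed variant and not the paper's implementation; the abstract does not give the details of Kleinberg's model or of the UI constraints, and all names and the toy data below are hypothetical.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root of the hierarchy."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of hierarchy edges between a and b via their lowest common ancestor."""
    up_a = path_to_root(a, parent)
    depth_in_a = {n: i for i, n in enumerate(up_a)}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in depth_in_a:
            return depth_in_a[n] + j
    raise ValueError("nodes are not in the same tree")


def greedy_search(start, target, neighbors, parent, max_hops=20):
    """Greedy decentralized search; returns the visited path, or None on failure."""
    path = [start]
    current = start
    visited = {start}
    while current != target and len(path) <= max_hops:
        # Only local knowledge is used: the unvisited neighbours of the current tag.
        candidates = [n for n in neighbors[current] if n not in visited]
        if not candidates:
            return None
        current = min(candidates, key=lambda n: tree_distance(n, target, parent))
        visited.add(current)
        path.append(current)
    return path if current == target else None


# Hierarchy (child -> parent) and tag network (adjacency lists), both toy data.
parent = {"python": "programming", "java": "programming",
          "django": "python", "flask": "python"}
neighbors = {
    "programming": ["python", "java"],
    "python": ["programming", "django", "flask"],
    "java": ["programming"],
    "django": ["python", "flask"],
    "flask": ["python", "django"],
}
route = greedy_search("java", "django", neighbors, parent)
```

A UI restriction of the kind the abstract mentions could be approximated by truncating each adjacency list before the search, which is one way to see why limited visibility makes navigation harder.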
Conference Paper
Dialogue systems intend to facilitate the interaction between humans and computers. A key element in a dialogue system is the conceptual model which represents a domain. Folksonomies are very simple forms of knowledge representation which may be used to specify the conceptual model. However, folksonomies are ambiguous by nature. In this paper, we present a method which uses linguistic context for learning folksonomies from task-oriented dialogues. The linguistic context can be useful for reducing ambiguity, for instance, when using the folksonomies for interpreting utterances. Experiments show that the learned folksonomies increase the accuracy of the interpretation compared with not using the contextual information.
Article
Full-text available
Purpose: This study introduces an algorithm to construct tag trees that can be used as a user-friendly navigation tool for knowledge sharing and retrieval by solving two issues of previous studies, i.e. semantic drift and structural skew. Design/methodology/approach: Inspired by the generality based methods, this study builds tag trees from a co-occurrence tag network and uses the h-degree as a node generality metric. The proposed algorithm is characterized by the following four features: (1) the ancestors should be more representative than the descendants, (2) the semantic meaning along the ancestor-descendant paths needs to be coherent, (3) the children of one parent are collectively exhaustive and mutually exclusive in describing their parent, and (4) tags are roughly evenly distributed to their upper-level parents to avoid structural skew. Findings: The proposed algorithm has been compared with a well-established solution Heymann Tag Tree (HTT). The experimental results using a social tag dataset showed that the proposed algorithm with its default condition outperformed HTT in precision based on Open Directory Project (ODP) classification. It has been verified that h-degree can be applied as a better node generality metric compared with degree centrality. Research limitations: A thorough investigation into the evaluation methodology is needed, including user studies and a set of metrics for evaluating semantic coherence and navigation performance. Practical implications: The algorithm will benefit the use of digital resources by generating a flexible domain knowledge structure that is easy to navigate. It could be used to manage multiple resource collections even without social annotations since tags can be keywords created by authors or experts, as well as automatically extracted from text. Originality/value: Few previous studies paid attention to the issue of whether the tagging systems are easy to navigate for users. 
The contributions of this study are twofold: (1) an algorithm was developed to construct tag trees with consideration given to both semantic coherence and structural balance and (2) the effectiveness of a node generality metric, h-degree, was investigated in a tag co-occurrence network.
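The abstract above does not define h-degree. A minimal sketch follows, assuming the definition commonly used for weighted networks, by analogy with the h-index: the h-degree of a node is the largest h such that the node has at least h links, each of weight at least h. The function name and the example weights are hypothetical.

```python
def h_degree(weights):
    """Largest h such that at least h incident links have weight >= h."""
    h = 0
    for rank, w in enumerate(sorted(weights, reverse=True), start=1):
        if w >= rank:
            h = rank
        else:
            break
    return h


# Co-occurrence weights of the links incident to two hypothetical tags.
strong_links = h_degree([5, 3, 3, 1])
weak_links = h_degree([1, 1, 1, 1])
```

Both tags have four neighbours, so degree centrality cannot tell them apart, whereas h-degree favours the tag whose links are also strong.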
Conference Paper
Search engines usually do their jobs well. However, due to the fact that most existing search algorithms are keyword-based, search engines may not work as expected in some scenarios when ambiguity problems are encountered. A possible approach to overcome it is to categorize Web resources in advance. In this research, a k-means variation, the keen-means algorithm, along with its implementation is proposed. The algorithm will dynamically and automatically adjust the k value to achieve better results.
Article
Ontology is the backbone of the Semantic Web, helping users search for relevant resources from the Web of linked data. The existing context-free mapping approach between tags and concepts fails to address the problems of social synonymy and social polysemy when ontologies are induced from folksonomies. The novel contributions of this paper are threefold. First, grounded in the cognitively motivated category utility measure, a novel basic-level concept mining algorithm is developed to construct semantically rich concept vectors to alleviate the problem of social synonymy. Second, contextual aspects of ontology learning are exploited via probabilistic topic modeling to address the problem of social polysemy. Third, a novel context-sensitive domain ontology learning algorithm that combines link- and content-based semantic analysis is developed to identify both taxonomic and associative relations among concepts. To the best of our knowledge, this is the first successful research that exploits a cognitively motivated method to learn context-sensitive domain ontologies from folksonomies. By using the Open Directory Project ontology as a benchmark, we examined the effectiveness of the proposed algorithms based on social annotations crawled from three different folksonomy sites. Our experimental results show that the proposed ontology learning system significantly outperforms the best baseline system by 13083% in terms of taxonomic F-measure. The practical implication of our research is that high-quality ontologies are constructed with minimal human intervention to facilitate concept-driven retrieval of linked data and the knowledge-based interoperability among enterprises.
Article
In this case study, we demonstrate how in an integrated digital library and course management system, metadata can be generated using a bootstrapping mechanism. The integration encompasses sequencing of content by teachers and deployment of content to learners. We show that taxonomy term assignments and a recommender system can be based almost solely on usage data (especially correlations on what teachers have put in the same course or assignment). In particular, we show that with minimal human intervention, taxonomy terms, quality measures, and an association ruleset can be established for a large pool of fine-granular educational assets.
Conference Paper
Semantic web service technologies have been proposed to enable automatic web service discovery and composition. However, such approaches suffer from the significant effort required to construct domain ontologies and for third parties to annotate web services with semantics. Hence, social and collaborative tagging systems have been gaining popularity on the web. Folksonomy-based web service annotation is emerging, i.e., annotating web services with semantics from community-generated folksonomies. This paper focuses on how to provide folksonomy-based in-depth annotation of web services. Here, in-depth means that the annotation is based on a structured folksonomy and reaches inside the different parts of a web service in an automatic way. Two problems need to be addressed: deriving semantics for the folksonomy from the original tags of web services, and automatically assigning tags to the different parts of web services. The paper proposes an approach to automatic tag assignment for web services with a structured folksonomy. Such in-depth annotation facilitates web service discovery and composition by providing precise tagging of inputs, outputs, and so on. A case study and experimental results on tag-service pairs extracted from a web service portal, seekda, illustrate the effectiveness of the approach.
Article
Novel social media collaboration platforms, such as games with a purpose and mechanised labour marketplaces, are increasingly used for enlisting large populations of non-experts in crowdsourced knowledge acquisition processes. Climate Quiz uses this paradigm for acquiring environmental domain knowledge from non-experts. The game's usage statistics and the quality of the produced data show that Climate Quiz has managed to attract a large number of players but noisy input data and task complexity led to low player engagement and suboptimal task throughput and data quality. To address these limitations, the authors propose embedding the game into a hybrid-genre workflow, which supplements the game with a set of tasks outsourced to micro-workers, thus leveraging the complementary nature of games with a purpose and mechanised labour platforms. Experimental evaluations suggest that such workflows are feasible and have positive effects on the game's enjoyment level and the quality of its output.
Conference Paper
For community managers and hosts it is not only important to identify the current key topics of a community but also to assess the specificity level of the community for: a) creating sub-communities, and b) anticipating community behaviour and topical evolution. In this paper we present an approach that empirically characterises the topical specificity of online community forums by measuring the abstraction of semantic concepts discussed within such forums. We present a range of concept abstraction measures that function over concept graphs (i.e., resource type-hierarchies and SKOS category structures) and demonstrate the efficacy of our method with an empirical evaluation using a ground truth ranking of forums. Our results show that the proposed approach outperforms a random baseline and that resource type-hierarchies work well when predicting the topical specificity of any forum with various abstraction measures.
Conference Paper
In this article, the authors present a novel approach for computing semantic relatedness and conduct a large-scale study of it on Wikipedia. Unlike existing semantic analysis methods that utilize Wikipedia's content or link structure, the authors propose to use human navigational paths on Wikipedia for this task. The authors obtain 1.8 million human navigational paths from a semi-controlled navigation experiment - a Wikipedia-based navigation game, in which users are required to find short paths between two articles in a given Wikipedia article network. The authors' results are intriguing: They suggest that (i) semantic relatedness computed from human navigational paths may be more precise than semantic relatedness computed from Wikipedia's plain link structure alone and (ii) that not all navigational paths are equally useful. Intelligent selection based on path characteristics can improve accuracy. The authors' work makes an argument for expanding the existing arsenal of data sources for calculating semantic relatedness and to consider the utility of human navigational paths for this task.
Conference Paper
Recently, a number of algorithms have been proposed to obtain hierarchical structures, so-called folksonomies, from social tagging data. Work on these algorithms is in part driven by a belief that folksonomies are useful for tasks such as (a) navigating social tagging systems and (b) acquiring semantic relationships between tags. While the promises and pitfalls of the latter have been studied to some extent, we know very little about the extent to which folksonomies are pragmatically useful for navigating social tagging systems. This paper sets out to address this gap by presenting and applying a pragmatic framework for evaluating folksonomies. We model exploratory navigation of a tagging system as decentralized search on a network of tags. Evaluation is based on the fact that the performance of a decentralized search algorithm depends on the quality of the background knowledge used. The key idea of our approach is to use hierarchical structures learned by folksonomy algorithms as background knowledge for decentralized search. Utilizing decentralized search on tag networks in combination with different folksonomies as hierarchical background knowledge allows us to evaluate navigational tasks in social tagging systems. Our experiments with four state-of-the-art folksonomy algorithms on five different social tagging datasets reveal that existing folksonomy algorithms exhibit significant, previously undiscovered, differences with regard to their utility for navigation. Our results are relevant for engineers aiming to improve navigability of social tagging systems and for scientists aiming to evaluate different folksonomy algorithms from a pragmatic perspective.
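The idea of using a hierarchy as background knowledge for decentralized search can be illustrated with a minimal sketch. It is not the authors' implementation: the toy tag network, the toy hierarchy, and the names `tree_distance` and `greedy_search` are assumptions made for illustration. At each step the searcher sees only the neighbours of the current tag and moves to the one that the hierarchy places closest to the target.

```python
def tree_distance(parent, a, b):
    """Number of edges between tags a and b in a rooted tree (child -> parent map)."""
    depth_of = {}
    node, d = a, 0
    while node is not None:          # record every ancestor of a with its distance
        depth_of[node] = d
        node = parent.get(node)
        d += 1
    node, d = b, 0
    while node not in depth_of:      # climb from b until a shared ancestor is hit
        if node is None:
            raise ValueError("tags are not in the same tree")
        node = parent.get(node)
        d += 1
    return depth_of[node] + d


def greedy_search(neighbors, parent, start, target, max_hops=20):
    """Greedy decentralized search: only local links are visible, and the
    hierarchy is the sole source of (approximate) distance to the target."""
    path, visited, current = [start], {start}, start
    while current != target and len(path) - 1 < max_hops:
        candidates = [n for n in neighbors[current] if n not in visited]
        if not candidates:
            return None              # dead end: the search fails
        current = min(candidates, key=lambda n: tree_distance(parent, n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None


# Hypothetical folksonomy (child -> parent) and tag network (adjacency lists).
parent = {"rock": "music", "jazz": "music", "metal": "rock",
          "punk": "rock", "bebop": "jazz"}
neighbors = {
    "music": ["rock", "jazz"],
    "rock": ["music", "metal", "punk"],
    "jazz": ["music", "bebop"],
    "metal": ["rock", "punk"],
    "punk": ["rock", "metal"],
    "bebop": ["jazz"],
}
path = greedy_search(neighbors, parent, "metal", "bebop")
```

Under this sketch, a folksonomy that orders tags well yields short successful paths, while a poor one sends the searcher into dead ends or long detours; success rate and path length over many start/target pairs could then serve as a navigational quality score.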
Conference Paper
This paper is concerned with the problem of browsing social annotations. Today, many services (e.g., Del.icio.us, Flickr) help users manage and share their favorite URLs and photos based on social annotations. Due to the exponential increase in social annotations, however, more and more users face the problem of how to effectively find desired resources in large annotation data. Existing methods such as tag clouds and annotation matching work well only on small annotation sets. Thus, an effective approach for browsing large-scale annotation sets and the associated resources is in great demand by both ordinary users and service providers. In this paper, we propose a novel algorithm, namely Effective Large Scale Annotation Browser (ELSABer), to browse large-scale social annotation data. ELSABer helps users browse a huge number of annotations in a semantic, hierarchical and efficient way. More specifically, ELSABer has the following features: 1) the semantic relations between annotations are explored for browsing of similar resources; 2) the hierarchical relations between annotations are constructed for browsing in a top-down fashion; 3) the distribution of social annotations is studied for efficient browsing. By incorporating personal and time information, ELSABer can be further extended for personalized and time-related browsing. A prototype system is implemented and shows promising results.
Conference Paper
It is a widely held belief among designers of social tagging systems that tag clouds represent a useful tool for navigation. This is evident in, for example, the increasing number of tagging systems offering tag clouds for navigational purposes, which hints towards an implicit assumption that tag clouds support efficient navigation. In this paper, we examine and test this assumption from a network-theoretic perspective, and show that in many cases it does not hold. We first model navigation in tagging systems as a bipartite graph of tags and resources and then simulate the navigation process in such a graph. We use network-theoretic properties to analyse the navigability of three tagging datasets with regard to different user interface restrictions imposed by tag clouds. Our results confirm that tag resource networks have efficient navigation properties in theory, but they also show that popular user interface decisions (such as “pagination” combined with reverse-chronological listing of resources) significantly impair the potential of tag clouds as a useful tool for navigation. Based on our findings, we identify a number of avenues for further research and the design of novel tag cloud construction algorithms. Our work is relevant for researchers interested in navigability of emergent hypertext structures, and for engineers seeking to improve the navigability of social tagging systems.
Conference Paper
The proliferation of social Web technologies such as collaborative tagging has led to an increasing awareness of their vulnerability to misuse. Attackers may attempt to distort the system's adaptive behavior by inserting erroneous or misleading annotations, thus altering the way in which information is presented to legitimate users. Prior work on recommender systems has shown that studying different attack types, their properties and their impact, can help identify robust algorithms that make these systems more secure and less vulnerable to manipulation. Unlike traditional recommender systems, a tagging system includes multiple retrieval algorithms to facilitate browsing of resources, users and tags. The challenge is, therefore, evaluating the impact of various types of attacks across different navigation options. In this paper we develop a framework for characterizing attacks against tagging systems. We then propose a methodology for evaluating their global impact based on PageRank. Using real data from a popular tagging system, we empirically evaluate the effectiveness of several attack types. Our results help us understand how much effort is needed from an attacker to change the behavior of a tagging system and which attack types are more successful against such systems.
Conference Paper
In recent years several measures for the gold standard based evaluation of ontology learning were proposed. They can be distinguished by the layers of an ontology (e.g. lexical term layer and concept hierarchy) they evaluate. Judging those measures with a list of criteria we show that there exist some measures sufficient for evaluating the lexical term layer. However, existing measures for the evaluation of concept hierarchies fail to meet basic criteria. This paper presents a new taxonomic measure which overcomes the problems of current approaches.
Article
We develop a geometric framework to study the structure and function of complex networks. We assume that hyperbolic geometry underlies these networks, and we show that with this assumption, heterogeneous degree distributions and strong clustering in complex networks emerge naturally as simple reflections of the negative curvature and metric property of the underlying hyperbolic geometry. Conversely, we show that if a network has some metric structure, and if the network degree distribution is heterogeneous, then the network has an effective hyperbolic geometry underneath. We then establish a mapping between our geometric framework and statistical mechanics of complex networks. This mapping interprets edges in a network as noninteracting fermions whose energies are hyperbolic distances between nodes, while the auxiliary fields coupled to edges are linear functions of these energies or distances. The geometric network ensemble subsumes the standard configuration model and classical random graphs as two limiting cases with degenerate geometric structures. Finally, we show that targeted transport processes without global topology knowledge, made possible by our geometric framework, are maximally efficient, according to all efficiency measures, in networks with strongest heterogeneity and clustering, and that this efficiency is remarkably robust with respect to even catastrophic disturbances and damages to the network structure.
Article
Many social Web sites allow users to annotate the content with descriptive metadata, such as tags, and more recently to organize content hierarchically. These types of structured metadata provide valuable evidence for learning how a community organizes knowledge. For instance, we can aggregate many personal hierarchies into a common taxonomy, also known as a folksonomy, that will aid users in visualizing and browsing social content, and also help them in organizing their own content. However, learning from social metadata presents several challenges, since it is sparse, shallow, ambiguous, noisy, and inconsistent. We describe an approach to folksonomy learning based on relational clustering, which exploits structured metadata contained in personal hierarchies. Our approach clusters similar hierarchies using their structure and tag statistics, then incrementally weaves them into a deeper, bushier tree. We study folksonomy learning using social metadata extracted from the photo-sharing site Flickr, and demonstrate that the proposed approach addresses the challenges. Moreover, compared to previous work, the approach produces larger, more accurate folksonomies, and in addition, scales better. Comment: 10 pages, To appear in the Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2010
Article
Structured and semi-structured data describing entities, taxonomies and ontologies appear in many domains. There is a huge interest in integrating structured information from multiple sources; however, integrating structured data to infer complex common structures is a difficult task because the integration must aggregate similar structures while avoiding structural inconsistencies that may appear when the data is combined. In this work, we study the integration of structured social metadata: shallow personal hierarchies specified by many individual users on the Social Web, and focus on inferring a collection of integrated, consistent taxonomies. We frame this task as an optimization problem with structural constraints. We propose a new inference algorithm, which we refer to as Relational Affinity Propagation (RAP), that extends affinity propagation (Frey and Dueck 2007) by introducing structural constraints. We validate the approach on a real-world social media dataset, collected from the photo-sharing website Flickr. Our empirical results show that our proposed approach is able to construct deeper and denser structures compared to an approach using only the standard affinity propagation algorithm. Comment: 6 Pages, To appear at AAAI Workshop on Statistical Relational AI
Article
Web 2.0 applications have attracted a considerable amount of attention because their open-ended nature allows users to create light-weight semantic scaffolding to organize and share content. To date, the interplay of the social and semantic components of social media has been only partially explored. Here we focus on Flickr and Last.fm, two social media systems in which we can relate the tagging activity of the users with an explicit representation of their social network. We show that a substantial level of local lexical and topical alignment is observable among users who lie close to each other in the social network. We introduce a null model that preserves user activity while removing local correlations, allowing us to disentangle the actual local alignment between users from statistical effects due to the assortative mixing of user activity and centrality in the social network. This analysis suggests that users with similar topical interests are more likely to be friends, and therefore semantic similarity measures among users based solely on their annotation metadata should be predictive of social links. We test this hypothesis on the Last.fm data set, confirming that the social network constructed from semantic similarity captures actual friendship more accurately than Last.fm's suggestions based on listening patterns. Comment: http://portal.acm.org/citation.cfm?doid=1718487.1718521
Article
Also published in: Proceedings. WWW2007 Workshop "Tagging and Metadata for Social Information Organization". Banff, 2007. Social resource sharing systems like YouTube and del.icio.us have acquired a large number of users within the last few years. They provide rich resources for data analysis, information retrieval, and knowledge discovery applications. A first step towards this end is to gain better insights into content and structure of these systems. In this paper, we will analyse the main network characteristics of two of the systems. We consider their underlying data structures – so-called folksonomies – as tri-partite hypergraphs, and adapt classical network measures like characteristic path length and clustering coefficient to them. Subsequently, we introduce a network of tag co-occurrence and investigate some of its statistical properties, focusing on correlations in node connectivity and pointing out features that reflect emergent semantics within the folksonomy. We show that simple statistical indicators unambiguously spot non-social behavior such as spam.
Article
Many communication and social networks have power-law link distributions, containing a few nodes that have a very high degree and many with low degree. The high connectivity nodes play the important role of hubs in communication and networking, a fact that can be exploited when designing efficient search algorithms. We introduce a number of local search strategies that utilize high degree nodes in power-law graphs and that have costs scaling sublinearly with the size of the graph. We also demonstrate the utility of these strategies on the GNUTELLA peer-to-peer network.
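A degree-seeking local search of the kind this abstract alludes to can be sketched as follows. This is a simplified illustration under assumptions, not the strategy exactly as published: each node knows only its own neighbours and their degrees, the walk never revisits a node, and it gives up at a dead end. The toy graph and the name `degree_seeking_search` are hypothetical.

```python
def degree_seeking_search(graph, start, target, max_steps=100):
    """Walk towards hubs: at each step, finish if the target is a neighbour,
    otherwise move to the unvisited neighbour with the highest degree."""
    path, visited, current = [start], {start}, start
    for _ in range(max_steps):
        if current == target:
            return path
        if target in graph[current]:
            path.append(target)
            return path
        unvisited = [n for n in graph[current] if n not in visited]
        if not unvisited:
            return None              # simplification: no backtracking
        current = max(unvisited, key=lambda n: len(graph[n]))
        visited.add(current)
        path.append(current)
    return None


# Hypothetical graph with one hub "h" (adjacency lists, undirected).
graph = {
    "s": ["a", "x"], "x": ["s"], "a": ["s", "h"],
    "h": ["a", "b", "c", "d"], "b": ["h"], "c": ["h"],
    "d": ["h", "t"], "t": ["d"],
}
route = degree_seeking_search(graph, "s", "t")
```

The design intuition is that in a power-law graph a hub is adjacent to a large share of all nodes, so reaching one quickly brings the target within a hop or two.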
Article
Can we model the scale-free distribution of Web hypertext degree under realistic assumptions about the behavior of page authors? Can a Web crawler efficiently locate an unknown relevant page? These questions are receiving much attention due to their potential impact for understanding the structure of the Web and for building better search engines. Here I investigate the connection between the linkage and content topology of Web pages. The relationship between a text-induced distance metric and a link-based neighborhood probability distribution displays a phase transition between a region where linkage is not determined by content and one where linkage decays according to a power law. This relationship is used to propose a Web growth model that is shown to accurately predict the distribution of Web page degree, based on textual content and assuming only local knowledge of degree for existing pages. A qualitatively similar phase transition is found between linkage and semantic distance, with an exponential decay tail. Both relationships suggest that efficient paths can be discovered by decentralized Web navigation algorithms based on textual and/or categorical cues.
Conference Paper
The spherical k-means algorithm, i.e., the k-means algorithm with cosine similarity, is a popular method for clustering high-dimensional text data. In this algorithm, each document as well as each cluster mean is represented as a high-dimensional unit-length vector. However, it has been mainly used in batch mode. That is, each cluster mean vector is updated, only after all document vectors have been assigned, as the (normalized) average of all the document vectors assigned to that cluster. This paper investigates an online version of the spherical k-means algorithm based on the well-known winner-take-all competitive learning. In this online algorithm, each cluster centroid is incrementally updated given a document. We demonstrate that the online spherical k-means algorithm can achieve significantly better clustering results than the batch version, especially when an annealing-type learning rate schedule is used. We also present heuristics to improve the speed, yet almost without loss of clustering quality.
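The winner-take-all update described above can be sketched in a few lines. This is a minimal illustration under assumptions rather than the paper's implementation: documents are non-zero, non-negative term vectors, the initial centroids are supplied by the caller, and the exponentially decaying learning rate (from `lr_start` to `lr_end`) is one plausible annealing-type schedule chosen for the example.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def online_spherical_kmeans(docs, init_centroids, epochs=5,
                            lr_start=1.0, lr_end=0.01):
    docs = [normalize(d) for d in docs]
    centroids = [normalize(c) for c in init_centroids]
    total, step = epochs * len(docs), 0
    for _ in range(epochs):
        for x in docs:
            # Annealing: the rate decays exponentially from lr_start to lr_end.
            lr = lr_start * (lr_end / lr_start) ** (step / total)
            # Winner-take-all: only the most similar centroid is updated.
            winner = max(range(len(centroids)),
                         key=lambda k: dot(centroids[k], x))
            moved = [c + lr * xi for c, xi in zip(centroids[winner], x)]
            centroids[winner] = normalize(moved)  # project back to the sphere
            step += 1
    labels = [max(range(len(centroids)), key=lambda k: dot(centroids[k], x))
              for x in docs]
    return centroids, labels


# Hypothetical term vectors: two documents per topic.
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1]]
centroids, labels = online_spherical_kmeans(docs, [[1, 0, 0], [0, 1, 0]])
```

In contrast to batch mode, each centroid here moves immediately after a single document is seen, and renormalizing after every step keeps it on the unit sphere so that the dot product remains a cosine similarity.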
Article
Ontology Learning for the Semantic Web explores techniques for applying knowledge discovery techniques to different web data sources (such as HTML documents, dictionaries, etc.), in order to support the task of engineering and maintaining ontologies. The approach of ontology learning proposed in Ontology Learning for the Semantic Web includes a number of complementary disciplines that feed in different types of unstructured and semi-structured data. This data is necessary in order to support a semi-automatic ontology engineering process. Ontology Learning for the Semantic Web is designed for researchers and developers of semantic web applications. It also serves as an excellent supplemental reference to advanced level courses in ontologies and the semantic web.
Article
The participatory nature of many Web 2.0 platforms makes a large portion of users' interactions with each other and with information resources digitally observable. The assumption that the evolving structure of these digital records contains implicit evidence for the underlying semantics has been proven by successful approaches of making the emergent semantics explicit, e.g. in the form of light-weight ontologies. In this paper, we provide further evidence for the great potential of self-emerging ontologies from Web 2.0 data, exemplified by collaborative tagging systems. We hereby combine and extend prior research, where we identified crucial aspects for successful methods to infer tag semantics. The additional contribution of this paper is to propose an extended methodology to induce a hierarchical organization scheme from the initially flat tag space which captures the semantics and the diversity of the shared knowledge. It comprises the introduction of a synsetized folksonomy (which tackles the problem of synonymous tags) and a clustering approach for tag sense disambiguation. In order to assess the quality of the learned semantics, we compare the inferred organization scheme with manually built categorization schemes from WordNet and Wikipedia. Our results exhibit clear similarities; so in summary, our work demonstrates a successful example of self-emergent ontologies from Web 2.0 data.
Article
Collaborative tagging systems—systems where many casual users annotate objects with free-form strings (tags) of their choosing—have recently emerged as a powerful way to label and organize large collections of data. During our recent investigation into these types of systems, we discovered a simple but remarkably effective algorithm for converting a large corpus of tags annotating objects in a tagging system into a navigable hierarchical taxonomy of tags. We first discuss the algorithm and then present a preliminary model to explain why it is so effective in these types of systems.
Conference Paper
In our work we extend the traditional bipartite model of ontologies with the social dimension, leading to a tripartite model of actors, concepts and instances. We demonstrate the application of this representation by showing how community-based semantics emerges from this model through a process of graph transformation. We illustrate ontology emergence by two case studies, an analysis of a large scale folksonomy system and a novel method for the extraction of community-based ontologies from Web pages.
Article
We address the question of how participants in a small world experiment are able to find short paths in a social network using only local information about their immediate contacts. We simulate such experiments on a network of actual email contacts within an organization as well as on a student social networking website. On the email network we find that small world search strategies using a contact’s position in physical space or in an organizational hierarchy relative to the target can effectively be used to locate most individuals. However, we find that in the online student network, where the data is incomplete and hierarchical structures are not well defined, local search strategies are less effective. We compare our findings to recent theoretical hypotheses about underlying social structure that would enable these simple search strategies to succeed and discuss the implications to social software design.
Article
In our work the traditional bipartite model of ontologies is extended with the social dimension, leading to a tripartite model of actors, concepts and instances. We demonstrate the application of this representation by showing how community-based semantics emerges from this model through a process of graph transformation. We illustrate ontology emergence by two case studies, an analysis of a large scale folksonomy system and a novel method for the extraction of community-based ontologies from Web pages.
Conference Paper
Today, a number of algorithms exist for constructing tag hierarchies from social tagging data. While these algorithms were designed with ontological goals in mind, we know very little about their properties from an information retrieval perspective, such as whether these tag hierarchies support efficient navigation in social tagging systems. The aim of this paper is to investigate the usefulness of such tag hierarchies (sometimes also called folksonomies - from folk-generated taxonomy) as directories that aid navigation in social tagging systems. To this end, we simulate navigation of directories as decentralized search on a network of tags using Kleinberg's model. In this model, a tag hierarchy can be applied as background knowledge for decentralized search. By constraining the visibility of nodes in the directories we aim to mimic typical constraints imposed by a practical user interface (UI), such as limiting the number of displayed subcategories or related categories. Our experiments on five different social tagging datasets show that existing tag hierarchy algorithms can support navigation in theory, but our results also demonstrate that they face tremendous challenges when user interface (UI) restrictions are taken into account. Based on this observation, we introduce a new algorithm that constructs efficiently navigable directories on our datasets. The results are relevant for engineers and scientists aiming to improve navigability of social tagging systems.
Conference Paper
Social bookmarking systems allow users to organise collections of resources on the Web in a collaborative fashion. The increasing popularity of these systems as well as first insights into their emergent semantics have made them relevant to disciplines like knowledge extraction and ontology learning. The problem of devising methods to measure the semantic relatedness between tags and characterizing it semantically is still largely open. Here we analyze three measures of tag relatedness: tag co-occurrence, cosine similarity of co-occurrence distributions, and FolkRank, an adaptation of the PageRank algorithm to folksonomies. Each measure is computed on tags from a large-scale dataset crawled from the social bookmarking system del.icio.us. To provide a semantic grounding of our findings, a connection to WordNet (a semantic lexicon for the English language) is established by mapping tags into synonym sets of WordNet, and applying there well-known metrics of semantic similarity. Our results clearly expose different characteristics of the selected measures of relatedness, making them applicable to different subtasks of knowledge extraction such as synonym detection or discovery of concept hierarchies.
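The first two relatedness measures named above, tag co-occurrence and cosine similarity of co-occurrence distributions, can be contrasted with a small sketch. The posts and the function names `cooccurrence` and `cosine` are made up for illustration, and the code is a simplified reading of the measures, not the paper's exact formulation.

```python
import math
from collections import defaultdict
from itertools import combinations


def cooccurrence(posts):
    """Count, for every pair of tags, the number of posts they share."""
    counts = defaultdict(lambda: defaultdict(int))
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def cosine(counts, a, b):
    """Cosine similarity between the co-occurrence vectors of tags a and b."""
    va, vb = counts.get(a, {}), counts.get(b, {})
    num = sum(w * vb[t] for t, w in va.items() if t in vb)
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return num / (na * nb) if na and nb else 0.0


# Hypothetical posts: "nyc" and "newyork" never appear together,
# yet they are used alongside the same tags.
posts = [
    ["nyc", "travel", "photo"],
    ["newyork", "travel", "photo"],
    ["nyc", "food"],
    ["newyork", "food"],
    ["cat", "photo"],
]
counts = cooccurrence(posts)
direct = counts["nyc"].get("newyork", 0)
similarity = cosine(counts, "nyc", "newyork")
```

In this toy data the two spellings have zero direct co-occurrence but identical co-occurrence vectors, which illustrates why a context-based measure may suit synonym detection better than raw co-occurrence.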
Conference Paper
In social bookmark tools users are setting up lightweight conceptual structures called folksonomies. Currently, the information retrieval support is limited. We present a formal model and a new search algorithm for folksonomies, called FolkRank, that exploits the structure of the folksonomy. The proposed algorithm is also applied to find communities within the folksonomy and is used to structure search results. All findings are demonstrated on a large scale dataset. A long version of this paper has been published at the European Semantic Web Conference 2006 (3).
Conference Paper
Recent research provides evidence for the presence of emergent semantics in collaborative tagging systems. While several methods have been proposed, little is known about the factors that influence the evolution of semantic structures in these systems. A natural hypothesis is that the quality of the emergent semantics depends on the pragmatics of tagging: Users with certain usage patterns might contribute more to the resulting semantics than others. In this work, we propose several measures which enable a pragmatic differentiation of taggers by their degree of contribution to emerging semantic structures. We distinguish between categorizers, who typically use a small set of tags as a replacement for hierarchical classification schemes, and describers, who are annotating resources with a wealth of freely associated, descriptive keywords. To study our hypothesis, we apply semantic similarity measures to 64 different partitions of a real-world and large-scale folksonomy containing different ratios of categorizers and describers. Our results not only show that "verbose" taggers are most useful for the emergence of tag semantics, but also that a subset containing only 40% of the most 'verbose' taggers can produce results that match and even outperform the semantic precision obtained from the whole dataset. Moreover, the results suggest that there exists a causal link between the pragmatics of tagging and resulting emergent semantics. This work is relevant for designers and analysts of tagging systems interested (i) in fostering the semantic development of their platforms, (ii) in identifying users introducing "semantic noise", and (iii) in learning ontologies.
Conference Paper
We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains roughly 900,000 entities and 5,000,000 facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from the unification of Wikipedia and WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships, and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information
Article
Article
Social networks have the surprising property of being “searchable”: Ordinary people are capable of directing messages through their network of acquaintances to reach a specific but distant target person in only a few steps. We present a model that offers an explanation of social network searchability in terms of recognizable personal identities: sets of characteristics measured along a number of social dimensions. Our model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems, including the location of data files in peer-to-peer networks, pages on the World Wide Web, and information in distributed databases.
Article
Targeted or quasi-targeted propagation of information is a fundamental process running in complex networked systems. Optimal communication in a network is easy to achieve if all its nodes have a full view of the global topological structure of the network. However many complex networks manifest communication efficiency without nodes having a full view of the network, and yet there is no generally applicable explanation of what mechanisms may render efficient such communication in the dark. In this work we model this communication as an oblivious routing process greedily operating on top of a network and making its decisions based only on distances within a hidden metric space lying underneath. Abstracting intrinsic similarities among networked elements, this hidden metric space is not easily reconstructible from the visible network topology. Yet we find that the structure of complex networks observed in reality, characterized by strong clustering and specific values of exponents of power-law degree distributions, maximizes their navigability, i.e., the efficiency of the greedy path-finding strategy in this hidden framework. We explain this observation by showing that more navigable networks have more prominent hierarchical structures which are congruent with the optimal layout of routing paths through a network. This finding has potentially profound implications for constructing efficient routing and searching strategies in communication and social networks, such as the Internet, Web, etc., and merits further research that would explain whether navigability of complex networks does indeed follow naturally from specifics of their evolution.
Conference Paper
We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexicosyntactic matching. As a result we are able to derive a large scale taxonomy containing a large amount of subsumption, i.e. isa, relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as computing semantic similarity between words in benchmarking datasets.
Article
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
Conference Paper
The usability and strong social dimension of Web 2.0 applications have encouraged users to create, annotate and share their content, thus leading to a rich and content-intensive Web. Despite that, Web 2.0 content lacks the explicit semantics that would allow it to be used in large-scale intelligent applications. At the same time, advances in Semantic Web technologies promise intelligent applications capable of integrating distributed content and knowledge from various heterogeneous resources. We present FLOR, a tool that performs semantic enrichment of folksonomy tagspaces by exploiting online ontologies, thesauri and other knowledge sources.
Article
The Internet infrastructure is severely stressed. Rapidly growing overheads associated with the primary function of the Internet (routing information packets between any two computers in the world) cause concerns among Internet experts that the existing Internet routing architecture may not sustain even another decade. In this paper, we present a method to map the Internet to a hyperbolic space. Guided by a constructed map, which we release with this paper, Internet routing exhibits scaling properties that are theoretically close to the best possible, thus resolving serious scaling limitations that the Internet faces today. Besides this immediate practical viability, our network mapping method can provide a different perspective on the community structure in complex networks.
Article
The study of complex networks has emerged over the past several years as a theme spanning many disciplines, ranging from mathematics and computer science to the social and biological sciences. A significant amount of recent work in this area has focused on the development of random graph models that capture some of the qualitative properties observed in large-scale network data; such models have the potential to help us reason, at a general level, about the ways in which real-world networks are organized. We survey one particular line of network research, concerned with small-world phenomena and decentralized search algorithms, that illustrates this style of analysis. We begin by describing a well-known experiment that provided the first empirical basis for the "six degrees of separation" phenomenon in social networks; we then discuss some probabilistic network models motivated by this work, illustrating how these models lead to novel algorithmic and graph-theoretic questions, and how they are supported by recent empirical studies of large social networks.
Article
Collaborative tagging systems are now popular tools for organising and sharing information on the Web. While collaborative tagging systems offer many advantages over the use of controlled vocabularies, they also suffer from problems such as the existence of polysemous tags. We investigate how the different contexts in which individual tags are used can be revealed automatically without consulting any external resources. We consider several different network representations of tags and documents, and apply a graph clustering algorithm on these networks to obtain groups of tags or documents corresponding to the different meanings of an ambiguous tag. Our experiments show that networks which explicitly take the social context into account are more likely to give a better picture of the semantics of a tag.
Article
Traditional Web search engines mostly adopt a keyword-based approach. When the keyword submitted by the user is ambiguous, the search results usually consist of documents related to various meanings of the keyword, while the user is probably interested in only one of them. In this paper we attempt to provide a solution to this problem using a k-nearest-neighbour approach to classify documents returned by a search engine, by building classifiers using data collected from collaborative tagging systems. Experiments on search results returned by Google show that our method is able to classify the documents returned with high precision.
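As a rough illustration of the k-nearest-neighbour idea mentioned in this abstract, the sketch below classifies a bag-of-words document by majority vote among its most similar labelled examples. The cosine similarity, the value of `k`, and the toy "jaguar" data are assumptions made for illustration; the abstract does not specify the features or similarity measure the authors used.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, labelled, k=3):
    """Return the majority label among the k labelled items most similar to query."""
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: term vectors standing in for tagged documents.
labelled = [
    (Counter("jaguar cat jungle predator".split()), "animal"),
    (Counter("jaguar habitat rainforest cat".split()), "animal"),
    (Counter("jaguar car engine luxury".split()), "car"),
    (Counter("jaguar sedan engine speed".split()), "car"),
    (Counter("jaguar dealer car price".split()), "car"),
]

car_query = Counter("jaguar engine car review".split())
animal_query = Counter("jaguar cat rainforest".split())
print(knn_classify(car_query, labelled), knn_classify(animal_query, labelled))
```

In such a setup the labelled examples would come from a tagging system, with the tags supplying the class labels for each sense of the ambiguous keyword.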
Article
Also published in: Moor, Aldo de et al. (eds.): Proceedings of the First Conceptual Structures Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures. Aalborg: Universitetsforlag, 2006, pp. 87-102. Social bookmark tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. The reason for their immediate success is the fact that no specific skills are needed for participating. In this paper we specify a formal model for folksonomies and briefly describe our own system BibSonomy, which allows for sharing both bookmarks and publication references in a kind of personal library.
Article
Also published in: Batagelj, Vladimir et al. (eds.): Data science and classification (Studies in classification, data analysis, and knowledge organization). Berlin et al.: Springer, 2006, pp. 261-270. ISBN 3-540-34415-2 - 978-3-540-34415-5 (The original publication is available at www.springerlink.com). Social bookmark tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. These systems currently provide relatively little structure. In this paper, we discuss how association rule mining can be adopted to analyze and structure folksonomies, and how the results can be used for ontology learning and supporting emergent semantics. We demonstrate our approach on a large-scale dataset stemming from an online system.
Article
Networks of coupled dynamical systems have been used to model biological oscillators, Josephson junction arrays, excitable media, neural networks, spatial games, genetic control networks and many other self-organizing systems. Ordinarily, the connection topology is assumed to be either completely regular or completely random. But many biological, technological and social networks lie somewhere between these two extremes. Here we explore simple models of networks that can be tuned through this middle ground: regular networks 'rewired' to introduce increasing amounts of disorder. We find that these systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. We call them 'small-world' networks, by analogy with the small-world phenomenon (popularly known as six degrees of separation). The neural network of the worm Caenorhabditis elegans, the power grid of the western United States, and the collaboration graph of film actors are shown to be small-world networks. Models of dynamical systems with small-world coupling display enhanced signal-propagation speed, computational power, and synchronizability. In particular, infectious diseases spread more easily in small-world networks than in regular lattices.
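The rewiring idea described in this abstract can be sketched in a few lines: start from a ring lattice, rewire each edge with probability `p`, and compare clustering with average path length. This is a minimal sketch under assumptions; the function names, the choice of which endpoint to move, and the parameter values are illustrative and not taken from the paper.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbours (k even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def rewire(adj, p, rng):
    """With probability p, move one endpoint of each original edge to a random node."""
    n = len(adj)
    edges = [(v, w) for v in adj for w in adj[v] if v < w]
    for v, w in edges:
        if rng.random() < p:
            # Avoid self-loops and duplicate edges.
            candidates = [u for u in range(n) if u != v and u not in adj[v]]
            if not candidates:
                continue
            u = rng.choice(candidates)
            adj[v].discard(w)
            adj[w].discard(v)
            adj[v].add(u)
            adj[u].add(v)
    return adj

def clustering(adj):
    """Mean fraction of a node's neighbour pairs that are themselves linked."""
    coeffs = []
    for v in adj:
        nbrs = list(adj[v])
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

def average_path_length(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS from each node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

for p in (0.0, 0.1, 1.0):
    g = rewire(ring_lattice(200, 6), p, random.Random(42))
    print(p, round(clustering(g), 3), round(average_path_length(g), 3))
```

Running the loop for several values of `p` lets one inspect how the two statistics change between the regular and the random extreme.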
Article
The small-world phenomenon - the principle that most of us are linked by short chains of acquaintances - was first investigated as a question in sociology and is a feature of a range of networks arising in nature and technology. Experimental study of the phenomenon revealed that it has two fundamental components: first, such short chains are ubiquitous, and second, individuals operating with purely local information are very adept at finding these chains. The first issue has been analysed, and here I investigate the second by modelling how individuals can find short chains in a large social network.
Article
We consider methods for quantifying the similarity of vertices in networks. We propose a measure of similarity based on the concept that two vertices are similar if their immediate neighbors in the network are themselves similar. This leads to a self-consistent matrix formulation of similarity that can be evaluated iteratively using only a knowledge of the adjacency matrix of the network. We test our similarity measure on computer-generated networks for which the expected results are known, and on a number of real-world networks.
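To make the "self-consistent matrix formulation" concrete, the sketch below iterates S = alpha * A * S + I starting from the identity, so that two vertices become similar when their neighbours are similar. The exact form of the update, the damping factor `alpha`, and the iteration count are assumptions for illustration; the abstract does not give the paper's normalisation, and the iteration only converges when `alpha` is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def neighbour_similarity(adj, alpha=0.1, iterations=50):
    """Iterate S <- alpha * A @ S + I for an adjacency matrix given as nested lists."""
    n = len(adj)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    s = [row[:] for row in identity]
    for _ in range(iterations):
        s = [[alpha * sum(adj[i][v] * s[v][j] for v in range(n)) + identity[i][j]
              for j in range(n)]
             for i in range(n)]
    return s

# Path graph 0 - 1 - 2 - 3 (illustrative input).
path_graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
similarity = neighbour_similarity(path_graph)
for row in similarity:
    print([round(x, 4) for x in row])
```

On this small path graph, vertices that are closer together end up with higher similarity scores than the two end points.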
Article
Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
Article
We demonstrate that the self-similarity of some scale-free networks with respect to a simple degree-thresholding renormalization scheme finds a natural interpretation in the assumption that network nodes exist in hidden metric spaces. Clustering, i.e., cycles of length three, plays a crucial role in this framework as a topological reflection of the triangle inequality in the hidden geometry. We prove that a class of hidden variable models with underlying metric spaces are able to accurately reproduce the self-similarity properties that we measured in the real networks. Our findings indicate that hidden geometries underlying these real networks are a plausible explanation for their observed topologies and, in particular, for their self-similarity with respect to the degree-based renormalization.
Article
Introduction The problem of searching for information in networks like the World Wide Web can be approached in a variety of ways, ranging from centralized indexing schemes to decentralized mechanisms that navigate the underlying network without knowledge of its global structure. The decentralized approach appears in a variety of settings: in the behavior of users browsing the Web by following hyperlinks; in the design of focused crawlers [4, 5, 8] and other agents that explore the Web's links to gather information; and in the search protocols underlying decentralized peer-to-peer systems such as Gnutella [10], Freenet [7], and recent research prototypes [21, 22, 23], through which users can share resources without a central server. In recent work, we have been investigating the problem of decentralized search in large information networks [14, 15]. Our initial motivation was an experiment that dealt directly with the search problem in a decidedly pre-Internet context: Stanley Milgram
Article
Long a matter of folklore, the "small-world phenomenon" --- the principle that we are all linked by short chains of acquaintances --- was inaugurated as an area of experimental study in the social sciences through the pioneering work of Stanley Milgram in the 1960's. This work was among the first to make the phenomenon quantitative, allowing people to speak of the "six degrees of separation" between any two people in the United States. Since then, a number of network models have been proposed as frameworks in which to study the problem analytically. One of the most refined of these models was formulated in recent work of Watts and Strogatz; their framework provided compelling evidence that the small-world phenomenon is pervasive in a range of networks arising in nature and technology, and a fundamental ingredient in the evolution of the World Wide Web. But existing models are insufficient to explain the striking algorithmic component of Milgram's original findings: that individuals using local information are collectively very effective at actually constructing short paths between two points in a social network. Although recently proposed network models are rich in short paths, we prove that no decentralized algorithm, operating with local information only, can construct short paths in these networks with non-negligible probability. We then define an infinite family of network models that naturally generalizes the Watts-Strogatz model, and show that for one of these models, there is a decentralized algorithm capable of finding short paths with high probability. More generally, we provide a strong characterization of this family of network models, showing that there is in fact a unique model within the family for which decentralized algorithms are effect...
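A decentralized algorithm of the kind this abstract refers to can be sketched as greedy routing: each node knows only its own links and forwards the message to whichever neighbour is closest to the target. The grid layout, the single long-range link per node drawn with probability proportional to d^(-r), and all function names below are assumptions for illustration, not details given in the abstract.

```python
import random

def manhattan(a, b):
    """Lattice distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def build_long_links(n, r, rng):
    """Give every node of an n x n grid one long-range contact, chosen with
    probability proportional to distance ** -r (assumed model)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = {}
    for u in nodes:
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** -r for v in others]
        links[u] = rng.choices(others, weights=weights, k=1)[0]
    return links

def greedy_route(source, target, n, links):
    """Forward step by step to the known contact closest to the target."""
    path = [source]
    current = source
    while current != target:
        x, y = current
        contacts = [(x + dx, y + dy)
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= x + dx < n and 0 <= y + dy < n]
        contacts.append(links[current])
        # Some grid neighbour is always one step closer, so this terminates.
        current = min(contacts, key=lambda v: manhattan(v, target))
        path.append(current)
    return path

links = build_long_links(8, 2, random.Random(0))
route = greedy_route((0, 0), (7, 7), 8, links)
print(len(route) - 1, "hops")
```

Because only local information is used at each hop, the hop count depends on how well the long-range links align with the lattice distance, which is the property the exponent `r` controls in this sketch.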
Article
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented - a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
BOGUÑÁ, M., PAPADOPOULOS, F., AND KRIOUKOV, D. 2010. Sustaining the Internet with hyperbolic mapping. Nature Comm. 1, 62.
BRANK, J., MLADENIC, D., AND GROBELNIK, M. 2006. Gold standard based ontology evaluation using instance assignment. In Proceedings of the 4th Workshop on Evaluating Ontologies for the Web (EON '06).