Web Data Extraction, Applications and Techniques: A Survey

Knowledge-Based Systems (Impact Factor: 4.1). 07/2012; DOI: 10.1016/j.knosys.2014.07.007
Source: arXiv

ABSTRACT Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of application domains. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc application domains. Other approaches, instead,
heavily reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
research efforts made in the field of Web Data Extraction. The common thread of our
work is a classification of existing approaches in terms of the
applications for which they have been employed. This differentiates our work
from other surveys, which classify existing approaches on the basis of the
algorithms, techniques and tools they use.
We classified Web Data Extraction approaches into categories and, for each
category, we illustrated the basic techniques along with their main variants.
We grouped existing applications into two main areas: applications at the
Enterprise level and at the Social Web level. This classification rests on
two reasons: on the one hand, Web Data Extraction techniques emerged as a key
tool to perform data analysis in Business and Competitive Intelligence systems
as well as for business process re-engineering. On the other hand, Web Data
Extraction techniques allow for gathering a large amount of structured data
continuously generated and disseminated by Web 2.0, Social Media and Online
Social Network users, and this offers unprecedented opportunities to analyze
human behavior on a large scale.
We also discussed the potential of cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques originally designed to
work in a given domain, in other domains.

  • ABSTRACT: The participation of experts and other external contributors is a common requirement in the design of educational scenarios for the school of the future. We can find many repositories of learning objects, but it is not so common to find directories containing people who are willing to participate in an educational activity. Much less common is to find information about these people to determine their suitability from an educational perspective. This paper describes a proposal for the automatic enrichment of existing information in a directory of contributors to educational activities. Through this enrichment process, it is possible to enhance the amount of information available to a recommender system to identify the most appropriate people to participate in a particular activity, and to reduce the need for human intervention when selecting individuals to contribute to educational activities.
    Management Intelligent Systems, Edited by Casillas, Jorge and Martínez-López, Francisco J. and Vicari, Rosa and De la Prieta, Fernando, 01/2013: pages 83-90; Springer International Publishing., ISBN: 9783319005683
  • ABSTRACT: Online Social Networks (OSNs) are a unique Web and social phenomenon, affecting tastes and behaviors of their users and helping them to maintain/create friendships. It is interesting to analyze the growth and evolution of Online Social Networks both from the point of view of marketing and offer of new services and from a scientific viewpoint, since their structure and evolution may share similarities with real-life social networks. In social sciences, several techniques for analyzing (offline) social networks have been developed, to evaluate quantitative properties (e.g., defining metrics and measures of structural characteristics of the networks) or qualitative aspects (e.g., studying the attachment model for the network evolution, the binary trust relationships, and the link prediction problem). However, OSN analysis poses novel challenges both to Computer and Social scientists. We present our long-term research effort in analyzing Facebook, the largest and arguably most successful OSN today: it gathers more than 500 million users. Access to data about Facebook users and their friendship relations, is restricted; thus, we acquired the necessary information directly from the front-end of the Web site, in order to reconstruct a sub-graph representing anonymous interconnections among a significant subset of users. We describe our ad-hoc, privacy-compliant crawler for Facebook data extraction. To minimize bias, we adopt two different graph mining techniques: breadth-first search (BFS) and rejection sampling. To analyze the structural properties of samples consisting of millions of nodes, we developed a specific tool for analyzing quantitative and qualitative properties of social networks, adopting and improving existing Social Network Analysis (SNA) techniques and algorithms.
    Computational Social Networks: Mining and Visualization, 01/2012; Springer Verlag.
  • ABSTRACT: Understanding social dynamics that govern human phenomena, such as communications and social relationships, is a major problem in current computational social sciences. In particular, given the unprecedented success of online social networks (OSNs), in this paper we are concerned with the analysis of aggregation patterns and social dynamics occurring among users of the largest OSN to date: Facebook. In detail, we discuss the mesoscopic features of the community structure of this network, considering the perspective of the communities, which has not yet been studied on such a large scale. To this purpose, we acquired a sample of this network containing millions of users and their social relationships; then, we unveiled the communities representing the aggregation units among which users gather and interact; finally, we analyzed the statistical features of such a network of communities, discovering and characterizing some specific organization patterns followed by individuals interacting in online social networks, that emerge considering different sampling techniques and clustering methodologies. This study provides some clues of the tendency of individuals to establish social interactions in online social networks that eventually contribute to building a well-connected social structure, and opens space for further social studies.
    EPJ Data Science. 11/2012; 1(9):1-30.
