Conference Paper

Classifying Documents According to Locational Relevance.

DOI: 10.1007/978-3-642-04686-5_49
Conference: Progress in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence, EPIA 2009, Aveiro, Portugal, October 12-15, 2009. Proceedings
Source: DBLP


This paper presents an approach for categorizing documents according to their implicit locational relevance. We report a thorough
evaluation of several classifiers designed for this task, built by using support vector machines with multiple alternatives
for feature vectors. Experimental results show that feature vectors combining document terms and URL n-grams with
simple features related to the locality of the document (e.g., total count of place references) lead to high accuracy.
The paper also discusses how the proposed categorization approach can be used to help improve tasks such as document retrieval
or online contextual advertisement.
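
As a rough illustration of the kind of combined feature vector the abstract describes, the sketch below builds a sparse feature dictionary from document terms, URL character n-grams, and a single locality feature (a count of place references). It is not the authors' implementation: the toy gazetteer, the function names, the feature-name prefixes, and the choice of character trigrams are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer (an assumption for this sketch); a real system
# would rely on a proper place-name resource.
GAZETTEER = {"aveiro", "lisbon", "porto", "portugal"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def url_ngrams(url, n=3):
    """Return all character n-grams of the lowercased URL."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_features(text, url, n=3):
    """Combine term counts, URL n-gram counts, and a place-reference count."""
    features = Counter()
    tokens = tokenize(text)
    for tok in tokens:
        features["term=" + tok] += 1
    for gram in url_ngrams(url, n):
        features["url=" + gram] += 1
    # Simple locality feature: number of tokens that match the gazetteer.
    features["place_ref_count"] = sum(1 for tok in tokens if tok in GAZETTEER)
    return dict(features)


feats = build_features("Restaurants in Aveiro, Portugal", "http://aveiro.pt")
```

A dictionary of this kind would then be converted to a numeric vector and passed to a support vector machine, the classifier family named in the abstract; that step is omitted here to keep the sketch dependency-free.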

    • "Second, URLs can contain important hints about the document content. In this context, related work on URL analytics on the Web shows that URLs can provide accurate estimates of the document language [3], location relevance [2] and topic classification [9]. "
    ABSTRACT: Long-term Web archives comprise Web documents gathered over longer time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, the annotation of the entire archive content on this scale is often infeasible. The most efficient way to access the documents within Web archives is provided through their URLs, which are typically stored in dedicated indexes. The URLs of the archived Web documents can contain semantic information and can offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of the named entity extraction from the URLs in the Popular German Web dataset and analyse the proportion of the archived URLs from 1,444 popular domains in the time interval from 2000 to 2012 to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provide a good starting point to efficiently annotate large scale collections of Web documents.
    Full-text · Conference Paper · Sep 2015
    • "They analyse the textual content, extract spatial references, and generate a graph on which they apply the PageRank algorithm to assign the given web page to a geographic location. In another study, Anastacio and coworkers classify the context of a given web page as local or global, based on the textual content, locational references and URLs occurring in the page [4]. In [19], Pan and Mitra treat the textual content, and spatial and temporal features of a news article as first class objects, and utilize them all for event detection. "
    ABSTRACT: Event detection from microblogs and social networks, especially from Twitter, is an active and rich research topic. By grouping similar tweets in clusters, people can extract events and follow the happenings in a community. In this work, we focus on estimating the geographical locations of events that are detected in Twitter. An important novelty of our work is the application of evidential reasoning techniques, namely the Dempster-Shafer Theory (DST), for this problem. By utilizing several features of tweets, we aim to produce belief intervals for a set of possible discrete locations. DST helps us deal with uncertainties, assign belief values to subsets of solutions, and combine pieces of evidence obtained from different tweet features. The initial results on several real cases suggest the applicability and usefulness of DST for the problem.
    No preview · Conference Paper · Nov 2013