Conference Paper

Classifying Documents According to Locational Relevance.

DOI: 10.1007/978-3-642-04686-5_49
Conference: Progress in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence, EPIA 2009, Aveiro, Portugal, October 12-15, 2009. Proceedings
Source: DBLP

ABSTRACT This paper presents an approach for categorizing documents according to their implicit locational relevance. We report a thorough
evaluation of several classifiers designed for this task, built using support vector machines with multiple alternatives
for the feature vectors. Experimental results show that feature vectors combining document terms and URL n-grams with
simple features related to the locality of the document (e.g., the total count of place references) lead to high accuracy.
The paper also discusses how the proposed categorization approach can help improve tasks such as document retrieval
and online contextual advertising.
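
The feature combination described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy `GAZETTEER`, the choice of character-level trigrams for the URL, the feature-name prefixes, and the function names. It only shows how term counts, URL n-gram counts, and a place-reference count might be merged into one sparse feature vector, which could then be passed to a support vector machine.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer; a real system would rely on a far larger resource.
GAZETTEER = {"lisbon", "porto", "aveiro", "portugal"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def url_ngrams(url, n=3):
    """Return all character n-grams of the lowercased URL (assumed n-gram type)."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_features(text, url, n=3):
    """Merge term counts, URL n-gram counts, and a place-reference count."""
    tokens = tokenize(text)
    features = Counter()
    for token in tokens:
        features["term=" + token] += 1
    for gram in url_ngrams(url, n):
        features["url=" + gram] += 1
    # Simple locality feature: number of tokens that match the gazetteer.
    features["place_ref_count"] = sum(1 for t in tokens if t in GAZETTEER)
    return dict(features)


example = build_features("Restaurants in Lisbon and Porto", "http://lisbon-eats.pt")
```

Prefixing the feature names keeps document terms and URL n-grams in separate parts of the feature space, so a term such as "lis" would not collide with the URL trigram "lis".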

  • Source
    ABSTRACT: Geotargeting is a specialization of contextual advertising where the objective is to target ads to Website visitors concentrated in well-defined areas. Current approaches involve targeting ads based on the physical location of the visitors, estimated through their IP addresses. However, there are many situations where it would be more interesting to target ads based on the geographic scope of the target pages, i.e., on the general area implied by the locations mentioned in the textual contents of the pages. Our proposal applies techniques from the area of geographic information retrieval to the problem of geotargeting. We address the task through a pipeline of processing stages, which involves (i) determining the geographic scope of target pages, (ii) classifying target pages according to locational relevance, and (iii) retrieving ads relevant to the target page, using both textual contents and geographic scopes. Experimental results attest to the adequacy of the proposed methods in each of the individual processing stages.
    Proceedings of the 6th Workshop on Geographic Information Retrieval, GIR 2010, Zurich, Switzerland, February 18-19, 2010; 02/2010
  • Source
    ABSTRACT: Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and several classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
    ACM Transactions on the Web 07/2011; 5:15. DOI:10.1145/1993053.1993057 · 1.60 Impact Factor

