Conference Paper

Classifying Documents According to Locational Relevance.

DOI: 10.1007/978-3-642-04686-5_49 Conference: Progress in Artificial Intelligence, 14th Portuguese Conference on Artificial Intelligence, EPIA 2009, Aveiro, Portugal, October 12-15, 2009. Proceedings
Source: DBLP

ABSTRACT This paper presents an approach for categorizing documents according to their implicit locational relevance. We report a thorough evaluation of several classifiers designed for this task, built using support vector machines with multiple alternatives for the feature vectors. Experimental results show that feature vectors combining document terms and URL n-grams with simple features related to the locality of the document (e.g., the total count of place references) lead to high accuracy. The paper also discusses how the proposed categorization approach can help improve tasks such as document retrieval and online contextual advertising.
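The feature-vector combination described in the abstract — document terms, URL n-grams, and a simple locality signal such as the total count of place references — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact feature encodings and the SVM training step are not reproduced here, and the document, URL, and feature names are hypothetical.

```python
from collections import Counter

def url_ngrams(url, n=3):
    """Overlapping character n-grams extracted from a URL string."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_features(doc_terms, url, place_ref_count):
    """Combine document terms, URL character n-grams, and a simple
    locality signal (total count of place references) into one sparse
    feature vector, in the spirit of the abstract above."""
    feats = Counter()
    for term in doc_terms:
        feats["term=" + term.lower()] += 1
    for gram in url_ngrams(url):
        feats["url3=" + gram] += 1
    feats["num_place_refs"] = place_ref_count  # locality feature
    return dict(feats)

# Toy example (hypothetical document terms and URL):
fv = build_features(["hotels", "in", "Aveiro"], "http://visit-aveiro.example.pt/", 1)
```

A vector of this shape would then feed a linear classifier such as an SVM; the sparse string-keyed dictionary stands in for whatever vectorization the authors actually used.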

  • ABSTRACT: News sources around the world generate constant streams of information, but effective streaming news retrieval requires an intimate understanding of the geographic content of news. This process of understanding, known as geotagging, consists of first finding words in article text that correspond to location names (toponyms), and second, assigning each toponym its correct lat/long values. The latter step, called toponym resolution, can also be considered a classification problem, where each of the possible interpretations of each toponym is classified as correct or incorrect. Hence, techniques from supervised machine learning can be applied to improve accuracy. New classification features to improve toponym resolution, termed adaptive context features, are introduced that consider a window of context around each toponym and use geographic attributes of the toponyms in the window to aid in their correct resolution. Adaptive parameters controlling the window's breadth and depth afford flexibility in managing a trade-off between feature computation speed and resolution accuracy, allowing the features to potentially apply to a variety of textual domains. Extensive experiments with three large datasets of streaming news demonstrate the new features' effectiveness over two widely used competing methods.
  • ABSTRACT: Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied for text classification, as well as state-of-the-art algorithms for language identification of text. As features we used words, various-sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country-code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best-performing classifiers. We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers have high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
    ACM Transactions on the Web 03/2013; 7(1). DOI:10.1145/2435215.2435218
  • ABSTRACT: Event detection from microblogs and social networks, especially from Twitter, is an active and rich research topic. By grouping similar tweets into clusters, people can extract events and follow the happenings in a community. In this work, we focus on estimating the geographical locations of events that are detected in Twitter. An important novelty of our work is the application of evidential reasoning techniques, namely the Dempster-Shafer Theory (DST), to this problem. By utilizing several features of tweets, we aim to produce belief intervals for a set of possible discrete locations. DST helps us deal with uncertainties, assign belief values to subsets of solutions, and combine pieces of evidence obtained from different tweet features. The initial results on several real cases suggest the applicability and usefulness of DST for the problem.
    Proceedings of the 7th Workshop on Geographic Information Retrieval; 11/2013
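The adaptive context features for toponym resolution described in the first related abstract above can be illustrated with a small sketch. The specific feature below (minimum distance from a candidate interpretation to the interpretations of nearby toponyms, with adjustable window breadth and depth) is an assumption for illustration; the paper's exact feature definitions may differ.

```python
import math

def geo_dist(a, b):
    """Great-circle (haversine) distance in km between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def window_feature(toponyms, i, candidate, breadth=2, depth=2):
    """Illustrative adaptive-context-style feature: the minimum distance
    from `candidate` (one interpretation of toponym i) to any of the top
    `depth` interpretations of the `breadth` toponyms on either side.
    `breadth` and `depth` mirror the adjustable window parameters
    described in the abstract."""
    best = float("inf")
    lo, hi = max(0, i - breadth), min(len(toponyms), i + breadth + 1)
    for j in range(lo, hi):
        if j == i:
            continue  # skip the toponym being resolved
        for interp in toponyms[j][:depth]:
            best = min(best, geo_dist(candidate, interp))
    return best
```

Shrinking `breadth` and `depth` makes the feature cheaper to compute at the cost of context, which is the speed/accuracy trade-off the abstract mentions.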
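The URL-based language classification idea from the second related abstract — classifying a page's language from character n-grams of its URL alone — can be sketched as a toy profile-matching classifier. This is not the article's method (the authors evaluate several machine learning algorithms); the histogram-intersection scoring and the example URLs below are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(url, n=3):
    """Overlapping character n-grams from a URL string."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def train_profiles(labeled_urls, n=3):
    """Build a per-language n-gram frequency profile from (url, lang)
    pairs -- a toy stand-in for the trained classifiers in the article."""
    profiles = {}
    for url, lang in labeled_urls:
        profiles.setdefault(lang, Counter()).update(char_ngrams(url, n))
    return profiles

def classify(url, profiles, n=3):
    """Assign the language whose profile shares the most n-gram mass
    with the URL (histogram intersection)."""
    grams = Counter(char_ngrams(url, n))
    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))
```

Note how both the country-code TLD (".de", ".fr") and language-specific words in the path contribute n-grams, so the learned classifier subsumes the TLD baseline mentioned in the abstract.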
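The evidential reasoning machinery from the third related abstract rests on Dempster's rule of combination, which fuses mass functions derived from different tweet features. A minimal sketch over discrete location hypotheses (the location names and mass values are illustrative; the workshop paper's actual features and frame of discernment are not reproduced here):

```python
def combine(m1, m2):
    """Dempster's rule of combination: fuse two mass functions defined
    over frozensets of candidate locations, normalizing away the
    conflicting (empty-intersection) mass."""
    fused, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                fused[inter] = fused.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # incompatible evidence
    k = 1.0 - conflict  # normalization constant
    return {s: v / k for s, v in fused.items()}

def belief(m, hypothesis):
    """Belief = total mass committed to subsets of the hypothesis;
    together with plausibility it bounds the belief interval."""
    return sum(v for s, v in m.items() if s <= hypothesis)
```

Assigning mass to a set like {"NYC", "LA"} rather than a single city is what lets DST express the uncertainty over subsets of solutions that the abstract highlights.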

