Classifying Documents According to Locational Relevance
Ivo Anastácio, Bruno Martins, and Pável Calado
Instituto Superior Técnico, INESC-ID,
Av. Professor Cavaco Silva, 2744-016 Porto Salvo, Portugal
Abstract. This paper presents an approach for categorizing documents
according to their implicit locational relevance. We report a thorough
evaluation of several classifiers designed for this task, built by using
support vector machines with multiple alternatives for feature vectors.
Experimental results show that feature vectors combining document terms and
URL n-grams with simple features related to the locality of the document (e.g.,
the total count of place references) lead to high accuracy values. The paper
also discusses how the proposed categorization
approach can be used to help improve tasks such as document retrieval
or online contextual advertisement.
Key words: Document Classification, Geographic Text Mining
1 Introduction
Automated document classification is a well studied problem, with many
applications in text mining and information retrieval. A recent trend in text
mining applications relates to extracting geographic context information from
documents. It has been noted that the combination of techniques from text
mining and geographic information systems can provide the means to integrate
geographic data and services, such as topographic maps and street directories,
with the implicit geographic information available in Web documents [2,6,9].
In this work, we propose that textual documents can be characterized accord-
ing to their implicit locational relevance. For example, a document on the subject
of computer programming can be considered global, as it is likely to be of interest
to a geographically broad audience. In contrast, a document listing pharmacies
or take-away restaurants in a specific city can be regarded as local, i.e., likely
to be of interest only to an audience in a relatively narrow region. Somewhere
in between is a document describing tourist attractions in a specific city, likely
to be of interest to both the inhabitants of that city and to potential visitors
from other parts of the world. In the context of this work, locational relevance is,
therefore, a score that reflects the probability of a given document being either
global (i.e., users interested in the document are likely to have broad geographic
interests) or local (i.e., users interested in the document are likely to have a
single narrow geographic interest). This score can be produced from the confidence
estimates assigned by a binary classifier such as a Support Vector Machine.
This work was partially supported by the FCT (Portugal), through project grant
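As a minimal sketch of how a locational relevance score could be derived from classifier confidence, the following Python fragment maps the signed output of a linear binary classifier to a value in (0, 1) with a logistic function, in the spirit of Platt scaling. The weights, bias, feature names, and the scaling parameters a and b are hypothetical placeholders chosen for illustration; they are not values from our experiments.

```python
import math

def decision_value(weights, bias, features):
    """Signed output of a linear classifier (w . x + b)."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items()) + bias

def locational_relevance(margin, a=1.0, b=0.0):
    """Map a classifier margin to a score in (0, 1); higher means more local.

    The slope a and offset b are hypothetical; in practice they would be
    fitted on held-out data.
    """
    return 1.0 / (1.0 + math.exp(-(a * margin + b)))

# Hypothetical weights for three illustrative features.
weights = {"place_refs": 0.8, "term:restaurant": 1.2, "term:programming": -1.5}
bias = -0.5

local_doc = {"place_refs": 3.0, "term:restaurant": 1.0}
global_doc = {"term:programming": 2.0}

score_local = locational_relevance(decision_value(weights, bias, local_doc))
score_global = locational_relevance(decision_value(weights, bias, global_doc))
```

Under these assumed weights, the document with several place references scores above 0.5 (local) and the programming document scores below 0.5 (global).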
Previous research has addressed the problem of automatically computing ge-
ographic scopes of Web documents [1,2]. Techniques have also been proposed for
detecting locationally relevant search engine queries [3,4]. However, to the best
of our knowledge, no description has ever been published on techniques for clas-
sifying documents according to locational relevance (i.e., classifying documents
as either local or global). This is a significantly different problem from that of
assigning documents to geographic scopes, since two documents can have the
same scope but different locational relevances. For instance, the Web page of a
research group in Lisbon and the Web page of a local restaurant in Lisbon have
the same geographic scope. Nonetheless, people visiting the restaurant’s page are
most probably taking the location into consideration, while people visiting the
research group’s page are most probably interested in its studies, regardless of
where the group is physically located.
To solve this problem, we propose an approach for categorizing documents
according to their implicit locational relevance, using state-of-the-art machine
learning techniques. We report a thorough evaluation of several classifiers, built
using support vector machines, and explore many alternative features for
representing documents. We also discuss how our classifier can be used
to help improve tasks such as document retrieval or online advertisement.
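To make the kind of document representation discussed here more concrete, the sketch below builds a sparse feature vector that combines term counts, URL character n-grams, and a simple locality feature (the total count of place references). The tokenization, the n-gram length, the feature name prefixes, and the toy gazetteer lookup are all assumptions made for illustration; a real system would likely rely on a proper place reference recognizer.

```python
import re
from collections import Counter

def url_ngrams(url, n=4):
    """Character n-grams of a lower-cased URL, with the scheme removed."""
    text = re.sub(r"^[a-z]+://", "", url.lower())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_features(text, url, gazetteer, n=4):
    """Combine term counts, URL n-gram counts, and a place-reference count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    features = Counter()
    for token in tokens:
        features["term:" + token] += 1
    for gram in url_ngrams(url, n):
        features["url:" + gram] += 1
    # Simple locality feature: total number of tokens found in the gazetteer.
    features["place_refs"] = sum(1 for token in tokens if token in gazetteer)
    return features

gazetteer = {"lisbon", "porto"}  # hypothetical toy gazetteer
features = build_features(
    "Pharmacies in Lisbon open tonight near Lisbon centre",
    "http://www.example.pt/lisbon/pharmacies",
    gazetteer,
)
```

The resulting dictionary can be handed to any learner that accepts sparse feature vectors; keeping the three feature families under distinct prefixes makes it easy to evaluate them separately or in combination.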
The rest of the paper is organized as follows: Section 2 presents related work.
Section 3 describes our classification approach, detailing the proposed features.
Section 4 presents and discusses the experimental validation, also describing
applications for locational relevance classifiers. Finally, Section 5 presents our
conclusions and directions for future work.
2 Related Work
Traditional Information Retrieval and Machine Learning research has studied
how to classify documents according to predefined categories [10,11]. The sub-
area of Geographic Information Retrieval has addressed issues related to the
exploitation of geographic context information mined from textual documents.
In this section, we survey relevant past research on these topics.
2.1 Document Classification
Document classification is the task of assigning documents to topic classes, on the
basis of whether or not they share some features. This is one of the main problems
studied in fields such as text mining, information retrieval, or machine learning,
with many approaches described in the literature [10,11]. Some methods suitable
for learning document classifiers include decision trees, logistic regression
the area of overlap between the area corresponding to the Internet address of
the server hosting the document and the geographic scope of the document, as
an additional feature for classifying documents as either local or global.
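The overlap feature mentioned above could be approximated as follows. This is only a sketch under simplifying assumptions: both regions are represented as axis-aligned bounding boxes in degrees, areas are computed on the plane rather than on the sphere, and the normalisation by the area of the document scope is our own illustrative choice. The box coordinates are hypothetical.

```python
def overlap_area(box_a, box_b):
    """Area shared by two axis-aligned boxes (min_x, min_y, max_x, max_y)."""
    width = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    height = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    return float(width * height)

def overlap_ratio(host_box, scope_box):
    """Overlap normalised by the area of the document's geographic scope."""
    scope_area = (scope_box[2] - scope_box[0]) * (scope_box[3] - scope_box[1])
    if scope_area <= 0:
        return 0.0
    return overlap_area(host_box, scope_box) / scope_area

# Hypothetical boxes as (min longitude, min latitude, max longitude, max latitude).
host_box = (-9.5, 38.5, -9.0, 39.0)       # region of the hosting server
scope_box = (-9.25, 38.75, -8.75, 39.25)  # geographic scope of the document
ratio = overlap_ratio(host_box, scope_box)
```

A ratio close to 1 would indicate that the server is hosted inside the region the document is about, which may hint at a local document; a ratio of 0 means the two regions are disjoint.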
Moreover, some of the characteristics that make a document either local or
global may not be directly observable in the document itself, but rather in other
contextual information related to the document. Previous research on Web document
classification has shown that better performance can be achieved through
combinations of content-based features with additional features derived from the
neighboring documents in the link structure of the Web graph [21,22]. Previous
experiments dealing with geographic context information have already explored
similar ideas. For instance, Gravano et al., in classifying search engine
queries as either local or global, used a sample of the search results returned
for a given query rather than the words of the query itself. For assigning
geographic scopes to Web documents, Ding et al. proposed using the distributional
characteristics of the locations associated with HTML in-links. It would be
interesting to integrate, into our feature vectors, information about the
distributional characteristics of locations in related documents, with relatedness
defined by either textual similarity or linkage information.
Our ongoing work addresses these ideas, aiming at the application
of locational relevance classifiers in geographical IR and contextual advertising.
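One possible way of summarising the distribution of locations in related documents is sketched below. The two measures used here, the share of the most frequent location and the entropy of the location distribution, are hypothetical choices made for illustration and are not the measures proposed by Ding et al. Each related document is assumed to be already reduced to the list of locations it mentions.

```python
import math
from collections import Counter

def location_distribution(related_docs):
    """Relative frequency of each location across a set of related documents."""
    counts = Counter(loc for doc in related_docs for loc in doc)
    total = sum(counts.values())
    return {loc: c / total for loc, c in counts.items()} if total else {}

def distribution_features(related_docs):
    """Summary features: share of the dominant location and entropy in bits."""
    dist = location_distribution(related_docs)
    if not dist:
        return {"dominant_share": 0.0, "entropy": 0.0}
    entropy = -sum(p * math.log2(p) for p in dist.values())
    return {"dominant_share": max(dist.values()), "entropy": entropy}

# Hypothetical neighbourhoods: locations mentioned by each related document.
concentrated = [["Lisbon"], ["Lisbon"], ["Lisbon", "Lisbon"]]
dispersed = [["Lisbon"], ["Tokyo"], ["Boston"], ["Cairo"]]

concentrated_features = distribution_features(concentrated)
dispersed_features = distribution_features(dispersed)
```

Intuitively, a neighbourhood concentrated on a single place (high dominant share, low entropy) might suggest a local document, while a dispersed neighbourhood might suggest a global one; whether this holds in practice is exactly what the proposed experiments would need to verify.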
References
1. Ding, J., Gravano, L., and Shivakumar, N. (2000) Computing Geographical Scopes
of Web Resources. In Proceedings of the 26th International Conference on Very
Large Data Bases, 545-556.
2. Amitay, E., Har’El, N., Sivan, R., and Soffer, A. (2004) Web-a-where: geotagging
web content. In Proceedings of the 27th International ACM SIGIR Conference on
Research and Development in Information Retrieval, 273-280.
3. Gravano, L., Hatzivassiloglou, V., and Lichtenstein, R. (2003) Categorizing web
queries according to geographical locality. In Proceedings of the 12th International
Conference on Information and Knowledge Management, 325-333.
4. Zhuang, Z., Brunk, C., and Giles, C. L. (2008) Modeling and visualizing geo-sensitive
queries based on user clicks. In Proceedings of the 1st International Workshop on
Location and the Web, 73-76.
5. Witten, I. H., and Frank, E. (2000) Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.
6. Woodruff, A. G. and Plaunt, C. (1994) GIPSY: Automated geographic indexing
of text documents. Journal of the American Society for Information Science 45, 9,
7. Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector
Machines and other kernel-based learning methods. Cambridge University Press.
8. Johansson, M. and Harrie, L. (2002) Using Java Topology Suite for real-time data
generalisation and integration. Proceedings of the 2002 workshop of the Interna-
tional Society for Photogrammetry and Remote Sensing.
9. Leidner, J. L. (2008). Toponym Resolution: a Comparison and Taxonomy of Heuris-
tics and Methods.
10. Yang, Y. (1999) An Evaluation of Statistical Approaches to Text Categorization.
Information Retrieval 1, 1-2, 69-90.
11. Sebastiani, F. (2002) Machine learning in automated text categorization. ACM
Computing Surveys 34, 1, 1-47.
12. Joachims, T. (1998) Text Categorization with Support Vector Machines: Learning
with Many Relevant Features. In Proceedings of the 10th European Conference on
Machine Learning, 137-142.
13. Forman, G. (2003) An extensive empirical study of feature selection metrics for
text classification. Journal of Machine Learning Research. 3, 1289-1305.
14. Apté, C., Damerau, F., and Weiss, S. M. (1994) Automated learning of decision
rules for text categorization. ACM Transactions on Information Systems 12 (3),
15. Genkin, A., Lewis, D. D. and Madigan, D. (2004) Large-Scale Bayesian Logistic
Regression for Text Categorization. Rutgers University Technical Report.
16. Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007)
Feature selection methods for text classification. In Proceedings of the 13th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining, 230-239.
17. Sang, E. T. K. and De Meulder, F. (2003) Introduction to the CoNLL-2003 shared
task: Language-Independent Named Entity Recognition. Proceedings of the 7th
Conference on Natural Language Learning, 142-147.
18. Kornai, A. (2003) Proceedings of the HLT-NAACL 2003 workshop on the analysis
of geographic references.
19. Garbin, E. and Mani, I. (2005) Disambiguating toponyms in news. In Proceed-
ings of the Conference on Human Language Technology and Empirical Methods in
Natural Language Processing, 363-370.
20. Rauch, E., Bukatin, M., and Baker, K. (2003) A confidence-based framework for
disambiguating geographic terms. In Proceedings of the HLT-NAACL 2003 Work-
shop on Analysis of Geographic References, 50-54.
21. Chakrabarti, S., Dom, B., and Indyk, P. (1998) Enhanced hypertext categorization
using hyperlinks. In Proceedings of the 1998 ACM SIGMOD International Conference
on Management of Data, 307-318.
22. Qi, X. and Davison, B. D. (2006) Knowing a web page by the company it keeps. In
Proceedings of the 15th ACM International Conference on Information and
Knowledge Management, 228-237.
23. Baykan, E., Henzinger, M., Marian, L. and Weber, I. (2009) Purely URL-based
Topic Classification. In Proceedings of the 18th International World Wide Web
Conference, Alternate Track Papers and Posters, 1109-1109.
24. Baykan, E., Henzinger, M., and Weber, I. (2008) Web page language identification
based on URLs. In Proceedings of the VLDB Endowment, 1 (1), 176-187.
25. Jones, R., Zhang, W. V., Rey, B., Jhala, P., and Stipp, E. (2009) Geographic
intention and modification in web search. International Journal of Geographical
Information Science, 22 (3), 229-246.
26. Yu, B. and Cai, G. (2007) A query-aware document ranking method for geographic
information retrieval. In Proceedings of the 4th ACM Workshop on Geographical
Information Retrieval, 49-54.
27. Cai, G. (2002) GeoVSM: An Integrated Retrieval Model for Geographic
Information, GIScience, 65-79.
28. Anastácio, I., Martins, B., and Calado, P. (2009) A Comparison of Different
Approaches for Assigning Geographic Scopes to Documents. In Proceedings of the 1st
INForum - Simpósio de Informática.