Classifying Documents According to Locational Relevance
Ivo Anastácio, Bruno Martins, and Pável Calado
Instituto Superior Técnico, INESC-ID,
Av. Professor Cavaco Silva, 2744-016 Porto Salvo, Portugal
Abstract. This paper presents an approach for categorizing documents
according to their implicit locational relevance. We report a thorough
evaluation of several classifiers designed for this task, built by using
support vector machines with multiple alternatives for feature vectors.
Experimental results show that feature vectors combining document
terms and URL n-grams with simple features related to the locality
of the document (e.g., total count of place references) lead to high
accuracy values. The paper also discusses how the proposed categorization
approach can be used to help improve tasks such as document retrieval
or online contextual advertisement.
Key words: Document Classification, Geographic Text Mining
1 Introduction
Automated document classification is a well-studied problem, with many applications
in text mining and information retrieval. A recent trend in text
mining applications relates to extracting geographic context information from
documents. It has been noted that the combination of techniques from text
mining and geographic information systems can provide the means to integrate
geographic data and services, such as topographic maps and street directories,
with the implicit geographic information available in Web documents [2,6,9].
In this work, we propose that textual documents can be characterized according
to their implicit locational relevance. For example, a document on the subject
of computer programming can be considered global, as it is likely to be of interest
to a geographically broad audience. In contrast, a document listing pharmacies
or take-away restaurants in a specific city can be regarded as local, i.e., likely
to be of interest only to an audience in a relatively narrow region. Somewhere
in between is a document describing tourist attractions in a specific city, likely
to be of interest to both the inhabitants of that city and to potential visitors
from other parts of the world. In the context of this work, locational relevance is,
therefore, a score that reflects the probability of a given document being either
This work was partially supported by the FCT (Portugal), through project grant
global (i.e., users interested in the document are likely to have broad geographic
interests) or local (i.e., users interested in the document are likely to have a sin-
gle narrow geographic interest). This score can be produced from the confidence
estimates assigned by a binary classifier such as a Support Vector Machine.
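As an illustration of how a classifier confidence estimate can be turned into such a score, the following minimal sketch maps a signed SVM decision value to a value in (0, 1) with a sigmoid, in the spirit of Platt scaling. This is an assumption made for illustration and not necessarily the calibration used in this work: the function name, the convention that positive decision values indicate the local class, and the parameter values a and b are all hypothetical (a and b would normally be fitted on held-out data).

```python
import math


def locational_relevance(decision_value, a=-1.0, b=0.0):
    """Map a signed SVM decision value to a probability-like score in (0, 1).

    Assumptions (not taken from the paper): positive decision values point
    to the 'local' class, and a, b are placeholder sigmoid parameters that
    would normally be fitted on held-out data (Platt scaling).
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Scores above 0.5 lean towards 'local', scores below 0.5 towards 'global'.
for value in (-2.0, 0.0, 2.0):
    print(value, round(locational_relevance(value), 3))
```

With the placeholder parameters, a document lying exactly on the decision boundary receives a score of 0.5, and the score grows monotonically with the distance to the boundary on the local side.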
Previous research has addressed the problem of automatically computing ge-
ographic scopes of Web documents [1,2]. Techniques have also been proposed for
detecting locationally relevant search engine queries [3,4]. However, to the best
of our knowledge, no description has ever been published on techniques for clas-
sifying documents according to locational relevance (i.e., classifying documents
as either local or global). This is a significantly different problem from that of
assigning documents to geographic scopes, since two documents can have the
same scope but different locational relevances. For instance, the Web page of a
research group in Lisbon and the Web page of a local restaurant in Lisbon have
the same geographic scope; nonetheless, people visiting the restaurant’s page are
most probably taking the location into consideration, while people visiting the
research group’s page are most probably interested in its studies, regardless of
where the group is physically located.
To solve this problem, we propose an approach for categorizing documents
according to their implicit locational relevance, using state-of-the-art machine
learning techniques. We report a thorough evaluation of several classifiers, built
using support vector machines, and explore many alternative features for representing
documents. We also discuss how our classifier can be used
to help improve tasks such as document retrieval or online advertisement.
The rest of the paper is organized as follows: Section 2 presents related work.
Section 3 describes our classification approach, detailing the proposed features.
Section 4 presents and discusses the experimental validation, also describing
applications for locational relevance classifiers. Finally, Section 5 presents our
conclusions and directions for future work.
2 Related work
Traditional Information Retrieval and Machine Learning research has studied
how to classify documents according to predefined categories [10,11]. The sub-
area of Geographic Information Retrieval has addressed issues related to the
exploitation of geographic context information mined from textual documents.
In this section, we survey relevant past research on these topics.
Document classification is the task of assigning documents to topic classes, on the
basis of whether or not they share some features. This is one of the main problems
studied in fields such as text mining, information retrieval, or machine learning,
with many approaches described in the literature [10,11]. Some methods suitable
for learning document classifiers include decision trees, logistic regression,
and support vector machines [7,12]. SVMs can be considered a state-of-the-art
method for binary classification, returning a confidence score for the assigned class.
Previous work has also suggested that, in text domains, due to the high
dimensionality of the feature space, effective feature selection can be used to
make the learning task more efficient and accurate. Empirical comparisons of
different feature selection methods have been made in the past [13,16], with
the results suggesting that either Chi-square or information gain statistics can
provide good results. In this work, we use feature selection methods in order to
examine the most discriminative features.
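To make the Chi-square criterion concrete, the sketch below computes the standard statistic from a 2x2 term/class contingency table and compares two terms. The function name and the document counts are hypothetical and chosen only for illustration; they are not results from this work.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    # A zero denominator means a row or column is empty: no evidence either way.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for the 'local' class over 100 documents.
toy_counts = {
    "restaurant": (30, 5, 20, 45),  # concentrated in local documents
    "the": (48, 49, 2, 1),          # spread evenly over both classes
}
ranked = sorted(toy_counts, key=lambda t: chi_square(*toy_counts[t]), reverse=True)
print(ranked)
```

Terms whose presence is independent of the class obtain a statistic close to zero, so ranking by this value surfaces the most discriminative features first.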
2.2 Mining geographical information from text documents
Previous research in the area of geographic information retrieval has addressed
problems such as the recognition and disambiguation of place references present
in text, and the assignment of documents to encompassing geographic scopes.
Leidner presented a variety of approaches for handling place references in
textual documents. The problem is usually seen as an extension of the named
entity recognition (NER) task, as proposed by the natural language processing
community [17,18]. Beyond recognizing mentions of places in text, which is
the subject of NER, the task also requires the place references to be disambiguated
into the corresponding locations on the surface of the Earth, i.e.,
assigning geospatial coordinates to the place references. Place reference disambiguation
usually relies on gazetteer matching, together with heuristics such
as default senses (i.e., disambiguation should be made to the most important
referent, based on population counts) or spatial minimality (i.e., disambiguation
should minimize the convex hull that contains all candidate referents) [9,19].
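The default sense heuristic can be sketched as follows. This is a minimal illustration under stated assumptions: the gazetteer is a hypothetical in-memory dictionary, and the coordinates and population figures are rough, illustrative values rather than data from an actual gazetteer. Spatial minimality is not covered by this sketch.

```python
from typing import NamedTuple


class Referent(NamedTuple):
    label: str
    lat: float
    lon: float
    population: int


# Hypothetical toy gazetteer; all figures are rough and purely illustrative.
TOY_GAZETTEER = {
    "lisbon": [
        Referent("Lisbon, Portugal", 38.72, -9.14, 500_000),
        Referent("Lisbon, Ohio, USA", 40.77, -80.77, 3_000),
    ],
}


def default_sense(place_name, gazetteer=TOY_GAZETTEER):
    """Resolve a place name to its most populous candidate referent."""
    candidates = gazetteer.get(place_name.lower(), [])
    if not candidates:
        return None  # the name is not in the gazetteer
    return max(candidates, key=lambda referent: referent.population)


print(default_sense("Lisbon"))
```

The returned referent carries the coordinates assigned to the place reference; names missing from the gazetteer are left unresolved.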
Metacarta is a commercial company that sells state-of-the-art geographic information
retrieval technology. The company also provides a freely available geotagger
Web service that can be used to recognize and disambiguate place references
in text. An early version of the Metacarta geotagger has been described by
Rauch et al. Yahoo! Placemaker is another free Web service which provides
recognition and disambiguation of place references in text. The complementary
Yahoo! GeoPlanet Web service is an example of an online gazetteer, returning
descriptions of places based on their names.
Anastácio et al. surveyed different approaches for assigning documents to
geographic scopes. In one of the pioneering works in the area of geographic
information retrieval, Woodruff and Plaunt proposed a technique based on
the place references discovered in the text. Their method disambiguates
the place references into the bounding polygons that correspond to the
geographic area of the referents. The geographic scope of the document is then
computed by overlapping the areas of all the polygons. More recently, Ding
et al. proposed specific techniques for extracting the geographical scope of web
pages. For example, the Diário de Coimbra online newspaper has a geographical
scope that consists of the city of Coimbra, while the Público newspaper has
a geographical scope that includes the entire territory of Portugal. To compute
the geographical scope of a web document, Ding et al. propose two complemen-
tary strategies: (1) a technique based on the geographical distribution of HTML
links to the page, and (2) a technique based on the distribution of geographical
references in the text of the page. Amitay et al. also proposed a technique for
assigning Web documents to the corresponding geographic scope, leveraging
part-of relations among the recognized place references (e.g., Lisbon and Porto
are both part of Portugal, and documents referring to both these places should
have Portugal as the scope). Looping over the disambiguated references, this
approach aggregates, for each document, the importance of the various levels
in a location hierarchy. The hierarchy levels are then sorted by importance and
results above a given threshold are returned as the geographic scope.
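The aggregation idea can be illustrated with the simplified sketch below. It is only loosely inspired by the description above and is not the published algorithm: the representation of a reference as a path in the hierarchy, the decay factor applied when propagating importance to ancestors, and the threshold value are all assumptions made for illustration.

```python
from collections import defaultdict


def geographic_scope(references, decay=0.7, threshold=1.0):
    """Aggregate importance over a part-of hierarchy and keep the top levels.

    Each reference is a path from the broadest to the most specific place,
    e.g. ("Portugal", "Lisbon"). The decay and threshold values are
    hypothetical parameters, not taken from the literature.
    """
    scores = defaultdict(float)
    for path in references:
        weight = 1.0
        # Walk from the most specific place up through its ancestors.
        for depth in range(len(path), 0, -1):
            scores[path[:depth]] += weight
            weight *= decay
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(node, score) for node, score in ranked if score > threshold]


references = [("Portugal", "Lisbon"), ("Portugal", "Porto")]
print(geographic_scope(references))
```

In this example, Lisbon and Porto each receive a score of 1.0, while Portugal accumulates a contribution from both references and is the only level above the threshold, matching the part-of intuition described in the text.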
Gravano et al. proposed a technique for classifying search engine queries as
either local or global, using the distributional characteristics of location names
occurring in the results produced by a search engine for the query. There are
many similarities between the work by Gravano et al. and the proposal of this
paper, but here we are concerned with classifying documents as global
or local, rather than classifying user queries.
3 Classifying documents according to locational relevance
Assigning documents to global and local classes, according to their implicit lo-
cational relevance, is a hard document classification problem. Instead of just
applying a standard classification approach, based on a bag-of-words represen-
tation of the documents, we argue that specific geographic features are also well
suited to reflect the locational characteristics of the documents.
Global documents often do not include any mentions of place names. Consider
the home page of the Weka software package. Users reading this document
are probably looking for tutorials about machine learning, and they are
not restricted in their interests to a specific geographic scope. Nevertheless, it
is interesting to note that global documents can sometimes include mentions of
place names. Consider a review of U2’s latest concert in
Lisbon. The location name is clearly distinguishable in the document, but the
readers may have completely different geographic interests.
Local documents are, on the other hand, more likely to contain mentions of
place names, particularly place names associated with small regions. Local documents
are also more likely to contain references to places that are restricted to a
somewhat confined area, whereas global documents can contain place references
to distinct places around the world. Examples of local documents include local
business listings or descriptions of local events.
The feature vectors used in the proposed classification scheme combine information
directly extracted from the full text of the documents, or from the document
URL, with higher-level geographic information, mined from the documents using
techniques from the area of geographic information retrieval. We group the
considered features into four classes, namely (1) textual features, (2) URL features,
(3) simple locative features, and (4) high-level locative features. Textual
and URL features are directly extracted from either the text or the URL of the
document, whereas the remaining two classes require geographic text mining.
In the case of the textual features, the idea was to capture the thematic
aspects, encoded in the document’s terminology, that can influence the decision
of assigning a document to either a global or local class. For instance, documents
about restaurants or pharmacies are more likely to be local than documents
about programming languages or music downloads.
The Yahoo! Term Extraction Web service, a state-of-the-art industrial tool
for key term extraction, was used to discover important words in the documents.
Its implementation is available via an open Web service, which takes a text
document as input and returns a list of significant words or phrases extracted
from the document.
The full set of textual features is shown below:
– Word stems occurring in the lowercased document text, weighted according
to the term frequency vs. inverse document frequency scheme (TF/IDF).
Stopwords were removed according to the list provided by the Weka package.
– Lowercased words selected by the Yahoo! Term Extraction service as the
most important in the document, weighted through the TF/IDF scheme.
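The TF/IDF weighting of lowercased terms can be sketched as follows. This is a simplified illustration under stated assumptions: the stopword set is a tiny stand-in for the Weka list, stemming is omitted, the weight is the raw term count multiplied by the logarithm of the inverse document frequency, and the example documents are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny stand-in stopword list (the experiments use the list shipped with Weka).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dictionary per document."""
    tokenized = [tokenize(document) for document in documents]
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


docs = [
    "Pharmacies in Lisbon",
    "Restaurants in Lisbon",
    "Programming languages tutorial",
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that occur in fewer documents of the collection receive higher weights, so in the first toy document the term pharmacies outweighs lisbon.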
When classifying Web documents, another source of information that can
be used for classification is their Uniform Resource Locator (URL). Previous
research has shown that classifiers built from features based solely on document
URLs can achieve surprisingly good results on tasks such as language identification
or topic attribution. Intuitively, URLs contain information that
can be used to discriminate between local and global pages, such as top-level
domains or words such as local or regional. For instance, a document whose URL
has a top-level domain .uk is more likely to be local than a document with a
top-level domain such as .com. Taking inspiration from the experiments reported
by Baykan et al., the following features were considered:
– Character n-grams, with n varying between 4 and 8, extracted from the
lower-cased document URLs and weighted according to the TF/IDF scheme.
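The extraction of character n-grams from a URL can be sketched as below. The sketch assumes that a window slides over the entire lowercased URL string, without first splitting it into tokens; the function name and the example URL are hypothetical. The resulting counts would then be weighted with TF/IDF across the collection.

```python
from collections import Counter


def url_ngrams(url, n_min=4, n_max=8):
    """Count all character n-grams of length n_min..n_max in a lowercased URL."""
    url = url.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the URL string.
        for start in range(len(url) - n + 1):
            counts[url[start:start + n]] += 1
    return counts


grams = url_ngrams("http://www.example.co.uk/local/restaurants")
print(grams[".co.uk"], grams["local"])
```

N-grams such as .co.uk or local capture the top-level domain and the locality cues mentioned above without requiring an explicit URL tokenizer.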
Simple locative features essentially correspond to counts for locations recog-
nized in the documents, through the use of the geographic text mining services
provided by Yahoo!. The Placemaker text mining service provides functionalities