Ari Pirkola

Tampere University | UTA · School of Information Sciences

About

62
Publications
13,471
Reads
1,472
Citations

Publications (62)
Article
Full-text available
We present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents, which handles both binary and graded relevance. τ stems from the observation that searching is currently taken for granted and it is natural for users to assume that search engines are available and work well...
Article
Full-text available
We present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents, which handles both binary and graded relevance. τ stems from the observation that searching is currently a commodity, and it is natural for users to assume that it is available and works well. As a consequence, users may assume the utilit...
Conference Paper
Full-text available
Measuring is a key to scientific progress. This is particularly true for research concerning complex systems. Multilingual and multimedia information access systems, such as search engines, are increasingly complex: they need to satisfy diverse user needs and support challenging tasks. Their development calls for proper evaluation methodologies t...
Conference Paper
Full-text available
The contributions of this paper are twofold. First, we present a new type of dictionary that is intended as a search assistance in topic-specific Web searching. The method to construct the dictionary is a general method that can be applied to any reasonable topic. The first implementation deals with climate change. The dictionary has the following...
Conference Paper
We describe a topic-specific Web search system focused on quality pages and argue that there is a need for such quality-based topic-specific search tools. The first implementation of the search system is available on the Web and it deals with climate change. The key idea is to crawl (using a focused crawling technique) in known trusted sites and in...
Conference Paper
Focused crawling refers to a process of fetching domain-specific pages from the Web. It is an important method to build domain-specific document collections, but it suffers from low recall due to the local nature of crawling algorithms associated with Web's community structure. In this study, we address the problem of limited crawling scope of focu...
Conference Paper
There is overwhelming evidence suggesting that the real users of IR systems often prefer using extremely short queries (one or two individual words) but they try out several queries if needed. Such behavior is fundamentally different from the process modeled in the traditional test collection-based IR evaluation based on using more verbose queries...
Article
Introduction. Investigates how effectively Web search engines index new sites from different countries. The primary interest is whether new sites are indexed equally or whether search engines are biased towards certain countries. If major search engines show biased coverage it can be considered a significant economic and political problem because o...
Conference Paper
Focused crawlers are programs that selectively download Web documents (pages), restricting the scope of crawling to a specific domain or topic. We investigate different focused crawling strategies including the use of data fusion in focused crawling. Documents in the domains of genomics and genetics were fetched by Nalanda iVia Focused Crawler usin...
Conference Paper
A focused crawler is a program that fetches Web pages that are relevant to a pre-defined domain. In this paper we consider focused crawling in the domains of genomics and genetics. Crawling is often started with seed URLs that point to central North-American and European universities, research institutions, and other organizations in North-America...
Article
Full-text available
Standard performance tests of information retrieval systems are based on measuring precision at fixed recall levels and averaging results over a large set of test topics. It has recently been demonstrated how seemingly equal average performance may obscure important differences between search methods. The in-depth analysis of individual q...
Article
Full-text available
In this study, the effects of query structure and various dictionary-based translation methods on the performance of cross-language information retrieval (CLIR) were tested. Query types studied were concept based, i.e., Boolean queries, and structured and unstructured natural language queries. The structuring of natural language queries was done on...
Article
Full-text available
CLIR resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the...
Article
Introduction. Chemical substance names are long, complex and prone to variation. This study investigates the retrieval effects of the variation. Method. A large set of acronyms and associated text parts was extracted from a subset of the Medline collection and used to construct a full name - acronym index. A longest common subsequence and statistic...
Conference Paper
Modeling the beyond-topical aspects of relevance is currently gaining popularity in IR evaluation. For example, the discounted cumulated gain (DCG) measure implicitly models some aspects of higher-order relevance via diminishing the value of relevant documents seen later during retrieval (e.g., due to information cumulated, redundancy, and effort)...
Article
We propose a method for performing evaluation of relevance feedback based on simulating real users. The user simulation applies a model defining the user’s relevance threshold to accept individual documents as feedback in a graded relevance environment; user’s patience to browse the initial list of retrieved documents; and his/her effort in providi...
Conference Paper
Cross-language Information Retrieval requires good methods for translating cross-lingual spelling variants which are not covered by the available dictionary resources. FITE-TRT is an established method employing frequency-based identification of translation equivalents received from transformation rule based translation. This study further develops...
Article
We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the technique called frequency-based identification of translation equivalents (FITE) was tested using biological and medical cross-lingual spelling variants and...
Article
Full-text available
Experience paper. World Wide Web contains billions of publicly available documents (pages) and it grows and changes rapidly. Web search engines, such as Google and AltaVista, provide access to indexable Web documents. An important part of a search engine is a Web crawler whose function is to collect Web pages for the search eng...
Conference Paper
Full-text available
Experiments on the effectiveness of relevance feedback with real users are time-consuming and expensive. This makes simulation for rapid testing desirable. We define a user model, which helps to quantify some interaction decisions involved in simulated relevance feedback. First, the relevance criterion defines the relevance threshold of the user to...
Conference Paper
Full-text available
We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the devised FITE (frequency-based identification of translation equivalents) technique was tested using biological and medical cross-lingual spelling variants an...
Article
Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being thus spelling variants of each other. In this paper we present a novel two-step fuzzy translation technique fo...
Conference Paper
Full-text available
Article
Full-text available
In this study the basic framework and performance analysis results are presented for the three year long development process of the dictionary-based UTACLIR system. The tests expand from bilingual CLIR for three language pairs Swedish, Finnish and German to English, to six language pairs, from English to French, German, Spanish, Italian, Dutch and...
Conference Paper
Full-text available
We submitted runs for Genomics Track's ad hoc retrieval task. The first official run (utaauto) was an automatic run and the second (utamanu) manual. For utaauto, the main features of query formulation were the removal of performative and marginally topical words from the topics based on average term frequency statistics, the removal of stop-words,...
Article
Full-text available
This study reports on the first experiments ever to apply dictionary-based query translation techniques to Afrikaans queries submitted to an English database. The system was evaluated using 35 topics from the CLEF 2001 English language collection (title and descriptions). To show the performance level of the test queries, the original English queri...
Conference Paper
The UTACLIR system of University of Tampere uses a dictionary-based CLIR approach. The idea of UTACLIR is to recognize distinct source key types and process them accordingly. The linguistic resources utilized by the framework include morphological analysis or stemming in indexing, normalization of topic words, stop word removal, splitting of compou...
Conference Paper
Full-text available
Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. T...
Conference Paper
This article deals with both multilingual and bilingual IR. The source language is English, and the target languages are English, German, Finnish, Swedish, Dutch, French, Italian and Spanish. The approach of separate indexes is followed, and four different merging strategies are tested. Two of the merging methods are classical basic methods: the Ra...
Article
We will explore various ways to apply query structuring in cross-language information retrieval. In the first test, English queries were translated into Finnish using an electronic dictionary, and were run in a Finnish newspaper database of 55,000 articles. Queries were structured by combining the Finnish translation equivalents of the same English...
Article
Full-text available
This study focuses on the intellectual accessibility of information in indigenous languages, using Zulu, one of the main indigenous languages in South Africa, as a test case. Both Cross-Lingual Information Retrieval (CLIR) and metadata are discussed as possible means of facilitating access and a bilateral approach combining these two methods is pro...
Conference Paper
We studied the effects of query expansion and query structure on retrieval performance. Two sets of words frequent in relevant documents for Genomics Track's training topics were collected, the first manually and the second automatically. The high frequency words collected and the names of organisms designated in the test topics, were used as expan...
Conference Paper
We will present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first stage, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second stage, the interme...
Article
There are two main translation approaches in a multilingual information retrieval task: either to translate the topics or to translate the datasets. The first one is an easier and more common approach. There are two indexing approaches: either to index the languages in the same index, or to build separate indexes for different languages. If the la...
Article
Effects of ellipsis and anaphor resolution on proximity searching in a text database are analyzed. Anaphora and ellipses are classified into proper names and common nouns of basic words, compound words, and phrases. 28 queries for which document relevance of data was available, were run in a newspaper database of 55,000 articles. The resolution of...
Article
Full-text available
In this paper, the translation polysemy and the dictionary coverage problems were attacked by means of the combination of a general language MRD and a domain specific MRD, i.e., a medical dictionary. The domain was restricted to medicine and health by choosing as test requests TREC's (see Harman, 1996) health related topics. The performance of transla...
Article
We present a novel n-gram based string matching technique, which we call the targeted s-gram matching technique. In the technique, n-grams are classified into categories on the basis of character contiguity in words. The categories are then utilized in matching. The technique was compared with the conventional n-gram technique using adjacent charac...
Article
This paper presents a dictionary-based query translation and construction method for CLIR. The framework focuses on solving morphological and lexical problems in CLIR generally between many language pairs. An application based on this method, the extendable UTACLIR system, capable of performing query translations between several source and target l...
Article
In an earlier study, we presented a query key goodness scheme, which can be used to separate between good and bad query keys. The scheme is based on the relative average term frequency (RATF) values of query keys. In the present paper, we tested the effectiveness of the scheme in Finnish to English cross-language retrieval in several experiments. Q...
Conference Paper
The Tampere University CLEF research group participated in CLEF2001 with four automated bilingual runs. Our cross-lingual software, UTACLIR, uses an automated method for query construction for cross-language information retrieval (CLIR). This method seeks to automatically extract topical information from request sentences written in one of the sour...
Article
This paper reviews literature on dictionary-based cross-language information retrieval (CLIR) and presents CLIR research done at the University of Tampere (UTA). The main problems associated with dictionary-based CLIR, as well as appropriate methods to deal with the problems are discussed. We will present the structured query model by Pirkola and r...
Article
This paper presents a morphological classification of languages from the IR perspective. Linguistic typology research has shown that the morphological complexity of each language of the world can be described by two variables, index of synthesis and index of fusion. These variables provide a theoretical basis for IR research handling morphological...
Article
Search key resolution power is analyzed in the context of a request, i.e., among the set of search keys for the request. Methods of characterizing the resolution power of keys automatically are studied, and the effects search keys of varying resolution power have on retrieval effectiveness are analyzed. It is shown that it often is possible to iden...
Article
This paper analyzes the features of the Swedish language from the viewpoint of mono- and cross-language information retrieval (CLIR). The study was motivated by the fact that Swedish is known poorly from the IR perspective. This paper shows that Swedish has unique features, in particular gender features, the use of fogemorphemes in the formation of...
Article
Full-text available
We participated in CLEF'2001 with four automated bilingual runs. UTACLIR is an automatic query translation and construction system for cross-language information retrieval. The system automatically extracts topical information from request sentences written in one of the source languages and constructs a target language query, based on translations...
Conference Paper
We designed, implemented and evaluated an automated method for query construction for CLIR from Finnish, Swedish and German to English. This method seeks to automatically extract topical information from request sentences written in one of the source languages and to create a target language query, based on translations given by a translation dicti...
Article
Full-text available
We used a dictionary-based approach, and performed tests in the bilingual track with three language pairs, i.e., Swedish – English (Swe-Eng), Finnish – English (Fin-Eng), and German – English (Ger-Eng). All the source languages are compound languages, i.e., languages rich in compound words. A compound word refers to a multi-word expression where th...
Article
The paper studies concept-based cross-language information retrieval (CLIR). The document collection was a subset of the TREC collection. The test requests were formed from TREC's health related topics. As translation dictionaries the study used a general dictionary and a domain-specific (=medical) dictionary. The effects of translation method, con...
Conference Paper
In this paper, the translation polysemy and the dictionary coverage problems were attacked by means of the combination of a general language MRD and a domain specific MRD, i.e., a medical dictionary. The domain was restricted to medicine and health by choosing as test requests TREC's (see Harman, 1996) health related topics. The performance of tr...
Article
So far, methods for ellipsis and anaphor resolution have been developed and the effects of anaphor resolution have been analyzed in the context of statistical information retrieval (IR) of scientific abstracts. No significant improvement has been observed. In this study, the effects of ellipsis and anaphor resolution on proximity searching in a ful...
Article
Full-text available
The UTACLIR query translation system was originally designed for the CLEF 2000 and 2001 campaigns. In the two first years the query translation application consisted of separate programs based on common translation principles for the language pairs Finnish-English, German-English and Swedish-English. The idea of UTACLIR is based on recognizing d...
Article
Full-text available
This paper presents results from a study, where fuzzy string matching techniques were used as the sole query translation technique in Cross Language Information Retrieval (CLIR) between the closely related languages Swedish and Norwegian. It is a novel research idea to apply only fuzzy string matching techniques in query translation. Closely relate...
Article
We review the applicability of dictionary-based Cross-Language Information Retrieval (CLIR) from Zulu to English. Due to the lack of electronic resources and in particular tools for morphological analysis, novel approaches had to be found to deal with the processes of CLIR. Approximate string matching, combined with a monolingual Zulu word list was...
Article
Web crawling refers to the process of gathering data from the Web. Focused crawlers are programs that selectively download Web documents (pages), restricting the scope of crawling to a pre-defined domain or topic. The downloaded documents can be indexed for a domain specific search engine or a digital library. In this paper, we describe the focused...

Projects

Project (1)
Archived project
Ph.D. thesis (completed 27 Nov 2010)