Tuomo Korenius’s research while affiliated with Tampere University and other places


Publications (3)


On principal component analysis, cosine and Euclidean measures in information retrieval
  • Article

November 2007 · 1,913 Reads · 167 Citations · Information Sciences

Tuomo Korenius

Clustering groups document objects represented as vectors. A very high-dimensional vector space may, however, hinder the application of clustering methods. Therefore, the vector space was reduced with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves mean-correction of the data. Since mean-correction changes the location of the origin, the angles between the document vectors also change. To avoid this, we used the connection between the cosine measure and the Euclidean distance in association with PCA, and grounded searching on the latter. We applied single linkage, complete linkage, and Ward clustering to Finnish documents, utilizing their relevance assessments as a new feature. After normalization of the data, PCA was run and the relevant documents were clustered.
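
The identity behind this choice can be sketched in a few lines of Python (a toy illustration, not the paper's code): for L2-normalised vectors, squared Euclidean distance and cosine similarity are monotonically related, and pairwise Euclidean distances, unlike angles, are unchanged by the mean-correction that PCA performs.

    import numpy as np

    rng = np.random.default_rng(0)
    docs = rng.random((5, 20))                                  # toy document vectors
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalise

    a, b = docs[0], docs[1]
    cos = a @ b
    d2 = np.sum((a - b) ** 2)
    print(np.isclose(d2, 2 - 2 * cos))                          # True: d^2 = 2(1 - cos) for unit vectors

    # Mean-correction (as in PCA) moves the origin: angles change, distances do not.
    centered = docs - docs.mean(axis=0)
    ca, cb = centered[0], centered[1]
    cos_c = (ca @ cb) / (np.linalg.norm(ca) * np.linalg.norm(cb))
    print(np.isclose(cos, cos_c))                               # generally False
    print(np.isclose(d2, np.sum((ca - cb) ** 2)))               # True: distance preserved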


Figure 1. Dendrogram produced by the single linkage method. …
Table 1. Statistics describing the recall, precision, and effectiveness measures of the best cluster produced with the single linkage (SL), complete …
Figure 2. Scree plot for the document sample. The first 1,600 out of the total 4,999 …
Table 2. Descriptive statistics of the recall, precision, and effectiveness (E) measures for the nearest neighbor (NN) and single linkage (SL) clustering-based searches. Graded relevance assessment was collapsed into a binary one so that (A) all documents …

Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments
  • Article
  • Full-text available

January 2006 · 327 Reads · 8 Citations · Information Retrieval Journal

Search facilitated with agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) that was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clusters were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one by (A) considering all the relevant and (B) only the highly relevant documents relevant, respectively. Single linkage (SL) was the worst method. It created many tiny clusters, and, consequently, searches enabled with it had high precision and low recall. The complete linkage (CL), average linkage (AL), and Ward's methods (WM) returned reasonably sized clusters, typically of 18-32 documents. Their recall (A: 27-52%, B: 50-82%) and precision (A: 83-90%, B: 18-21%) were higher than and comparable to those of the SL clusters, respectively. The AL and WM clustering had 1-8% better effectiveness than nearest neighbor searching (NN), and SL and CL were 1-9% less effective than NN. However, the differences were statistically insignificant. When evaluated with the liberal assessment A, the results suggest that the AL and WM clustering offer better retrieval ability than NN. Assessment B renders the AL and WM clustering better than NN when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered like collections in English, provided that the documents are appropriately preprocessed.
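
A rough sketch of that pipeline in Python, with toy data and illustrative parameter choices (sample size, number of components, cut level); for brevity the "best" cluster is picked here by recall against a toy relevance set rather than by query matching as in the paper.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    doc_vectors = rng.random((200, 300))                # toy document-term matrix
    relevant = set(rng.choice(200, 15, replace=False))  # toy relevance judgments

    reduced = PCA(n_components=50).fit_transform(doc_vectors)

    Z = linkage(reduced, method="ward")                 # or "single", "complete", "average"
    labels = fcluster(Z, t=20, criterion="maxclust")    # heuristic cut into 20 clusters

    def recall_precision(cluster_id):
        retrieved = {i for i, c in enumerate(labels) if c == cluster_id}
        hits = len(retrieved & relevant)
        return hits / len(relevant), (hits / len(retrieved) if retrieved else 0.0)

    def effectiveness(recall, precision, alpha=0.5):
        # van Rijsbergen's E measure (one common parameterisation; the abstract
        # does not state which variant the paper used)
        if recall == 0 or precision == 0:
            return 1.0
        return 1.0 - 1.0 / (alpha / precision + (1 - alpha) / recall)

    best = max(set(labels), key=lambda c: recall_precision(c)[0])
    r, p = recall_precision(best)
    print(best, r, p, effectiveness(r, p))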

Stemming and lemmatization in the clustering of Finnish text documents

November 2004 · 3,266 Reads · 210 Citations

Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, which involves splitting compound words, would be a more appropriate normalization approach than straightforward stemming. The relevance of the documents was evaluated with a four-point relevance assessment scale, which was collapsed into a binary one by considering all the relevant and only the highly relevant documents relevant, respectively. Experiments with four hierarchical clustering methods supported the hypothesis. The stringent relevance scale showed that lemmatization allowed the single and complete linkage methods to recover especially the highly relevant documents better than stemming did. In comparison with stemming, lemmatization together with the average linkage and Ward's methods produced higher precision. We conclude that lemmatization is a better word normalization method than stemming when Finnish text documents are clustered for information retrieval.
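
For illustration only (the abstract does not name the tools used), the contrast can be sketched in Python: NLTK ships a Snowball stemmer with a Finnish variant, whereas proper lemmatization with compound splitting requires a morphological analyser such as Voikko or FINTWOL, represented below only by a hypothetical stub.

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("finnish")

    words = ["taloissa", "kirjastoautoissa"]      # "in the houses", "in the bookmobiles"
    print([stemmer.stem(w) for w in words])       # crude suffix stripping; compounds stay whole

    def lemmatize(word):
        """Hypothetical stand-in for a morphological analyser (e.g. Voikko or FINTWOL)
        that would return base forms and split compounds, e.g.
        "kirjastoautoissa" -> ["kirjasto", "auto"]."""
        raise NotImplementedError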

Citations (3)


... We opted for lemmatization, instead of stemming, considering that the latter reduces words to their common root by removing or replacing word suffixes (e.g., "loading" is stemmed as "load"), while the former identifies the inflected forms of a word and returns its base form (e.g. "better" is lemmatized as "good") [28]. All the details regarding the various configurations used are available at [4]. ...

Reference:

An Empirical Study on the Classification of Bug Reports with Machine Learning
Stemming and lemmatization in the clustering of Finnish text documents

... We use Ward's method [31] to produce our clusters, which intuitively would be well-suited for this task, as it would seek to minimize the distance between the original vector and the pooled outputs. Additionally, it has generally been observed to perform well for text data [12,25]. ...

Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments

Information Retrieval Journal

... The Euclidean coefficient [77,78] is commonly used to compare the differences between the element values of vectors. This coefficient is easy to understand and intuitive, reflecting the actual distance between points in a multidimensional space: ...
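
The formula dropped from the end of this snippet is presumably the standard Euclidean distance between two n-dimensional vectors x and y, in LaTeX:

    d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}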

On principal component analysis, cosine and Euclidean measures in information retrieval
  • Citing Article
  • November 2007

Information Sciences