
Jorma Laurikkala - Tampere University
53 publications, 11,598 reads
2,037 citations
Publications (53)
The aim of this article is to inquire into the potential relationship between changes in crime rates and changes in the gross domestic product (GDP) growth rate, based on historical statistics of Japan. This national-level study used a dataset covering 88 years (1926–2013) and 13 attributes. The data were processed with the self-organizing map (SOM), separa...
Background and objectives: Due to the development of imaging systems, the number of digital images obtained in the biological field has been growing in recent years. These images contain information that is not directly measurable, e.g. the area covered by a single cell. In most of the current imaging programs the regions of interest (ROI), e.g. individu...
The aim of this article is to inquire into correlations between criminal phenomena and demographic factors. This international-level comparative study used a dataset covering 56 countries and 28 attributes. The data were processed with the Self-Organizing Map (SOM), assisted by other clustering methods, and several statistical methods for obtaining c...
Homicide is one of the most serious kinds of offenses. Research on the causes of homicide has never reached a definite conclusion. The purpose of this article is to place homicide in its broad social context and to seek correlations between this offense and other macroscopic socioeconomic factors. This international-level comparative study used a dat...
Calcium cycling is crucial in the excitation-contraction coupling of cardiomyocytes, and therefore has a key role in cardiac functionality. Cardiac disorders and different drugs alter the calcium transients of cardiomyocytes and can cause serious dysfunction of the heart. New insights into this biochemical phenomenon can be achieved by studying and...
The authors present their results of using saccadic eye movements for biometric user verification. The method can be applied to computers or other devices, in which it is possible to include an eye movement camera system. Thus far, this idea has been little researched. As they have extensively studied eye movement signals for medical applica...
The main target of this paper was to study the influence of training data quality on the text document classification performance of machine learning methods. A graded relevance corpus of ten classes and 957 text documents was classified with Self-Organising Maps (SOMs), learning vector quantisation, k-nearest neighbours searching, naïve Bayes and...
Induced pluripotent stem cell (iPSC) lines derived from skin fibroblasts of patients suffering from cardiac disorders were differentiated into cardiomyocytes and used to generate a data set of Ca²⁺ transients of 136 recordings. The objective was to separate normal signals for later medical research from abnormal signals. We constructed a signal ana...
We recently studied the application of saccadic eye movements, measured with video cameras, to biometric verification using subjects who receive identical stimulation. The properties of a subject's saccades may vary between measurements over the course of time, so to be useful as a means of biometric verification, the temporal variability of saccad...
Preprocessing of data is a vital part of any task involving machine learning. In the classification of text documents, the most important aspect of preprocessing is usually the dimensionality reduction of data vectors. This paper focuses on the use of a recent scatter method in the dimensionality reduction of text documents. The effectiveness of th...
Using five medical datasets, we detected the influence of missing values on true positive rates and classification accuracy. We randomly marked more and more values as missing and tested their effects on classification accuracy. The classifications were performed with nearest neighbour searching when none, 10, 20, 30% or more values were missing. We...
This paper focuses on the use of self-organising maps, also known as Kohonen maps, for the classification task of text documents. The aim is to effectively and automatically classify documents to separate classes based on their topics. The classification with self-organising map was tested with three data sets and the results were then compared to...
This research deals with the use of self-organising maps for the classification of text documents. The aim was to classify documents to separate classes according to their topics. We therefore constructed self-organising maps that were effective for this task and tested them with German newspaper documents. We compared the results gained to those o...
Purpose – The aim of this paper is to explore the possibility of retrieving information with Kohonen self-organising maps, which are known to be effective to group objects according to their similarity or dissimilarity. Design/methodology/approach – After conventional preprocessing, such as transforming into vector space, documents from a German do...
CLIR resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the...
Clustering groups document objects represented as vectors. An extensive vector space may cause obstacles to applying these methods. Therefore, the vector space was reduced with principal component analysis (PCA). The conventional cosine measure is not the only choice with PCA, which involves the mean-correction of data. Since mean-correction change...
Heterogeneous Euclidean-overlap metric and heterogeneous value difference metric given in machine learning literature are useful for the consideration of mixed-type data for machine learning, pattern recognition and data mining tasks. Mixed-type variables are quite common in practical problems, but this property has been taken into account only sel...
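The heterogeneous Euclidean-overlap metric (HEOM) named in the abstract above can be illustrated with a minimal Python sketch. It assumes the definition commonly given in the machine learning literature: overlap distance for nominal attributes, range-normalised absolute difference for numeric attributes, and the maximal per-attribute distance of 1 when a value is missing. The function name `heom` and its parameters are hypothetical and are not taken from the publication.

```python
import math


def heom(x, y, nominal, ranges):
    """Sketch of HEOM between two records (assumed textbook definition).

    x, y     -- sequences of attribute values; None marks a missing value
    nominal  -- set of indices of nominal (unordered) attributes
    ranges   -- dict mapping each numeric attribute index to (max - min)
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            d = 1.0                        # missing value: maximal distance
        elif a in nominal:
            d = 0.0 if xa == ya else 1.0   # overlap distance for nominal values
        else:
            d = abs(xa - ya) / ranges[a]   # range-normalised numeric difference
        total += d * d
    return math.sqrt(total)


# Attribute 0 is numeric with range 4.0; attribute 1 is nominal.
print(heom((1.0, "a"), (3.0, "b"), nominal={1}, ranges={0: 4.0}))
```

Under this convention a record with a missing value has a nonzero distance to itself, which is one reason the handling of missing values matters when such functions are treated as metrics.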
We improved the classification ability of multilayer perceptron networks by constructing a set of as many networks as there are output classes and investigated the influence of different input variables on the classification. We have developed methods named scattering, spectrum and response analysis to express the classification complexity, especially the...
We studied the efficiency of multilayer perceptron networks to classify eight different medical data sets with typical problems connected to their strongly non-uniform distributions between output classes and relatively small sizes of training sets. We studied especially the possibility mentioned in the literature of balancing a class distribution...
Information retrieval systems' ability to retrieve highly relevant documents has become more and more important in the age of extremely large collections, such as the World Wide Web (WWW). The authors' aim was to find out how corpus-based cross-language information retrieval (CLIR) manages in retrieving highly relevant documents. They created a Fin...
We present a method for creating a comparable text corpus from two document collections in different languages. The collections can be very different in origin: in this study we build a comparable corpus from articles by a Swedish news agency and a U.S. newspaper. The keys with best resolution power were extracted from the documents of one collecti...
Purpose – To present a method for creating a comparable document collection from two document collections in different languages. Design/methodology/approach – The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the relative average term frequency formula. The keys were translated into Engl...
Search facilitated with agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) that was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clu...
IR systems' ability to retrieve highly relevant documents has become more and more important in the age of extremely large collections, such as the WWW. Our aim was to find out how corpus-based CLIR manages in retrieving highly relevant documents. We created a Finnish-Swedish comparable corpus and used it as a source of knowledge for query transla...
Proximity functions evaluate distances or similarities between objects. Unlike the Euclidean distance, heterogeneous proximity functions process variables differently according to their scale. The correct evaluation of nominal variables, whose values are unordered, is especially important. We compared five heterogeneous functions with the Euclidean...
Stemming and lemmatization were compared in the clustering of Finnish text documents. Since Finnish is a highly inflectional and agglutinative language, we hypothesized that lemmatization, involving splitting of the compound words, would be a more appropriate normalization approach than the straightforward stemming. The relevance of the documents wer...
We studied three different methods to improve identification of small classes, which are also difficult to classify, by balancing an imbalanced class distribution with data reduction. The new method, neighborhood cleaning (NCL) rule, outperformed simple random sampling within classes and one-sided selection method in the experiments with ten real w...
In the next study we consider two distance metrics that were presented in the machine learning literature for mixed-type variables. We show that they are not really metrics, but pseudometrics. The problem arose from missing values. The metrics can be redefined to satisfy the metricity definition. Distance computation can then be performed reliably w...
We studied pre-processing of a female urinary incontinence data set by removing uninformative variables, outliers, and noise, to allow hierarchical clustering methods to find partitions that resemble the diagnostic classes. Outliers were identified with box plots and Mahalanobis distances, while noisy cases were detected with the repeated edited ne...
We evaluated parameters for an expert system which will be designed to aid the differential diagnosis of female urinary incontinence by using knowledge discovered from data. To allow the statistical analysis, we applied means, regression and Expectation-Maximization (EM) imputation methods to fill in missing values. In addition, complete-case analy...
We studied three methods to improve identification of difficult small classes by balancing imbalanced class distribution with data reduction. The new method, neighborhood cleaning rule (NCL), outperformed simple random and one-sided selection methods in experiments with ten data sets. All reduction methods improved identification of small classes (...
Machine learning methods such as neural networks, decision trees and genetic algorithms can be useful to aid in the classification of patients. We tested Kohonen artificial neural networks, which are known to be effective for classification tasks. Our sample included patients with six different diseases. The Kohonen network algorithm recognized the...
We studied the use of virtual reality technology as a stimulus in balance examinations. A pilot study was made using a small group of healthy subjects to investigate the effect of alcohol and virtual reality stimulus on the subjects' balance. The tests showed that blood alcohol concentration accounted for almost 50% of the increased lateral body sw...
We investigated the capability of multilayer perceptron neural networks and Kohonen neural networks to distinguish difficult otoneurological diseases from each other. We found that they are efficient methods, but the distribution of a learning set should be rather uniform. It is also important that the number of learning cases is sufficient. If the t...
We have developed an OtoNeurological Expert system (ONE) to aid the diagnostics of vertigo, to assist teaching and to implement the database for research. The database contains detailed information on the patient history, signs and test results necessary for the diagnostic work with vertiginous patients. The pattern recognition method was used in t...
In this paper, machine learning methods based on artificial intelligence theory are applied to the computer-aided decision making of some otoneurological diseases, for example Ménière's disease. Three methods explored are decision trees, genetic algorithms and neural networks. By using such a machine learning method, the decision-making program is...
A novel machine learning system, Galactica, has been developed for knowledge discovery from databases. This system was applied to discover diagnostic rules from a patient database containing 564 cases with vestibular schwannoma, benign paroxysmal positional vertigo, Ménière's disease, sudden deafness, traumatic vertigo and vestibular neuritis diagn...
The usefulness of imputation in the treatment of missing values of an otoneurologic database for the discriminant analysis was evaluated on the basis of the agreement of imputed values and the analysis results. The data consisted of six patient groups with vertigo (N=564). There were 38 variables and 11% of the data was missing. Missing values were...
Informal box plot identification of outliers in real-world medical data was studied. Box plots were used to detect univariate outliers directly, whereas the box plotted Mahalanobis distances identified multivariate outliers. Vertigo and female urinary incontinence data were used in the tests. The removal of outliers increased the descriptive classi...
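The univariate box plot check mentioned in the abstract above can be sketched in Python. This is an illustration only: it assumes the conventional box plot rule, in which values lying more than 1.5 interquartile ranges beyond the quartiles are flagged, and the abstract does not state which fences were actually used. The function name `boxplot_outliers` and the parameter `k` are hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return the values lying outside the box plot fences.

    Assumes the conventional rule: fences at Q1 - k*IQR and Q3 + k*IQR.
    Quartiles use the default method of statistics.quantiles.
    """
    q1, _, q3 = quantiles(values, n=4)   # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


print(boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))  # → [100]
```

The same function could be applied to a list of Mahalanobis distances to flag multivariate outliers, provided those distances are computed elsewhere.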
Heterogeneous proximity functions are similarity or distance functions which process data differently according to the scale of attributes. We compared two Minkowskian distance functions with three heterogeneous proximity functions to test whether these functions were better in data sets with attributes of mixed type. Significant differences in nea...
Data on patients with Ménière's disease, vestibular schwannoma, traumatic vertigo, sudden deafness, benign paroxysmal positional vertigo, or vestibular neuritis were retrieved from the database of the otoneurologic expert system ONE for the development and testing of a genetic algorithm (GA). The accuracy of the diagnostic rules in solving the test cas...
Galactica, a newly developed machine-learning system that utilizes a genetic algorithm for learning, was compared with discriminant analysis, logistic regression, k-means cluster analysis, a C4.5 decision-tree generator and a random bit climber hill-climbing algorithm. The methods were evaluated in the diagnosis of female urinary incontinence in te...
We have studied computer-aided diagnosis of otoneurological diseases which are difficult, even for experienced specialists, to determine and separate from each other. Since neural networks require plenty of training data, we restricted our research to the commonest otoneurological diseases in our database and to the most essential parameters u...
Usefulness of imputation in the treatment of missing values in an otologic database was studied. Missing values were filled in with means (ME), regression (LR) and Expectation-Maximization (EM) imputation methods. A random imputation method (RA) provided baseline results. ME, LR and EM methods agreed on 41-42% of the imputed missing values. The lev...
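The simplest of the imputation methods named in the abstract above, mean imputation, can be sketched in Python as follows. This is a minimal illustration assuming numeric columns with missing entries marked as None; the function name `impute_means` is hypothetical, and the regression and EM methods are not covered.

```python
def impute_means(rows):
    """Fill None entries with the mean of the observed values in each column.

    Assumes every column has at least one observed value; an entirely
    missing column would raise ZeroDivisionError.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if v is None else v for c, v in enumerate(r)]
            for r in rows]


data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(impute_means(data))  # → [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```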
We have developed an otoneurological expert system (ONE) to aid the diagnostics of vertigo, to assist teaching and to implement a database for research. The ONE database is set to harvest data on patient history, signs and test results necessary for diagnostic work with vertiginous patients. A method based on pattern recognition was used in the rea...
Galactica, a newly developed machine-learning system that utilizes a genetic algorithm for learning, was compared with discriminant analysis, logistic regression, k-means cluster analysis, a C4.5 decision-tree generator and a random bit climber hill-climbing algorithm. The methods were evaluated in the diagnosis of female urinary incontinence in te...
A machine learning system named Galactica has been developed which uses a genetic algorithm to discover the rules for an expert system from databases. Galactica devised accurate diagnostic rules for female urinary incontinence from difficult heterogeneous data. The percentages of correctly classified stress, mixed and sensory urge incontinence test...
Population size and quality are parameters which control the performance of genetic algorithms. We researched these parameters in a genetic-based machine learning system Galactica which was used to discover the differential diagnostic rules for female urinary incontinence from case data. The performance of the system was measured with on-line and o...
Female urinary incontinence is a difficult problem not only for the patient but also for the physician. In the differential diagnosis of female urinary incontinence the physician has to determine a diagnostic class for the patient. This task is complex because of the unreliable patient history and the overlapping class boundaries. In order to develop an expert...