Heikki Mannila
Aalto University · Department of Computer Science
292 Publications
52,463 Reads
25,961 Citations
Publications (292)
There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict...
In this note, we discuss the applicability of latent variable models as a tool in analyzing the structure of a research system. We consider whether tensor methods, especially Parallel Factor Analysis, are appropriate for the description of the personnel structure and publication results of different scientific disciplines in different universities....
Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context...
A chronofauna is a geographically restricted collection of interacting animal populations that maintains its base structure over a long period of time. We describe a simple computational method that can identify candidate chronofaunas on the basis of presence-absence matrices only: A candidate chronofauna is a collection of sites that share an exce...
Large sparse sets of binary transaction data with millions of records and
thousands of attributes occur in various domains: customers purchasing
products, users visiting web pages, and documents containing words are just
three typical examples. Real-time query selectivity estimation (the problem of
estimating the number of rows in the data satisfyi...
Aim Our aims were to test: (1) the extent to which vascular plant associations are related in space to mammalian associations, and (2) whether the plant associations are more closely related than the mammalian associations to climate and to a published environmental stratification of Europe.
Location Europe, as defined by the following boundaries:...
The object of this study was to identify temperament patterns in the Finnish population, and to determine the relationship between these profiles and life habits, socioeconomic status, and health.
A cluster analysis of the Temperament and Character Inventory subscales was performed on 3,761 individuals from the Northern Finland Birth Cohort 1966 an...
Investigation of the environmental influences on human behavioral phenotypes is important for our understanding of the causation of psychiatric disorders. However, there are complexities associated with the assessment of environmental influences on behavior.
We conducted a series of analyses using a prospective, longitudinal study of a nationally r...
Differences in early life measures between male temperament clusters.
(DOC)
Early life measures predicting individual temperament dimensions, as measured by the Temperament and Character Inventory, which survived correction for females.
(DOC)
Early life measures predicting individual temperament dimensions, as measured by the Temperament and Character Inventory, which survived correction for males.
(DOC)
Early life measures predicting group membership of each female temperament cluster separately.
(DOC)
Early life measures predicting group membership of each male temperament cluster separately.
(DOC)
Differences in early life measures between female temperament clusters.
(DOC)
Differences in average grades in adolescence between temperament clusters for females and males.
(DOC)
Clusterings based on NFBC66 four-cluster model vs. YF two-cluster model.
(DOC)
Self-rated physical capacity, life habits, health and stress reactivity data from the 31-year follow-up in NFBC66 females.
(DOC)
Early life measures predicting temperament dimension scores for females.
(DOC)
Histograms of chi-square values. Chi-square values of 100 experiments using generated data and of results based on cross-tabulation of 4-cluster solutions (presented in Table S2), with green for females and red for males.
(TIF)
Early life measures predicting temperament dimension scores for males.
(DOC)
Clusterings based on NFBC66 four-cluster model vs. YF four-cluster model.
(DOC)
Clusterings based on NFBC66 two-cluster model vs. YF two-cluster model.
(DOC)
Self-rated physical capacity, life habits, health and stress reactivity data from the 31-year follow-up in NFBC66 males.
(DOC)
We used 10-km grid data from the Finnish Bird Atlas data and high-resolution data on temperature and rainfall to estimate species richness from climate and environmental variables across spatial scales. We used an ordinary least-squares (OLS) linear-regression model with a quadratic error function to estimate the number of bird species that occur....
Comparing frequency counts over texts or corpora is an important task in many applications and scientific disciplines. Given
a text corpus, we want to test a hypothesis, such as “word X is frequent”, “word X has become more frequent over time”, or
“word X is more frequent in male than in female speech”. For this purpose we need a null model of word...
Finding who and what is “important” is an ever-occurring question. Many methods that aim at characterizing important items
or influential individuals have been developed in areas such as bibliometrics, social-network analysis, link analysis, and
web search. In this paper we study the problem of attributing influence scores to individuals who accom...
Appendix: Periodicity score calculations. Additional file 1 contains more detailed mathematical derivations of the results in section Analysis of periodicity score distributions.
Modern high-throughput measurement technologies such as DNA microarrays and next generation sequencers produce extensive datasets. With large datasets the emphasis has been moving from traditional statistical tests to new data mining methods that are capable of detecting complex patterns, such as clusters, regulatory networks, or time series period...
Multidimensional 0-1 data occurs in many domains. Typically one assumes that the order of rows and columns has no importance.
However, in some applications, e.g., in ecology, there is structure in the data that becomes visible only when the rows and
columns are permuted in a certain way. Examples of such structure are different forms of nestedness...
Correlation between occurrences of taxa is a fundamental concept in the analysis of presence-absence data. Such correlations can result from ecologically relevant processes, such as existence and evolution of species communities. Correlations are typically quantified by some sort of similarity index based on co-occurrence counts. We argue that the...
We introduce the Boolean inductive query evaluation problem, which is concerned with answering inductive queries that are
arbitrary Boolean expressions over monotonic and anti-monotonic predicates. Boolean inductive queries can be used to address
many problems in data mining and machine learning, such as local pattern mining and concept-learning, a...
We introduce a well-grounded minimum description length (MDL) based quality measure for a clustering consisting of either
spherical or axis-aligned normally distributed clusters and a cluster with a uniform distribution in an axis-aligned rectangular
box. The uniform component extends the practical usability of the model e.g. in the presence of noi...
Assume a network (V,E) where a subset of the nodes in V are active. We consider the problem of selecting a set of k active nodes that best explain the observed activation state, under a given information-propagation model. We call these nodes effectors. We formally define the k-Effectors problem and study its complexity for different types of graph...
Many sorts of structured data are commonly stored in a multi-relational format of interrelated tables. Under this relational model, exploratory data analysis can be done by using relational queries. As an example, in the Internet Movie Database (IMDb) a query can be used to check whether the average rank of action movies is higher than the average...
Background: Fossil data sets are typically point-like, i.e. they provide information about a fossil fauna only for scattered localities. Modern distribution data are typically based on grid cells, and provide a (nearly) full description of the fauna. Question: How good are estimates of the characteristics of the whole fauna that one obtains by look...
We analyzed the dynamics of carbon balance components: gross primary
production (GPP) and total ecosystem respiration (TER), of a boreal Scots
pine forest in Southern Finland. The main focus is on investigations of
environmental drivers of GPP and TER and how they affect the inter-annual
variation in the carbon balance in autumn (September–December...
Segmentation is a general data mining technique for summarising and analysing sequential data. Segmentation can be applied, e.g., when studying large-scale genomic structures such as isochores. Choosing the number of segments remains a challenging question. We present extensive experimental studies on model selection techniques, Bayesian Informatio...
Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess...
Interaction graphs are ubiquitous in many fields such as bioinformatics, sociology and physical sciences. There have been many studies in the literature targeted at studying and mining these graphs. However, almost all of them have studied these graphs ...
Data mining research has developed many algorithms for various analysis tasks on large and complex datasets. However, assessing
the significance of data mining results has received less attention. Analytical methods are rarely available, and hence one
has to use computationally intensive methods. Randomization approaches based on null models provid...
This work shows how concepts from the electromagnetic field theory can be efficiently used in clustering with constraints.
The proposed framework transforms vector data into a fully connected graph, or just works straight on the given graph data.
User constraints are represented by electromagnetic fields that affect the weight of the graph’s edges....
While DTNBP1, DISC1, and NRG1 have been extensively studied as candidate genes of schizophrenia, results remain inconclusive. Possible explanations for this are that the genes might be relevant only to certain subtypes of the disease and/or only in certain populations.
We performed unsupervised clustering of individuals from Finnish schizophrenia f...
A bipartite graph G=(U,V,E) is a chain graph [M. Yannakakis, SIAM J. Algebraic Discrete Methods 2, 77–79 (1981; Zbl 0496.68033)] if there is a bijection π:{1,⋯,|U|}→U such that Γ(π(1))⊇Γ(π(2))⊇⋯⊇Γ(π(|U|)), where Γ is a function that maps a node to its neighbors. We give approximation algorithms for two variants of the minimum chain completion probl...
We analyzed the dynamics of carbon balance components: gross primary production (GPP) and total ecosystem respiration (TER), of a boreal Scots pine forest in Southern Finland. Our aim was to study how these dynamics are related to different environmental conditions and how they affect the inter-annual variation in the carbon balance in autumn (Sept...
In recent years, there has been significant interest in the development of ranking functions and efficient top-k retrieval algorithms to help users in ad hoc search and retrieval in databases (e.g., buyers searching for products in a catalog). We introduce a complementary problem: How to guide a seller in selecting the best attributes of a new tupl...
Many sorts of structured data are commonly stored in a multi-relational format of interrelated tables. Under this relational model, exploratory data analysis can be done by using relational queries. As an example, in the Internet Movie Database (IMDb) a query can be used to check whether the average rank of action movies is higher than the average...
Most pattern discovery algorithms easily generate very large numbers of patterns, making the results impossible to understand and hard to use. Recently, the problem of instead selecting a small subset of informative patterns from a large collection of patterns has attracted a lot of interest. In this paper we present a succinct way of representin...
We show that a simple randomized algorithm has an expected constant factor approximation guarantee for fitting bucket orders to a set of pairwise preferences.
A method of complexity control in multinomial mixture modeling of multiple-marker genotype data, imposing the Hardy-Weinberg equilibrium (HWE) between the genotype values, is studied. This is a very natural restriction, and known to hold at population level under modest assumptions. The hypothesis under study is that imposing this restriction will...
An ever larger proportion of Earth's biota is affected by the current accelerating environmental change. The mismatches between organisms and their environments are now increasing in both magnitude and frequency, resulting in lowered fitness and hence the decline of populations. Under this scenario, species with behavioral and/or physiological trai...
There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict...
In Heikinheimo et al. (Journal of Biogeography, 2007, 34, 1053–1064) we used clustering to analyse European land mammal fauna. Gagné & Proulx criticized our choice of the Euclidean distance measure in the analysis, and advocated the use of the Hellinger distance measure, claiming that this leads to very different clustering results. The criticism f...
Matrix decomposition methods represent a data matrix as a product of two factor matrices: one containing basis vectors that represent meaningful concepts in the data, and another describing how the observed data can be expressed as combinations of the basis vectors. Decomposition methods have been studied extensively, but many methods return real-v...
Taxonomies for a set of features occur in many real-world domains. An example is provided by paleontology, where the task is to determine the age of a fossil site on the basis of the taxa that have been found in it. As the fossil record is very noisy and there are lots of gaps in it, the challenge is to consider taxa at a suitable level of aggregat...
Ordering and ranking items of different types (observations, web pages, etc.) are important tasks in various applications,
such as query processing and scientific data mining. We consider different problems of inferring total or partial orders from
data, with special emphasis on applications to the seriation problem in paleontology. Seriation can b...
Consider a 0-1 observation matrix M, where rows correspond to entities and columns correspond to signals; a value of 1 (or 0) in cell (i,j) of M indicates that signal j has been observed (or not observed) in entity i. Given such a matrix we study the problem of inferring the underlying directed links between entities (rows) and finding which entrie...
Event sequences where different types of events often occur close together arise, e.g., when studying potential transcription factor binding sites (TFBS, events) of certain transcription factors (TF, types) in a DNA sequence. These events tend to occur in bursts: in some genomic regions there are more genes and therefore potentially more binding si...
A 0-1 matrix has a banded structure if both rows and columns can be permuted so that the non-zero entries exhibit a staircase pattern of overlapping rows. The concept of banded matrices has its origins in numerical analysis, where entries can be viewed as descriptions between the problem variables; the bandedness corresponds to variables that are...
Sequence data are abundant in application areas such as computational biology, environmental sciences, and telecommunications.
Many real-life sequences have a strong segmental structure, with segments of different complexities. In this paper we study
the description of sequence segments using variable length Markov chains (VLMCs), also known as tre...
In recent years, there has been significant interest in development of ranking functions and efficient top-k retrieval algorithms to help users in ad-hoc search and retrieval in databases (e.g., buyers searching for products in a catalog). In this paper we focus on a novel and complementary problem: how to guide a seller in selecting the best attri...
Do large mammals evolve faster than small mammals or vice versa? Because the answer to this question contributes to our understanding of how life-history affects long-term and large-scale evolutionary patterns, and how microevolutionary rates scale-up to macroevolutionary rates, it has received much attention. A satisfactory or consistent answer to...
The discovery of recurring patterns in databases is one of the main topics in data mining and many efficient solutions have been developed for relatively simple classes of patterns and data collections. Indeed, most frequent pattern mining or association rule mining algorithms work on so called transaction databases. Not only for itemsets, but also...
Given a 0-1 dataset, we consider the redescription mining task introduced by Ramakrishnan, Parida, and Zaki. The problem is to find subsets of the rows that can be (approximately) defined by at least two different Boolean formulae on the attributes. That is, we search for pairs (α, β) of Boolean formulae such that the implications α → β and β →...
Randomization is an important technique for assessing the significance of data mining results. Given an input data set, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess...
We studied how well the European CEU samples used in the Haplotype Mapping Project (HapMap) represent five European populations by analyzing nuclear family samples from the Swedish, Finnish, Dutch, British and Australian (European ancestry) populations. The number of samples from each population (about 30 parent-offspring trios) was similar to that...
Current status of the IMIS project (Intelligent Management Information System) is described. The project aims at constructing an adaptive system to help in executive decision making. Major features of the current implementation concern support for discussing visualizations of business data and algorithms for data mining from time series data. Future plan...
The problem of assessing the significance of data mining results on high-dimensional 0-1 data sets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by, e.g., chi-square tests, or many other methods. However, the results of such tests depend only on the...
Partial rankings are totally ordered subsets of a set of items. For example, the sequence in which a user browses through
different parts of a website is a partial ranking. We consider the following problem. Given a set D of partial rankings, find items that have strongly different status in different parts of D. To do this, we first compute a clus...
Many sequential data sets have a segmental structure, and similar types of segments occur repeatedly. We consider sequences
where the underlying phenomenon of interest is governed by a small set of models that change over time. Potential examples
of such data are environmental, genomic, and economic sequences. Given a target sequence and a (possibl...
Introduction; A Hidden Markov Model for Recombinant Haplotypes; Learning the HMM from Unphased Genotype Data; Haplotype Reconstruction; Experimental Results; Discussion; Acknowledgments; References
The isochore structure of a genome is observable by variation in the G+C (guanine and cytosine) content within and between the chromosomes. Describing the isochore structure of vertebrate genomes is a challenging task, and many computational methods have been developed and applied to it. Here we apply a well-known least-squares optimal segmentation...
Aim To produce a spatial clustering of Europe on the basis of species occurrence data for the land mammal fauna.
Location Europe defined by the following boundaries: 11°W, 32°E, 71°N, 35°N.
Methods Presence/absence records of mammal species collected by the Societas Europaea Mammalogica with a resolution of 50 × 50 km were used in the analysis. Aft...
Significant pairs. Significant pairs according to the FL and FL(r) null models and C score in 10 Mbp regions from chromosome 1–10, with window length w = 300, minimum distance d = 20, and empirical p-value p ≤ 0.001. Each row contains the following 5 columns: chromosome, TF 1 (e.g., 45 corresponds to Jaspar matrix MA0045), TF 2, C score, 0 if the p...
Randomization results for MHC isochore segmentations. Randomization results for the MHC region in chromosome 6 are shown w.r.t. three alternative ground truth segmentations T from [14,20,23].
We consider the following problem: given a set of clusterings, find a single clustering that agrees as much as possible with the input clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the clustering aggregation problem; each categorical attribute ca...
There exist many segmentation techniques for genomic sequences, and the segmentations can also be based on many different biological features. We show how to evaluate and compare the quality of segmentations obtained by different techniques and alternative biological features.
We apply randomization techniques for evaluating the quality of a given...
Haplotype Reconstruction is the problem of resolving the hidden phase information in genotype data obtained from laboratory measurements. Solving this problem is an important intermediate step in gene association studies, which seek to uncover the genetic basis of complex diseases. We propose a novel approach for haplotype reconstruction based on c...
Sequence data are abundant in application areas such as computational biology, environmental sciences, and telecommunication. Many real-life sequences have a strong segmental structure, with segments of different complexities. In this paper we study the description of sequence segments using variable length Markov chains (VLMCs), also known as tree...
A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to provide answers to user queries. Generating a uniform random sample of this hidden database by using only the publicly available interface gives us access to the underlying data distribution. In this pap...
The discovery of subsets with special properties from binary data has been one of the key themes in pattern discovery. Pattern classes such as frequent itemsets stress the co-occurrence of the value 1 in the data. While this choice makes sense in the context of sparse binary data, it disregards potentially interesting subsets of attributes th...
Consider each row of a 0-1 dataset as the subset of the columns for which the row has a 1. Then a dataset is nested if, for all pairs of rows, one row is either a superset or a subset of the other. The concept of nestedness has its origins in ecology, where approximate versions of it have been used to model the species distribution in different loca...
Estimating the relative frequencies of linguistic features is a fundamental task in linguistic computation. As the amount
of text or speech that is available from a given user of the language typically varies greatly, and the sample sizes tend
to be small, the most straightforward methods do not always give the most informative answers. Bootstrap a...
Many 0/1 datasets have a very large number of variables; however, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datase...
Matrix decomposition methods represent a data matrix as a product of two smaller matrices: one containing basis vectors that
represent meaningful concepts in the data, and another describing how the observed data can be expressed as combinations of
the basis vectors. Decomposition methods have been studied extensively, but many methods return real-...