
Timo Honkela, PhD
University of Helsinki | HY · Department of Modern Languages
154 publications
28,956 reads
2,922 citations
Publications (154)
This paper outlines a pilot study on multi-dimensional and multilingual sentiment analysis of social media content. We use parallel corpora of movie subtitles as a proxy for colloquial language in social media channels and a multilingual emotion lexicon for fine-grained sentiment analyses. Parallel data sets make it possible to study the preservati...
The poster for CoLing 2016: Challenges in Multidimensional Sentiment Analysis Across Languages
In this article, automatically generated and manually crafted semantic representations are compared. The comparison takes place under the assumption that neither of these has a primary status over the other. While linguistic resources can be used to evaluate the results of automated processes, data-driven methods are useful in assessing the quality...
We present a novel Bayesian reinforcement learning algorithm that addresses model bias and exploration overhead issues. The algorithm combines different aspects of several state-of-the-art reinforcement learning methods that use Gaussian Processes model-based approaches to increase the use of the online data samples. The algorithm uses a smooth rew...
Surveys are widely conducted as a means to obtain information on thoughts, opinions and feelings of people. The representativeness of a sample is a major concern in using surveys. In this article, we consider meaning variation which is another potentially remarkable but less studied source of problems. We use Grounded Intersubjective Concept Analys...
In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the outcome of the OCR process is limited, especially with regard to the oldest parts of the collection. Approaches such as crowdsourcing have been used in...
Sentiment analysis has become a widely used approach to assess the emotional content of written documents such as customer feedback. In positive psychology research, the typical one-dimensional analysis framework has been extended to include five dimensions. This five-dimensional model, PERMA, enables a fine-grained analysis of written texts. We pr...
Wikipedia Animal Dataset is a dataset created during December 2010 and January 2011 with data retrieved from Wikipedia. It is available for research purposes.
Statistics
-----------
This dataset is made up of 498 unique URLs corresponding to articles about animals. For each animal the article was collected in English, Finnish and Spanish, fulfilli...
Emotional semantic image retrieval systems aim at incorporating the user’s affective states for responding adequately to the user’s interests. One challenge is to select features specific to image affect detection. Another challenge is to build effective learning models or classifiers to bridge the so-called “affective gap”. In this work, we study...
In this article, we consider how semantics of action verbs can be grounded on motion tracking data. We present the basic principles and requirements for grounding of verbs through case studies related to human movement. The data includes high-dimensional movement patterns and linguistic expressions that people have used to name these movements. We...
Statistical machine learning methods can provide help when developing preventative services and tools that support the empowerment of individuals. We explore how the self-organizing map could be utilized as a tool for analyzing, visualizing and browsing heterogeneous survey data on wellbeing that contains both quantitative (numeric) and qualitative...
An ideal verbally controlled virtual actor would allow the same interaction as instructing a real actor with a few words. Our goal is to create virtual actors that can be controlled with natural language instead of a predefined set of commands. In this paper, we present results related to a questionnaire where people described videos of human locom...
We present an approach for comparing human-made and automatically generated semantic representations with an assumption that neither of these has a primary status over the other. In the experimental part, we compare the results gained by using independent component analysis and the self-organizing map algorithm on word context analysis with a seman...
On the web, a huge variety of text collections contain knowledge in different expertise domains, such as technology or medicine. The texts are written for different uses and thus for people having different levels of expertise on the domain. Texts intended for professionals may not be understandable at all by a lay person, and texts for lay people...
Mobile proximity information provides a rich and detailed view into the social interactions of mobile phone users, allowing novel empirical studies of human behavior and context-aware applications. In this study, we apply a statistical anomaly detection method based on multivariate binomial mixture models to mobile proximity data from 106 users. Th...
In this article, we explore an application in an area of research called wellbeing informatics. More specifically, we consider how to build a system that could be used for searching stories that relate to the interest of the user (content relevance), and help the user in his or her developmental process by providing encouragement, useful experience...
We propose a probabilistic model class for the analysis of three-way count data, motivated by studying the subjectivity of language. Our models are applicable for instance to a data tensor of how many times each subject used each term in each context, thus revealing individual variation in natural language use. As our main goal is exploratory ana...
It is generally accepted that there are cross-linguistic universal tendencies in the naming of colours. This is due in large part to the findings of Berlin and Kay. Recently, however, these universalist findings have been challenged, on both methodological and substantive grounds. Nisbett’s research on cultural cognition offers another interesting...
A substantial amount of subjectivity is involved in how people use language and conceptualize the world. Computational methods and formal representations of knowledge usually neglect this kind of individual variation. We have developed a novel method, Grounded Intersubjective Concept Analysis (GICA), for the analysis and visualization of individual...
Speech-to-speech machine translation is in some ways the peak of natural language processing, in that it deals directly with our original, oral mode of communication (as opposed to derived written language). As such, it presents challenges that are not to be taken lightly. Although existing technology covers each of the steps in the process, from...
We present a methodology for learning a taxonomy from a set of text documents that each describes one concept. The taxonomy is obtained by clustering the concept definition documents with a hierarchical approach to the Self-Organizing Map. In this study, we compare three different feature extraction approaches with varying degree of language indepe...
In this paper, we study fundamental properties of the Self-Organizing Map (SOM) and the Generative Topographic Mapping (GTM), ramifications of the initialization of the algorithms and properties of the algorithms in the presence of missing data. We show that the commonly used principal component analysis (PCA) initialization of the GTM does not guarante...
In this article, we introduce the concept of pathways of wellbeing and examine how such paths can be discovered from large data sets using the self-organizing map. Data sets used in the illustrative experiments include measurements of physical fitness and subjective assessments related to diagnosing work stress.
In document clustering, semantically similar documents are grouped together. The dimensionality of document collections is often very large, thousands or tens of thousands of terms. Thus, it is common to reduce the original dimensionality before clustering for computational reasons. Cosine distance is widely seen as the best choice for measuring th...
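The cosine distance mentioned in the abstract above can be illustrated with a minimal sketch. It is not taken from the paper: the function name `cosine_distance` and the toy term-count vectors are assumptions made for illustration only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle) between two equal-length term vectors.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy term-count vectors (assumed data): doc_b is doc_a scaled by two,
# doc_c shares no terms with doc_a.
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]
doc_c = [0, 0, 3]

same_direction = cosine_distance(doc_a, doc_b)  # close to 0: same term proportions
no_overlap = cosine_distance(doc_a, doc_c)      # 1: no shared terms
```

Because the measure depends only on the direction of the vectors, a document and a longer document with the same term proportions are treated as nearly identical.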
In this work, we study people’s emotions evoked by viewing abstract art images based on traditional low-level image features within a binary classification framework. Abstract art is used here instead of artistic or photographic images because those contain contextual information that influences the emotional assessment in a highly individual manne...
In this article, we present an analysis of the impact of nutrition and lifestyle on health at a global level. We have used the Self-Organizing Map (SOM) algorithm as the analysis technique. The SOM enables us to visualize the relative position of each country against a set of variables related to nutrition, lifestyle and health. The positioning of the...
We present a selection of results produced in a project called Media Map. The project aims at developing an intuitive user interface to a library information system containing data on projects and publications. The user interface is a two-dimensional visual display created with the Self-Organizing Map algorithm. The map has been computed using the...
In this article, we introduce a method to make visible the differences among people regarding how they conceptualize the world. The Grounded Intersubjective Concept Analysis (GICA) method first employs a conceptual survey designed to elicit particular ways in which concepts are used among participants, aiming to exclude the level of opinions and va...
In this review and tutorial article, new developments towards extended use of information and communications technologies in science are discussed. The focus is in human and social sciences, specifically in linguistics and economics. Some challenging epistemological issues are handled in detail including the subjective and intersubjective nature of...
We study the combination of symbol frequency analysis and negative selection for anomaly detection of discrete sequences where conventional negative selection algorithms are not practical due to data sparsity. Theoretical analysis on ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict...
In this paper, we explore the possibility of applying a text mining method on a large qualitative source material concerning the history of information technology in one nation. This data was collected in the Swedish documentation project “From Computing Machines to IT.” We apply text mining on the interview transcripts of this Swedish documentatio...
This paper presents a methodology for learning taxonomic relations from a set of documents that each explain one of the concepts. Three different feature extraction approaches with varying degree of language independence are compared in this study. The first feature extraction scheme is a language-independent approach based on statistical keyphrase...
In this paper, we consider how to represent world knowledge using the self-organizing map (SOM), how to use a simple recurrent network (SRN) to devise sentence comprehension, and how to use the SOM output space to represent situations and facilitate grounded logical reasoning.
In this article, we study the scale-dependent dimensionality properties and overall structure of text data with a method that measures correlation dimension in different scales. As experimental results, we present the analysis of text data sets with the Reuters and Europarl corpora, which are also compared to artificially generated point sets. A co...
In this article, we model adjectives using a vector space model. We further employ three different dimension reduction methods, the Principal Component Analysis (PCA), the Self-Organizing Map (SOM), and the Neighbor Retrieval Visualizer (NeRV), in the projection and visualization task, using an antonym test for evaluation. The results show tha...
Our aim is to find syntactic and semantic relationships and roles of words based on the analysis of corpora. We study three methods for analyzing words in contexts as potential methods for solving this task. The methods are latent semantic analysis, self-organizing map and independent component analysis. Latent semantic analysis is a simple method...
In this paper, we propose a tensor-based Maximum Margin Criterion (TMMC) algorithm for supervised dimensionality reduction. In TMMC, an image object is encoded as an nth-order tensor, and its 2-D representation is directly treated as a matrix. Meanwhile, ...
The article provides an introduction to and a demonstration of the self-organizing map (SOM) method for organizational researchers interested in the use of qualitative data. The SOM is a versatile quantitative method very commonly used across many disciplines to analyze large data sets. The outcome of the SOM analysis is a map in which entities are...
The self-organizing map (SOM) is related to the classical vector quantization (VQ). Like in the VQ, the SOM represents a distribution of input data vectors using a finite set of models. In both methods, the quantization error (QE) of an input vector can be expressed, e.g., as the Euclidean norm of the difference of the input vector and the best-mat...
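The quantization error described in the abstract above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the truncated sentence refers to the model vector closest to the input, and the function name `best_matching_unit` and the toy model vectors are hypothetical.

```python
import math


def best_matching_unit(x, models):
    """Return (index, quantization_error) for input vector x.

    The quantization error is taken here to be the Euclidean norm of the
    difference between x and its closest model vector (an assumption
    about the truncated abstract).
    """
    best_index, best_error = -1, math.inf
    for i, m in enumerate(models):
        err = math.dist(x, m)  # Euclidean distance, Python 3.8+
        if err < best_error:
            best_index, best_error = i, err
    return best_index, best_error


# Three toy model vectors standing in for a trained SOM or VQ codebook.
models = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
index, qe = best_matching_unit((3.0, 4.0), models)
# The distances are 5, sqrt(13) and sqrt(17), so the second model is selected.
```

The same search applies to both methods: SOM and VQ differ in how the model vectors are trained, not in how the closest model is found for a given input.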
In this paper, we discuss problems related to the basic Semantic Web methodologies that are based on predicate logic and related formalisms. We discuss complementary and alternative approaches. In particular, we suggest how the Self-Organizing Map can be a basis for making the Semantic Web more semantic.
The complex phenomena of political science are typically studied using a qualitative approach, potentially supported by hypothesis-driven statistical analysis of some numerical data. In this article, we present a complementary method based on data mining and specifically on the use of the self-organizing map. The idea in data mining is to explore th...
In this article, we consider contemporary theories of concepts, and Bayesian and self-organizing models of concept formation. After introducing the different models, we present our own experiment. It utilizes a multi-agent simulation framework, in which the emergence of a common vocabulary can be studied. In the experiment, we use jointly the self...
In time series prediction, one often does not know the properties of the underlying system generating the time series. For example, is it a closed system that is generating the time series or are there any external factors influencing the system? As a result, one often does not know beforehand whether a time series is stationary or nonstation...
The purpose of the present article is to examine the implications of the pragmatic web for the research and development of educational technology. It is argued that, beyond knowledge acquisition and social participation, technology-mediated learning environments based on a semantic and pragmatic web have the potential for facilitating creation and...
Finding ways in which communities of experts can benefit from each other is a question shared by the machine learning community and social sciences alike. Considerable research in machine learning methods has shown that communities of experts can provide consistently better classifications and decisions than single experts in various tasks and doma...
Latent semantic analysis (LSA) can be used to create an implicit semantic vectorial representation for words. Independent component analysis (ICA) can be derived as an extension to LSA that rotates the latent semantic space so that it becomes explicit, that is, the features correspond more with those resulting from human cognitive activity. Thi...
This paper presents a method for creating interlingual word-to-word or phrase-to-phrase mappings between any two languages using the self-organizing map algorithm. The method can be used as a component in a statistical machine translation system. The conceptual space created by the self-organizing map serves as a kind of interlingual representation...
We propose a theoretical framework for modeling communication between agents that have different conceptual models of their current context. We describe how the emergence of subjective models of the world can be simulated and what the role of language and communication in that process is. We consider, in particular, the role of unsupervised learnin...
We propose a method for inferring semantic information from textual data in content-based multimedia retrieval. Training examples of images and videos belonging to a specific semantic class are associated with their low-level visual and aural descriptors augmented with textual features such as frequencies of significant words. A fuzzy mapping of a...
Biological systems have been an inspiration in the development of prototype-based clustering and vector quantization algorithms. The two dominant paradigms in biologically motivated clustering schemes are neural networks and, more recently, biological immune systems. These two biological paradigms are discussed regarding their benefits and shortcom...
In this article, we study the differences between the European Union languages using statistical and unsupervised methods. The analysis is conducted at different levels of language: the lexical, morphological and syntactic. Our premise is that the difficulty of the translation could be perceived as differences or similarities in differen...
We present Likey, a language-independent keyphrase extraction method based on statistical analysis and the use of a reference corpus. Likey has a very light-weight preprocessing phase and no parameters to be tuned. Thus, it is not restricted to any single language or language family. We test Likey having exactly the same configuration with...
In this article we approach neural networks as computational templates that travel across various sciences. Traditionally, it has been thought that models are primarily models of some target systems: they are assumed to represent partially or completely their target systems. We argue, instead, that many computational models cannot easily be conceiv...
Serious efforts to develop computerized systems for natural language understanding and machine translation have taken place for more than half a century. Some successful systems that translate texts in limited domains such as weather forecasts have been implemented. However, the more general the domain or complex the style of the text the more diff...
We present a probabilistic approach for detecting and analyzing changes in natural language motivated by biological immune systems. Contrary to traditional methods based on message-digest algorithms and line-by-line comparisons of two files, the proposed algorithm employs an implicit negative representation of text segments in the form of detector...
We show that independent component analysis (ICA) can be used to find distributed representations for words that can be further processed by thresholding to produce sparse representations. The applicability of the thresholded ICA representation is compared to singular value decomposition (SVD) in a multiple choice vocabulary task with three data se...
We present the results of an analysis of a text corpus of 129,000 abstracts of NSF-sponsored basic research projects between years 1990 and 2003. The methods used in the analysis include term extraction based on a reference corpus and an entropy measure, and the Self-Organizing Map algorithm for the formation of a term map and a document map. Metho...
A symbol as such is disassociated from the world. In addition, as a discrete entity a symbol does not mirror all the details of the portion of the world that it is meant to refer to. Humans establish the association between the symbols and the referenced domain – the words and the world – through a long learning process in a community. This paper s...
We propose a method of content-based multimedia retrieval of objects with visual, aural and textual properties. In our method, training examples of objects belonging to a specific semantic class are associated with their low-level visual descriptors (such as MPEG-7) and textual features such as frequencies of significant keywords. A fuzzy mappi...
An art installation was on display in the Centre Pompidou National Museum of Modern Art in Paris, where visitors could contribute with their own personal objects, adding keyword descriptions and quantified semantic features such as age or hardness. The data was projected in real-time onto a Self-Organizing Map (SOM) which was shown in the gallery....
A vital mechanism of high‐level natural cognitive systems is the anticipatory capability of making decisions based on predicted events in the future. While in some cases the performance of computational cognitive systems can be improved by modeling anticipatory behavior, it has been shown that for many cognitive tasks anticipation is mandatory. In...
Quality of Internet health information is essential because it has the potential to benefit or harm a large number of people and it is therefore essential to provide consumers with some tools to aid them in assessing the nature of the information they are accessing and how they should use it without jeopardizing their relationship with their doctor...
In this article, we study the emergence of associations between words and concepts using the self-organizing map. In particular, we explore the meaning negotiations among communicating agents. The self-organizing map is used as a model of an agent's conceptual memory. The concepts are not explicitly given but they are learned by the agent in an uns...
In this article, we study the differences between the European languages using statistical and unsupervised methods. The analysis is conducted at different levels of language: lexical, morphological and syntactic. Our premise is that the difficulty of the translation could be perceived as differences or similarities in different levels of...
In this position paper, we discuss some problems related to those semantic web methodologies that are straightforwardly based on predicate logic and related formalisms. We also discuss complementary and alternative approaches and provide some examples of such.