Theodoros Giannakopoulos

National Center for Scientific Research Demokritos | ncsr · Institute of Informatics and Telecommunications

PhD

About

112
Publications
51,730
Reads
1,913
Citations
Citations since 2016
46 Research Items
1518 Citations
Introduction
I am a Principal Researcher at the Computational Intelligence Lab at NCSR Demokritos, Greece, working in the field of multimodal machine learning. I particularly focus on speech analytics, music information retrieval, and multimodal video characterization.
Additional affiliations
June 2012 - present
National Center for Scientific Research Demokritos
Position
  • PostDoc Position
March 2012 - November 2015
Athena-Research and Innovation Center in Information, Communication and Knowledge Technologies
Position
  • PostDoc Position
Description
  • Post-doc research member
July 2009 - March 2011
National Center for Scientific Research Demokritos
Position
  • PostDoc Position
Description
  • Post-doc research member in the EU-funded research project Computer-Aided Semantic Annotation of Multimedia - CASAM.
Education
December 2004 - July 2009
National and Kapodistrian University of Athens
Field of study
  • Audio analysis, pattern recognition, multimedia analysis
October 2002 - September 2004
University of Patras
Field of study
  • Signal and Image Processing
September 1998 - September 2002

Publications

Publications (112)
Preprint
Full-text available
Machine learning methodologies can be adopted in cultural applications and propose new ways to distribute or even present the cultural content to the public. For instance, speech analytics can be adopted to automatically generate subtitles in theatrical plays, in order to (among other purposes) help people with hearing loss. Apart from a typical sp...
Chapter
Visual information contains the most important characteristics of a movie regarding the related content and filming techniques. Especially the way the camera moves to capture the scene is vital to define the director’s aesthetics. However, most of the machine learning tasks existing in the literature treat the movie as shallow content, rather than...
Preprint
Multimodal Language Analysis is a demanding area of research, since it is associated with two requirements: combining different modalities and capturing temporal information. During the last years, several works have been proposed in the area, mostly centered around supervised learning in downstream tasks. In this paper we propose extracting unsupe...
Article
Full-text available
This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods, based on the way they handle the inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do...
Article
Full-text available
The exponential growth of user-generated content has increased the need for efficient video summarization schemes. However, most approaches underestimate the power of aural features, while they are designed to work mainly on commercial/professional videos. In this work, we present an approach that uses both aural and visual features in order to cre...
Preprint
Full-text available
We examine the use of linear and non-linear dimensionality reduction algorithms for extracting low-rank feature representations for speech emotion recognition. Two feature sets are used, one based on low-level descriptors and their aggregations (IS10) and one modeling recurrence dynamics of speech (RQA), as well as their fusion. We report speech em...
Chapter
Full-text available
The need for low-cost health monitoring is increasing with the continuous increase of the elderly population. In this context, unobtrusive audiovisual monitoring methods can be of great importance. More particularly, the diameter of the pupil is a valuable source of information, since, apart from pathological cases, it can reveal the emotional stat...
Conference Paper
Full-text available
In this paper we present an approach for the recognition of human activity that combines handcrafted features from 3D skeletal data and contextual features learnt by a trained deep Convolutional Neural Network (CNN). Our approach is based on the idea that contextual features, i.e., features learnt in a similar problem are able to provide a diverse...
Conference Paper
Soundscape can be regarded as the auditory landscape, conceived individually or at collaborative level. This paper presents a method for automatic recognition of the soundscape quality of urban recordings. Towards this end, the ATHens Urban Soundscape has been used, which is a dataset of audio recordings of ambient urban sounds, annotated in terms...
Article
Full-text available
In this paper we present an approach towards real-time hand gesture recognition using the Kinect sensor, investigating several machine learning techniques. We propose a novel approach for feature extraction, using measurements on joints of the extracted skeletons. The proposed features extract angles and displacements of skeleton joints, as the lat...
Chapter
Full-text available
This paper proposes a method for recognizing audio events in urban environments that combines handcrafted audio features with a deep learning architectural scheme (Convolutional Neural Networks, CNNs), which has been trained to distinguish between different audio context classes. The core idea is to use the CNNs as a method to extract context-aware...
Chapter
As smart interconnected sensing devices are becoming increasingly ubiquitous, more applications are becoming possible by re-arranging and re-connecting sensing and sensor signal analysis in different pipelines. Naturally, this is best facilitated by extremely thin services that expose minimal functionality and are extremely flexible regarding the w...
Preprint
Full-text available
This paper proposes a method for recognizing audio events in urban environments that combines handcrafted audio features with a deep learning architectural scheme (Convolutional Neural Networks, CNNs), which has been trained to distinguish between different audio context classes. The core idea is to use the CNNs as a method to extract context-aware...
Conference Paper
A novel approach for feature learning using deep learning is presented. More specifically, a Convolutional Neural Network that is trained using feature correspondences learns to map a given image patch to a descriptor. Therefore, descriptors are directly learned from examples instead of being hand-crafted. The proposed approach is evaluated in a ch...
Conference Paper
In this paper we present an approach for speaker verification, based on the extraction of deep features. More specifically, we propose a scheme that is based on a convolutional neural network. For audio representation we opt for spectrograms, i.e., images that result from the spectral content of voices. Our network is trained to extract visual...
Article
Full-text available
Speech music discrimination is a traditional task in audio analytics, useful for a wide range of applications, such as automatic speech recognition and radio broadcast monitoring, that focuses on segmenting audio streams and classifying each segment as either speech or music. In this paper we investigate the capabilities of Convolutional Neural Ne...
Chapter
Due to the significant increase of tanker traffic from and to the Black Sea that pass through narrow straits formed by the 1600 Greek islands, the Aegean Sea is characterized by an extremely high marine environmental risk. Therefore it is vital to all socio-economic and environmental sectors to reduce the risk of a ship accident in that area. In th...
Conference Paper
Full-text available
Smart wearable devices have led to an increased need for processing and sharing large streams of physiological data in real time. Modern Human-Machine Interaction (HMI) systems, especially applications designed for user training and assessment (e.g., educational or smart-rehabilitation systems), should be able to track and monitor those signals an...
Preprint
Full-text available
Speech music discrimination is a traditional task in audio analytics, useful for a wide range of applications (automatic speech recognition, radio broadcast monitoring, etc.), that focuses on segmenting audio streams and classifying each segment as either speech or music. In this paper, we exploit the capabilities of convolutional neural networks (...
Article
Full-text available
In this paper we examine the ability of low-level multimodal features to extract movie similarity, in the context of a content-based movie recommendation approach. In particular, we demonstrate the extraction of multimodal representation models of movies, based on textual information from subtitles, as well as cues from the audio and visual channel...
Code
A novel approach for Speech-Music Discrimination using CNNs and raw spectrogram representations
Chapter
Full-text available
Emotion recognition plays an important role in several applications, such as human computer interaction and understanding affective state of users in certain tasks, e.g., within a learning process, monitoring of elderly, interactive entertainment etc. It may be based upon several modalities, e.g., by analyzing facial expressions and/or speech, usin...
Article
Visual attributes, from simple objects (e.g., backpacks, hats) to soft-biometrics (e.g., gender, height, clothing) have proven to be a powerful representational approach for many applications such as image description and human identification. In this paper, we introduce a novel method to combine the advantages of both multi-task and curriculum lea...
Article
Visual attributes, from simple objects (e.g., backpacks, hats) to soft-biometrics (e.g., gender, height, clothing) have proven to be a powerful representational approach for many applications such as image description and human identification. In this paper, we introduce a novel method to combine the advantages of both multi-task and curriculum lea...
Conference Paper
A framework that utilizes audio information for recognition of activities of daily living (ADLs) in the context of a health monitoring environment is presented in this chapter. We propose integrating a Raspberry PI single-board PC that is used both as an audio acquisition and analysis unit. So Raspberry PI captures audio samples from the attached m...
Conference Paper
We present a method that recognizes exercising activities performed by a single human in the context of a real home environment. Towards this end, we combine sensorial information stemming from a smartphone accelerometer, with visual information from a simple web camera. Low-level features inspired from the audio analysis domain are used to represe...
Article
Full-text available
Emotion recognition from speech may play a crucial role in many applications related to human–computer interaction or understanding the affective state of users in certain tasks, where other modalities such as video or physiological parameters are unavailable. In general, a human’s emotions may be recognized using several modalities such as analyzi...
Conference Paper
Full-text available
This paper presents a method for recognizing activities taking place in a home environment. Audio is recorded and analysed in real time, with all computation taking place on a low-cost Raspberry PI. In this way, data acquisition, low-level signal feature calculation, and low-level event extraction are performed without transferring any raw data out of t...
Article
Full-text available
In this paper we examine the existence of correlation between movie similarity and low level features from respective movie content. In particular, we demonstrate the extraction of multi-modal representation models of movies based on subtitles, audio and metadata mining. We emphasize our research in topic modeling of movies based on their subtitles...
Conference Paper
This paper proposes a deep learning classification method for frame-wise recognition of human activities, using raw color (RGB) information. In particular, we present a Convolutional Neural Network (CNN) classification approach for recognising three basic motion activity classes, that cover the vast majority of human activities in the context of a...
Chapter
As smart interconnected sensing devices are becoming increasingly ubiquitous, more applications are becoming possible by re-arranging and re-connecting sensing and sensor signal analysis in different pipelines. Naturally, this is best facilitated by extremely thin services that expose minimal functionality and are extremely flexible regarding the w...
Chapter
The purpose of this paper is to present a stochastic model able to predict the probability of ship collisions and groundings taking place in the Aegean Sea. The present work has been motivated by the significant rise in the traffic density of potentially hazardous vessels through the Aegean Sea waters that, in the case of an accident, may have high...
Conference Paper
Full-text available
In this paper, we predict a human's depression level in the BDI-II scale, using facial and voice features. Active orientation models (AOM) and several voice features were extracted from the video and audio modalities. Long-term and mid-term features were computed and a fusion is performed in the feature space. Videos from the Depression Recognition...
Conference Paper
Research on robot perception mostly focuses on visual information analytics. Audio-based perception is mostly based on speech-related information. However, non-verbal information of the audio channel can be equally important in the perception procedure, or at least play a complementary role. This paper presents a framework for audio signal analysis...
Conference Paper
Full-text available
In this paper we examine the existence of correlation between movie content similarity and low level textual features from respective subtitles. In addition, we demonstrate the extraction of topical representation of movies based on subtitles mining. Using natural language processing and a topic modeling algorithm, namely Latent Dirichlet Allocatio...
Conference Paper
Full-text available
In this paper, we present an architecture for recognizing events related to activities of daily living in the context of a health monitoring environment. The proposed approach explores the integration of a Raspberry PI single-board PC both as an audio acquisition and analysis unit. A set of real-time feature extraction and classification procedures...
Conference Paper
Demographic and epidemiologic transitions have brought forward a new health care paradigm with the presence of both growing elderly population and chronic diseases. Recent technological advances can support elderly people in their domestic environment assuming that several ethical and clinical requirements can be met. This paper presents an archite...
Article
This paper presents a probabilistic model predicting the risk of a possible ship accident occurrence in the Aegean Sea. Two types of accident scenarios (collision and grounding) have been studied using the Bayesian networks methodology. The model takes into account the static information of the vessel, namely the vessel type, size, age and flag. Th...
Conference Paper
Changing clothes is a basic activity of daily living (ADL) which may be used as a measurement of the functional status of e.g. an elderly person, or a person with certain disabilities. In this paper we propose a methodology for the detection of when a human has changed clothes. Our non-contact unobtrusive monitoring system is built upon the Microso...
Article
Full-text available
Audio information plays a rather important role in the increasing digital content that is available today, resulting in a need for methodologies that automatically analyze such content: audio event recognition for home automations and surveillance systems, speech recognition, music information retrieval, multimodal analysis (e.g. audio-visual analy...
Conference Paper
Full-text available
In this paper we present an approach for counting and tracking people participating in a meeting that takes place in a smart room. The sensing and processing modules are incorporated within the context of an IoT framework, that follows a message oriented architecture. The proposed algorithm consists of a motion detection module, a background subtra...
Article
Full-text available
The need for low-cost health monitoring is increasing with the continuous increase of the elderly population. In this context, unobtrusive audiovisual monitoring methods can be of great importance. More particularly, the diameter of the pupil is a valuable source of information, since, apart from pathological cases, it can reveal the emotional stat...
Conference Paper
Integrating robotic platforms in smart home environments can improve the monitoring quality of daily activities. In this study, we explore a scenario where a robot provides a service to the users, which in our case is delivering a cup of coffee. The users place their order via an application, which at the same time captures a short video from their...
Conference Paper
Full-text available
In this paper, we present an Adaptive Multimodal Dialogue System for Depressive and Anxiety Disorders Screening (DADS). The system interacts with the user through verbal and non-verbal communication to elicit the information needed to make referrals and recommendations for depressive and anxiety disorders while encouraging the user and keeping them...
Conference Paper
Full-text available
The Aegean Sea is characterized by an extremely high marine safety risk, mainly due to the significant increase of the traffic of tankers from and to the Black Sea that pass through narrow straits formed by the 1600 Greek islands. Reducing the risk of a ship accident is therefore vital to all socio-economic and environmental sectors. This paper pre...
Conference Paper
The Aegean Sea is an extremely sensitive marine area where a catastrophic event could occur at any time, owing both to the hazardous vessels crossing its waters and to the significant rise in traffic intensity. This paper aims to present a probabilistic Bayesian model predicting the probability of a collision, contact or grounding occurrence in the Ae...
Conference Paper
The huge growth of population size along with all the accompanying impacts, like traffic flow, commercial and industrial activities have led to a respective increase of noise pollution in the urban environments. In most cases, noise pollution in big cities is characterized by low-frequency and continuous background sounds. This ever-growing environ...
Conference Paper
Full-text available
In this paper we present SYNAISTHISI, i.e., a cloud-based platform, that provides the necessary infrastructure in order to interconnect heterogeneous devices and services over heterogeneous networks. SYNAISTHISI facilitates the orchestration of a collective functionality allowing several services to be managed through agents that dynamically alloca...
Conference Paper
Authors of scientific publications and books use images to present a wide spectrum of information. Despite the richness of the visual content of scientific publications the figures are usually not taken into consideration in the context of text mining methodologies towards the automatic indexing and retrieval of scientific corpora. In this work, we...
Conference Paper
Full-text available
Demographic and epidemiologic transitions in Europe have brought a new health care paradigm where life expectancy is increasing as well as the need for long-term care. To meet the resulting challenge, European healthcare systems need to take full advantage of new opportunities offered by technical advancements in ICT. The RADIO project explores a n...
Article
Full-text available
The Aegean Sea is characterized by an extremely high marine safety risk, mainly due to the significant increase of the traffic of tankers from and to the Black Sea that pass through narrow straits formed by the 1600 Greek islands. Reducing the risk of a ship accident is therefore vital to all socio-economic and environmental sectors. This paper pre...
Conference Paper
Full-text available
The purpose of the research described in this paper is to examine the existence of correlation between low level audio, visual and textual features and movie content similarity. In order to focus on a well defined and controlled case, we have built a small dataset of movie scenes from three sequel movies. In addition, manual annotations have led to...
Chapter
This chapter describes efficient visual analysis methods towards the unobtrusive monitoring of elderly people using common low-cost hardware. In particular, two real-time approaches are demonstrated for (a) Pupil Size Estimation and (b) Pulse Rate Estimation. In both cases, a main concern in the development has been to keep low the overall complexi...
Chapter
This appendix presents several third-party audio analysis libraries and methodologies, covering various programming languages (including MATLAB-based software). Furthermore, non-audio libraries and packages from the fields of pattern recognition, signal processing, etc., are presented. Although our primary focus is on MATLAB-based code, we also provide...
Chapter
This chapter focuses on a vital stage of audio analysis, the audio segmentation stage, which focuses on splitting an uninterrupted audio signal into segments of homogeneous content. The chapter describes two general categories of audio segmentation: those that employ supervised knowledge and those that are unsupervised or semi-supervised. In this p...
Chapter
This chapter presents methods towards the representation of audio signals in the frequency domain. Special emphasis has been placed on the description of the Discrete Fourier Transform because a lot of material in later chapters of this book assumes that the reader is familiar with this particular transform. Furthermore, the chapter aims at present...
Chapter
This chapter has an introductory purpose. A chapter outline is provided, along with general notes on the book’s exercises and the companion software. Before we proceed, it is important to note that, although in this book the term audio does not exclude the speech signal, we are not focusing on traditional speech-related problems that have been stud...
Chapter
The purpose of this chapter is to provide basic knowledge and techniques related to the creation, representation, playback, recording and storing of audio signals using MATLAB. In addition, short-term audio analysis is introduced here.
Chapter
This chapter focuses on presenting a wide range of audio features. Apart from the theoretical background of these features and respective MATLAB code, their discrimination ability is also demonstrated for particular audio types.
Chapter
This chapter gives a short description of the most important MATLAB functions implemented in the Audio Analysis Library which serves as a companion of this book.
Chapter
This chapter focuses on audio analysis methods that take into account the temporal evolution of the audio phenomena. This is done by preserving the short-term nature of the feature sequences, in order to either create methods that align two feature sequences or build temporal audio representations using Hidden Markov Models.
Chapter
This chapter describes the task of classifying unknown audio segments of “homogeneous content” to a set of predefined audio classes. In particular, theoretical background is provided regarding popular classification methods, including Support Vector Machines, Decision Trees and the k-Nearest-Neighbor method. The reader is also introduced to generic...
Chapter
This chapter provides descriptions and implementations of some basic Music Information Retrieval tasks, so that the reader can gain a deeper understanding of the field. In particular, we focus on the tasks of music thumbnailing, meter/tempo induction and music content visualization.
Chapter
This appendix provides a list of datasets which are available on the Web, that can be used as training and evaluation data for several audio analysis tasks.
Article
Text visualization is a rather important task related to scientific corpora, since it provides a way of representing these corpora in terms of content, leading to reinforcement of human cognition compared to abstract and unstructured text. In this paper, we focus on visualizing funding-specific scientific corpora in a supervised context and discove...
Article
Speaker diarization aims to automatically answer the question “who spoke when” given a speech signal. In this work, we have focused on applying the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, both in the audio and the video signals to form a complete multimodal speaker diarization system. Extensive experiments h...