Taras Kucherenko

Electronic Arts SEED

Doctor of Philosophy
Research Scientist at EA

About

40 Publications
7,729 Reads
462 Citations (since 2016)
[Chart: citations per year, 2016–2022]
Introduction
Taras Kucherenko works as a Research Scientist at Electronic Arts (EA). He recently completed his PhD at KTH Royal Institute of Technology in Stockholm. His research focuses on machine learning models for non-verbal behavior generation, such as hand gestures and facial expressions.
Additional affiliations
January 2017 - December 2021
KTH Royal Institute of Technology
Position
  • PhD Student
Description
  • My main research interest is making robots behave in a human-like way, including non-verbal behavior. I want to enable humanoid robots to produce body language in a data-driven way.
Education
October 2014 - September 2016
RWTH Aachen University
Field of study
  • Computer Science

Publications (40)
Conference Paper
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, cro...
Preprint
Full-text available
Embodied Conversational Agents that make use of co-speech gestures can enhance human-machine interactions in many ways. In recent years, data-driven gesture generation approaches for ECAs have attracted considerable research attention, and related methods have continuously improved. Real-time interaction is typically used when researchers evaluate...
Conference Paper
Embodied Conversational Agents (ECAs) that make use of co-speech gestures can enhance human-machine interactions in many ways. In recent years, data-driven gesture generation approaches for ECAs have attracted considerable research attention, and related methods have continuously improved. Real-time interaction is typically used when researchers ev...
Preprint
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, cro...
Thesis
Full-text available
A large part of our communication is non-verbal: humans use non-verbal behaviors to express various aspects of our state or intent. Embodied artificial agents, such as virtual avatars or robots, should also use non-verbal behavior for efficient and pleasant interaction. A core part of non-verbal communication is gesticulation: gestures communicate...
Conference Paper
Embodied agents benefit from using non-verbal behavior when communicating with humans. Despite several decades of non-verbal behavior-generation research, there is currently no well-developed benchmarking culture in the field. For example, most researchers do not compare their outcomes with previous work, and if they do, they often do so in their o...
Conference Paper
In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation para...
Conference Paper
While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is however expensive and little is known about the quality of the data it returns. Two appro...
Conference Paper
We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-q...
Preprint
While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is however expensive and little is known about the quality of the data it returns. Two approa...
Preprint
Full-text available
Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and...
Preprint
Full-text available
We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-q...
Conference Paper
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation – hand and arm movements accompanying speech – is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades: starting with rule-based and ending with mainly data-dri...
Conference Paper
Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards da...
Article
Full-text available
Non-invasive automatic screening for Alzheimer’s disease has the potential to improve diagnostic accuracy while lowering healthcare costs. Previous research has shown that patterns in speech, language, gaze, and drawing can help detect early signs of cognitive decline. In this paper, we describe a highly multimodal system for unobtrusively capturin...
Preprint
Full-text available
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation - hand and arm movements accompanying speech - is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades: starting with rule-based and ending with mainly data-dri...
Preprint
Full-text available
Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards da...
Article
Full-text available
This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures a...
Preprint
In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation para...
Preprint
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly con...
Conference Paper
During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems...
Conference Paper
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utiliz...
Conference Paper
Conducting user studies is a crucial component in many scientific fields. While some studies require participants to be physically present, other studies can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, not to mention tha...
Chapter
Automatic gesture generation is a field of growing interest, and a key technology for enabling embodied conversational agents. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study t...
Conference Paper
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly con...
Preprint
Conducting user studies is a crucial component in many scientific fields. While some studies require participants to be physically present, other studies can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, not to mention tha...
Preprint
Full-text available
This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures a...
Preprint
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utiliz...
Article
Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off‐line applications, novel tools can alter the role of an animator to that of a director, who provides only high‐level input for the desired animation; a learned network then translates these instructions into an appropria...
Preprint
During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current data-driven co-speech gesture generation systems use a single modality for representing speech: either audio or text. These system...
Conference Paper
Full-text available
In this paper, we present a user study on generated beat gestures for humanoid agents. It has been shown that Human-Robot Interaction can be improved by including communicative non-verbal behavior, such as arm gestures. Beat gestures are one of the four types of arm gestures, and are known to be used for emphasizing parts of speech. In our user stu...
Conference Paper
Full-text available
Non-verbal behavior is crucial for positive perception of humanoid robots. If modeled well, it can improve the interaction and leave the user with a positive experience; if modeled poorly, it may impede the interaction and become a source of distraction. Most of the existing work on modeling non-verbal behavior shows limited...
Conference Paper
This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input a...
Poster
Full-text available
This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features a...
Conference Paper
This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features a...
Preprint
Full-text available
This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input a...
Conference Paper
Social robots need non-verbal behavior to make an interaction pleasant and efficient. Most of the models for generating non-verbal behavior are rule-based and hence can produce a limited set of motions and are tuned to a particular scenario. In contrast, data-driven systems are flexible and easily adjustable. Hence we aim to learn a data-driven mod...
Preprint
Full-text available
Optical motion capture systems have become a widely used technology in various fields, such as augmented reality, robotics, movie production, etc. Such systems use a large number of cameras to triangulate the position of optical markers. The marker positions are estimated with high accuracy. However, especially when tracking articulated bodies, a f...
Article
Full-text available
This paper presents the EACare project, an ambitious multi-disciplinary collaboration with the aim to develop an embodied system, capable of carrying out neuropsychological tests to detect early signs of dementia, e.g., due to Alzheimer's disease. The system will use methods from Machine Learning and Social Robotics, and be trained with examples of...

Questions (4)
Question
We know that body language in general, and gesticulation in particular, is culturally dependent. This is especially clear across different languages, since co-speech gestures are part of the language.
But do we know how much gesticulation differs between cultures that use the same language, such as the USA and the UK?
Question
I am looking for a good dataset containing many representational gestures in a limited domain (in terms of text), recorded during an interaction. Having only videos would be enough.
Ideally, the dataset would be in English, but German or Spanish would be fine as well.
Here are some examples of such datasets:
1. The Bielefeld Speech and Gesture Alignment corpus (SaGA), 2010
2. "Verbal or visual: How information is distributed across speech and gesture in spatial dialog", 2006
3. Natural Media Motion-Capture Corpus (NM-MoCap-Corpus), 2014
Are there some more datasets I am not aware of yet?
Question
I have a neural network (NN) implemented in Keras and I would like to make it more complex. The things I want to add go beyond what the Keras interface offers (for example, I want to add an auto-regressive connection).
I wonder: can I make my NN more complex without re-implementing it in TensorFlow? Is there a convenient way to do it, something like creating a custom layer with a custom operation?
Thank you for your time!
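One possible direction (a minimal sketch, assuming TensorFlow 2's Keras API; the `AutoregressiveDense` name and design are hypothetical, not from any answer here): Keras lets you subclass `tf.keras.layers.Layer` and write arbitrary TensorFlow operations in `call`, including feeding the previous time step's output back in as extra input — so an auto-regressive connection does not require abandoning Keras.

```python
import tensorflow as tf

class AutoregressiveDense(tf.keras.layers.Layer):
    """Hypothetical example layer: a dense cell that receives its own
    previous output concatenated to the current input at every time step."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        # Weights cover current features plus the fed-back previous output.
        self.w = self.add_weight(
            shape=(input_shape[-1] + self.units, self.units),
            initializer="glorot_uniform",
        )
        self.b = self.add_weight(shape=(self.units,), initializer="zeros")

    def call(self, inputs):
        # inputs: (batch, time, features) with a static time dimension.
        batch = tf.shape(inputs)[0]
        prev = tf.zeros((batch, self.units))  # initial "previous output"
        outputs = []
        for t in range(inputs.shape[1]):
            x = tf.concat([inputs[:, t, :], prev], axis=-1)
            prev = tf.tanh(tf.matmul(x, self.w) + self.b)
            outputs.append(prev)
        return tf.stack(outputs, axis=1)  # (batch, time, units)

# Usage: the layer composes with the rest of a Keras model as usual.
out = AutoregressiveDense(8)(tf.zeros((2, 5, 3)))  # shape (2, 5, 8)
```

For variable-length sequences, the same idea is usually expressed by subclassing a cell and wrapping it in `tf.keras.layers.RNN`, which handles the time loop for you.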
Question
I am going to build a system that produces non-verbal behavior in a data-driven way, based on the speech signal and its transcription.
I need to decide which speech features to use, so that I can train on human recordings and then deploy on the humanoid robot NAO.
The main problem is that the robot's speech will not have the variability of natural speech, as it is produced by a text-to-speech system. So I need to be careful not to learn something that works only for human speech and will not work on my robot.

Projects (2)
Project
In this project, we want to enable humanoid robots, such as NAO, to accompany their speech with upper-body gestures in a natural way. While most systems in HRI are rule-based, we are going to use a data-driven approach. Our system will be trained on human examples and will not require expert knowledge of how gestures should be generated.
Project
To develop a deep-learning-based system for reconstructing missing motion-capture data, whether randomly missing markers, a missing limb, or noisy sensor data.