Konstantin Markov
The University of Aizu · Department of Computer and Information Systems

PhD

About

84 Publications
24,480 Reads
978 Citations
Citations since 2017: 410 (from 9 research items)
[Chart: citations per year, 2017–2023]
Additional affiliations
April 2007 - March 2009
National Institute of Information and Communications Technology
Position
  • Senior Researcher
April 2000 - March 2006
Advanced Telecommunications Research Institute
Position
  • Senior Researcher
Education
April 1996 - March 1999
Toyohashi University of Technology
Field of study
  • Information Technology

Publications (84)
Article
Full-text available
Despite the progress of deep neural networks over the last decade, the state-of-the-art speech recognizers in noisy environment conditions are still far from reaching satisfactory performance. Methods to improve noise robustness usually include adding components to the recognition system that often need optimization. For this reason, data augmentat...
Chapter
Full-text available
Deep Learning image processing methods are gradually gaining popularity in a number of areas including medical imaging. Classification, segmentation, and denoising of images are some of the most demanded tasks. In this study, we aim at enhancing optic nerve head images obtained by Optical Coherence Tomography (OCT). However, instead of directly app...
Article
The use of a psychoacoustic roughness model as a predictor of creaky voice is reported. We found that the roughness temporal profile of vocalic segments can predict the presence of creakiness in speech. Using a simple bi-directional Recurrent Neural Network (RNN), we were able to predict the presence of creakiness in vocalic segments from only roug...
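As a hedged illustration of the kind of classifier the abstract mentions, the sketch below feeds a per-frame roughness profile into a small bi-directional RNN with a binary output; the layer sizes, frame counts, and synthetic input are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch: a bi-directional RNN that maps a per-frame roughness
# profile of a vocalic segment to a creaky / non-creaky decision.
# Shapes and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class CreakinessClassifier(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        # One roughness value per frame -> input_size=1
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, 1)  # binary logit

    def forward(self, roughness):            # (batch, frames, 1)
        _, h = self.rnn(roughness)           # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)  # concatenate both directions
        return self.out(h).squeeze(-1)       # (batch,) logits

model = CreakinessClassifier()
segments = torch.rand(8, 120, 1)             # 8 segments, 120 frames each
probs = torch.sigmoid(model(segments))       # probability of creakiness
```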
Chapter
Full-text available
In this study, we investigate the problem of apparent personality recognition using a person’s voice, or more precisely, the way he or she speaks. Based on the style transfer idea in deep neural net image processing, we developed a system capable of speaking style extraction from recorded speech utterances, which then uses this information to estimat...
Conference Paper
Full-text available
It has been shown that by combining the acoustic and articulatory information, significant performance improvements in the automatic speech recognition (ASR) task can be achieved. In practice, however, articulatory information is not available during recognition and the general approach is to estimate it from the acoustic signal. In this paper, we prop...
Article
Full-text available
Automatic emotion recognition from speech has been focused mainly on identifying categorical or static affect states, but the spectrum of human emotion is continuous and time-varying. In this paper, we present a recognition system for dynamic speech emotion based on state-space models (SSMs). The prediction of the unknown emotion trajectory in the...
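A minimal sketch of state-space filtering over an emotion trajectory is given below, using a standard linear Kalman filter; the matrices, dimensions, and random "acoustic features" are placeholders, not the trained SSM parameters reported in the paper.

```python
# Sketch of linear state-space filtering for a 2-D emotion trajectory
# (e.g. arousal/valence) driven by acoustic observations. All matrices
# here are illustrative placeholders, not trained model parameters.
import numpy as np

def kalman_filter(y, A, C, Q, R, x0, P0):
    """Return filtered state means for observations y[t] (T x dim_y)."""
    x, P = x0, P0
    means = []
    for yt in y:
        # predict step
        x = A @ x
        P = A @ P @ A.T + Q
        # update step
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        x = x + K @ (yt - C @ x)
        P = (np.eye(len(x)) - K @ C) @ P
        means.append(x.copy())
    return np.array(means)

T, dim_x, dim_y = 100, 2, 6          # 2-D emotion state, 6-D features
A = np.eye(dim_x)                    # slowly varying emotion state
C = np.random.randn(dim_y, dim_x)    # feature loading (placeholder)
Q, R = 0.01 * np.eye(dim_x), 0.1 * np.eye(dim_y)
y = np.random.randn(T, dim_y)        # stand-in for acoustic features
traj = kalman_filter(y, A, C, Q, R, np.zeros(dim_x), np.eye(dim_x))
print(traj.shape)                    # (100, 2) arousal/valence estimates
```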
Conference Paper
Automatic emotion recognition from speech has been focused mainly on identifying categorical or static affect states, but the spectrum of human emotion is continuous and time-varying. In this paper, we present a recognition system for dynamic speech emotion based on state-space models (SSMs). The prediction of the unknown emotion trajectory in the...
Chapter
Full-text available
Gaussian Processes (GPs) are Bayesian nonparametric models that are becoming more and more popular for their superior capabilities to capture highly nonlinear data relationships in various tasks ranging from classical regression and classification to dimension reduction, novelty detection and time series analysis. Here, we introduce Gaussian proces...
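The toy example below shows the basic GP regression workflow the chapter builds on, using scikit-learn; the data, kernel, and hyper-parameters are arbitrary illustrations rather than anything from the chapter itself.

```python
# Toy Gaussian Process regression, illustrating the nonparametric,
# kernel-based modeling referred to above. Data and kernel settings
# are arbitrary illustrations.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

X_test = np.linspace(0, 10, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)  # posterior mean and uncertainty
```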
Article
Full-text available
In the paper, the selection of the best phoneme set for Russian automatic speech recognition is described. For the acoustic modeling, we describe a method based on a combination of knowledge-based and statistical approaches to create several different phoneme sets. Applying this method to the Russian phonetic set of the IPA (International Phonetic Alphabet)...
Conference Paper
Full-text available
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using valence and arousal model, th...
Conference Paper
Full-text available
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using valence and arousal model,...
Article
Full-text available
Speech is the most natural way of human communication and in order to achieve convenient and efficient human–computer interaction implementation of state-of-the-art spoken language technology is necessary. Research in this area has been traditionally focused on several main languages, such as English, French, Spanish, Chinese or Japanese, but some...
Article
Full-text available
Gaussian Processes (GPs) are Bayesian nonparametric models that are becoming more and more popular for their superior capabilities to capture highly nonlinear data relationships in various tasks, such as dimensionality reduction, time series analysis, novelty detection, as well as classical regression and classification tasks. In this paper, we inv...
Article
This paper describes the temporal music emotion recognition system developed at the University of Aizu for the Emotion in Music task of the MediaEval 2014 benchmark evaluation campaign. The arousal-valence trajectory prediction is cast as a time series filtering task and is modeled by state-space models. These models include standard linear...
Article
Full-text available
Availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning algorithm where unlabeled data can be different, but nevertheless have s...
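A schematic version of the self-taught pipeline is sketched below: a representation is learned from unlabeled data drawn from a different distribution and then reused for the supervised task. PCA stands in for the sparse-coding feature learner usually employed; all data are synthetic.

```python
# Self-taught-learning style pipeline sketch: learn a representation from
# unlabeled data of a different distribution, then reuse it for a
# supervised task. PCA is a simple stand-in for sparse coding.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.standard_normal((1000, 50))      # unrelated raw data
X_labeled = rng.standard_normal((100, 50))
y_labeled = rng.integers(0, 2, size=100)

feature_learner = PCA(n_components=10).fit(X_unlabeled)  # basis learned on unlabeled data
Z = feature_learner.transform(X_labeled)                  # re-represent labeled data
clf = LogisticRegression().fit(Z, y_labeled)              # supervised training on new features
```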
Conference Paper
Full-text available
The Russian language is characterized by very flexible word order, which limits the ability of the standard n-grams to capture important regularities in the data. Moreover, Russian is a highly inflectional language with rich morphology, which leads to high out-of-vocabulary word rates. Recently, the factored language model (FLM) was proposed with the aim...
Conference Paper
Full-text available
In this paper we introduce Gaussian Process (GP) models for music genre classification. Gaussian Processes are widely used for various regression and classification tasks, but there are relatively few studies where GPs are applied in audio signal processing systems. The GP models are non-parametric discriminative classifiers similar to the well...
Conference Paper
The Russian language is characterized by very flexible word order, which limits the ability of the standard n-grams to capture important regularities in the data. Moreover, it is a highly inflectional language with rich morphology, which leads to high out-of-vocabulary (OOV) word rates. In this paper, we present a comparison of two advanced language mo...
Conference Paper
Full-text available
Availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning approach where unlabeled data can be different, but nevertheless have s...
Article
Full-text available
In this paper, we present a review of the latest developments in Russian speech recognition research. Although the underlying speech technology is mostly language-independent, differences between languages with respect to their structure and grammar have a substantial effect on the recognition systems' performance. The Russian language has a compl...
Conference Paper
Full-text available
Availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning approach where unlabeled data can be different, but nevertheless have si...
Conference Paper
Full-text available
In this paper, we describe a method for phoneme set selection based on a combination of phonological and statistical information and its application to Russian speech recognition. For the Russian language, currently used phoneme sets are mostly rule-based or heuristically derived from the standard SAMPA or IPA phonetic alphabets. However, for some other...
Conference Paper
In this paper, we present a review of the latest developments in Russian speech recognition research. Although the underlying speech technology is mostly language-independent, differences between languages with respect to their structure and grammar have a substantial effect on the recognition systems' performance. The Russian language has a complicat...
Conference Paper
Full-text available
Speaker diarization is the process of annotating an audio document with information about the speaker identity of speech segments along with their start and end times. Assuming that the audio input consists of speech only or that non-speech segments have already been identified by another method, the task of speaker diarization is to find "who spoke wh...
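The toy sketch below turns segment-level features into a "who spoke when" annotation by clustering; the synthetic features and the choice of agglomerative clustering are illustrative stand-ins for the GMM-based modeling used in the actual systems.

```python
# Illustrative "who spoke when" sketch: segment-level features are grouped
# into speaker clusters and turned into (start, end, speaker) annotations.
# Features are synthetic; real systems use MFCCs with GMM/BIC or
# neural embeddings instead.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# 20 one-second segments from two hypothetical speakers
feats = np.vstack([rng.normal(0, 1, (10, 13)), rng.normal(3, 1, (10, 13))])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(feats)

annotation = [(t, t + 1.0, f"spk{lab}") for t, lab in enumerate(labels)]
print(annotation[:3])   # [(0, 1.0, 'spk0'), ...]
```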
Conference Paper
Full-text available
Segmentation of multi-speaker meeting audio data recorded with several microphones into speech/silence frames is one of the first tasks in the development of a speaker diarization system. Energy normalization techniques and signal correlation methods are used in order to avoid the crosstalk problem, in which a participant's speech appears on other part...
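A bare-bones version of the speech/silence step is sketched below using normalized short-time energy only; the frame length and threshold are assumptions, and the cross-channel correlation handling described in the abstract is omitted.

```python
# Minimal speech/silence frame labelling by normalized short-time energy,
# the simplest form of the segmentation step described above.
import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-30.0):
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    energy -= energy.max()                     # normalize to peak = 0 dB
    return energy > threshold_db               # True = speech frame

sr = 16000
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 200 * t)])
print(energy_vad(signal).astype(int))          # 0 = silence, 1 = speech
```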
Article
In this chapter, we introduce the design of our proposed framework, the so-called GFIKS (graphical framework to incorporate additional knowledge sources). It is based on a graphical model representation that makes use of additional knowledge sources in a statistical model as shown in Figure 3.1. This approach is meant to be broadly useful in the se...
Article
In this chapter, we demonstrate how the statistical speech recognition system may incorporate additional sources by utilizing GFIKS at different levels, HMM state and phonetic-unit. We also present some experimental results of incorporating various knowledge sources, including environmental variability (i.e., background noise information), speaker...
Chapter
This chapter describes the state-of-the-art technology for statistical ASR based on the pattern recognition paradigm. The most widely used core technology is the hidden Markov model (HMM). This is basically a Markov chain that characterizes a speech signal in a mathematically tractable way. Section 2.1 provides an overview of pattern recognition. I...
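For reference, the snippet below implements the HMM forward recursion that underlies the likelihood computations discussed in the chapter, on a toy discrete-observation model; real ASR systems replace the discrete emission table with Gaussian mixture or neural state output densities.

```python
# Sketch of the HMM forward algorithm on a toy discrete-observation model.
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(obs | model) for initial probs pi, transitions A, emissions B."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([1.0, 0.0])                      # start in state 0
A = np.array([[0.7, 0.3], [0.0, 1.0]])         # left-to-right topology
B = np.array([[0.8, 0.2], [0.3, 0.7]])         # P(symbol | state)
print(forward_likelihood([0, 0, 1, 1], pi, A, B))
```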
Chapter
The continuous growth of information technology is having an increasingly large impact on many aspects of our daily lives. The issue of communication via speech between human beings and information-processing machines is also becoming more important (Holmes and Holmes, 2001). A common dream is to realize a technology that allows humans to communica...
Chapter
In this last chapter, we draw our conclusions and discuss future directions toward developing a spoken language dialog system. This book offers a solution to enhance the robustness of a statistical automatic speech recognition system by incorporating various additional knowledge sources while keeping the training and recognition effort feasible. A ne...
Conference Paper
This paper reports on an ongoing study on modeling pronunciation variation for conversational speech recognition, in which the mapping from canonical pronunciations (baseforms) to the actual/realized phoneme (surface forms) is modeled by a Bayesian network. The advantage of this graphical model framework is that the probabilistic relationship betwe...
Conference Paper
Full-text available
In this paper, we describe a new language identification system based on the recently developed dynamic hidden Markov network (DHMnet). The DHMnet is a never-ending learning system and provides a high-resolution model of the speech space. Speech patterns are represented by paths through the network, and these paths, when properly labeled with language I...
Conference Paper
Full-text available
In this paper, we describe a new high-performance on-line speaker diarization system which works faster than real-time and has very low latency. It consists of several modules including voice activity detection, novel speaker detection, speaker gender and speaker identity classification. All modules share a set of Gaussian mixture models (GMM) repres...
Conference Paper
Full-text available
The paper outlines the development of a large vocabulary continuous speech recognition (LVCSR) system for the Indonesian language within the Asian speech translation (A-STAR) project. An overview of the A-STAR project and Indonesian language characteristics will be briefly described. We then focus on a discussion of the development of Indones...
Article
Full-text available
This paper introduces a general framework for incorporating additional sources of knowledge into an HMM-based statistical acoustic model. Since the knowledge sources are often derived from different domains, it may be difficult to formulate a probabilistic function of the model without learning the causal dependencies between the sources. We utiliz...
Conference Paper
Full-text available
We introduce a method of incorporating additional knowledge sources into an HMM-based statistical acoustic model. The probabilistic relationship between information sources is first learned through a Bayesian network to easily integrate any additional knowledge sources that might come from any domain and then the global joint probability density fu...
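The marginalization at the heart of such hybrid HMM/BN state models can be sketched as follows: an auxiliary discrete variable conditions the observation density and is summed out at recognition time. The Gaussians and probabilities below are illustrative placeholders, not values from the paper.

```python
# Sketch of the marginalization in a hybrid HMM/BN state output model:
# an auxiliary discrete variable (e.g. gender or noise condition)
# conditions the observation density and is summed out at recognition
# time. Densities and probabilities are illustrative placeholders.
import numpy as np
from scipy.stats import multivariate_normal

def state_likelihood(x, components, p_aux_given_state):
    """p(x | q) = sum_a p(x | q, a) * P(a | q)."""
    return sum(p_a * multivariate_normal.pdf(x, mean=m, cov=c)
               for (m, c), p_a in zip(components, p_aux_given_state))

dim = 2
components = [(np.zeros(dim), np.eye(dim)),        # a = 0 (e.g. male)
              (np.ones(dim), 0.5 * np.eye(dim))]   # a = 1 (e.g. female)
p_aux_given_state = [0.6, 0.4]                     # learned P(a | q)
print(state_likelihood(np.array([0.5, 0.5]), components, p_aux_given_state))
```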
Conference Paper
Full-text available
Current automatic speech recognition systems have two distinctive modes of operation: training and recognition. After the training, system parameters are fixed, and if a mismatch between training and testing conditions occurs, an adaptation procedure is commonly applied. However, the adaptation methods change the system parameters in such a way tha...
Conference Paper
Full-text available
Most current automatic speech recognition (ASR) systems use statistical data-driven methods based on hidden Markov models (HMMs). Although such approaches have proved to be efficient choices, ASR systems often still perform much worse than human listeners, especially in the presence of unexpected acoustic variability. Only a limited level of succe...
Conference Paper
Full-text available
We propose a new method of incorporating the additional knowledge of accent, gender, and wide-context dependency information into ASR systems by utilizing the advantages of Bayesian networks. First, we only incorporate pentaphone-context dependency information. After that, accent and gender information are also integrated. In this method, we can ea...
Conference Paper
This paper presents a new method of modeling pentaphone-context units using the hybrid HMM/BN acoustic modeling. Rather than modeling pentaphones explicitly, in this approach we extend the modeled phonetic context within the triphone framework, since the probabilistic dependencies between the triphone context unit and the second preceding/following...
Article
Full-text available
In this paper, we describe the ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages (Japanese and Chinese). There are three main modules of our S2ST system: large-vocabulary continuous speech recognition, machine text-to-text (T2T) translation, and text-to-speech syn...
Article
Full-text available
Over the last decade, the Bayesian approach has increased in popularity in many application areas. It uses a probabilistic framework which encodes our beliefs or actions in situations of uncertainty. Information from several models can also be combined based on the Bayesian framework to achieve better inference and to better account for modeling un...
Article
Full-text available
The most widely used acoustic unit in current automatic speech recognition systems is the triphone, which includes the immediate preceding and following phonetic contexts. Although triphones have proved to be an efficient choice, it is believed that they are insufficient in capturing all of the coarticulation effects. A wider phonetic context seems...
Article
Full-text available
In recent years, the number of studies investigating new directions in speech modeling that go beyond the conventional HMM has increased considerably. One promising approach is to use Bayesian Networks (BN) as speech models. Full recognition systems based on Dynamic BN as well as acoustic models using BN have been proposed lately. Our group at AT...
Article
Full-text available
In this paper, we describe a parallel decoding-based ASR system developed at ATR that is robust to noise type, SNR and speaking style. It is difficult to recognize speech affected by various factors, especially when an ASR system contains only a single acoustic model. One solution is to employ multiple acoustic models, one model for each different...
Article
Full-text available
Most of the current state-of-the-art speech recognition systems are based on speech signal parametrizations that crudely model the behavior of the human auditory system. However, little or no use is usually made of the knowledge on the human speech production system. A data-driven statistical approach to incorporate this knowledge into ASR would re...
Conference Paper
Full-text available
This paper presents a method for improving acoustic model precision by incorporating wide phonetic context units in speech recognition. The wide phonetic context model is constructed from several narrower context-dependent models based on the Bayesian framework. Such a composition is performed in order to avoid the crucial problem of a limited avai...
Conference Paper
Full-text available
Most of the current state-of-the-art speech recognition systems use the Hidden Markov Model (HMM) for modeling acoustical characteristics of a speech signal. In the first-order HMM, speech data are assumed to be independently and identically distributed (i.i.d.), meaning that there is no dependency between neighboring feature vectors. Another assu...
Conference Paper
Full-text available
This paper describes the speech recognition module of the speech-to-speech translation system being currently developed at ATR. It is a multi-lingual large vocabulary continuous speech recognition system supporting Japanese, English and Chinese languages. A corpus-based statistical approach was adopted for the system design. The database we collec...
Conference Paper
Full-text available
Most of the current state-of-the-art speech recognition systems are based on HMMs, which usually use a mixture of Gaussian functions as the state probability distribution model. It is common practice to use the EM algorithm for Gaussian mixture parameter learning. In this case, the learning is done in a "blind", data-driven way without taking into accoun...
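For contrast, the snippet below fits a Gaussian mixture with plain EM, the "blind" data-driven baseline the abstract refers to; the synthetic data and mixture size are arbitrary.

```python
# A plain EM-trained Gaussian mixture, the data-driven baseline the
# abstract contrasts with knowledge-guided training. Data is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (200, 13)), rng.normal(2, 1, (200, 13))])

gmm = GaussianMixture(n_components=2, covariance_type='diag').fit(X)  # EM training
log_likelihood = gmm.score_samples(X)   # per-frame log p(x)
```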
Conference Paper
Full-text available
In this paper, we describe an automatic speech recognition system where features extracted from the human speech production system, in the form of articulatory movement data, are effectively integrated into the acoustic model for improved recognition performance. The system is based on the hybrid HMM/BN model, which allows for easy integration of different speec...
Conference Paper
Full-text available
In current HMM based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, dynamic Bayesian networks (DBN) allow for easy combination of different features and make use of conditional dependencies between them. However, l...
Article
Full-text available
In current HMM based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, Bayesian Networks (BN) allow for easy combination of different continuous as well as discrete features by exploring conditional dependencies bet...
Article
Full-text available
This paper presents the ATR speech recognition system designed for the DARPA SPINE2 evaluation task. The system is capable of dealing with speech from highly variable, real-world noisy conditions and communication channels. A number of robust techniques are implemented, such as differential spectrum mel-scale cepstrum features, on-line MLLR adapta...
Article
Full-text available
This paper presents a study on modeling inter-word pauses to improve the robustness of acoustic models for recognizing noisy conversational speech. When precise contextual modeling is used for pauses, the frequent appearances and varying acoustics of pauses in noisy conversational speech make it a problem to automatically generate an accurate phone...