Branislav Gerazov
Ss. Cyril and Methodius University in Skopje · Faculty of Electrical Engineering and Information Technologies

Doctor of Engineering

About

103
Publications
17,578
Reads
242
Citations
Citations since 2017
44 Research Items
142 Citations

Publications (103)
Article
Full-text available
Computational approaches have an important role to play in understanding the complex process of speech acquisition, in general, and have recently been popular in studies of vocal learning in particular. In this article we suggest that two significant problems associated with imitative vocal learning of spoken language, the speaker normalisation and...
Conference Paper
Full-text available
Evoc-Learn is a system for simulating early vocal learning of spoken language in ways that can overcome some of the major bottlenecks in vocal learning. The system consists of VocalTractLab, a geometrical three-dimensional vocal tract model for simulating aeroacoustics and articulatory dynamics, a coarticulation model for controlling the temporal d...
Chapter
Detection and marking of road accident “black spots” are of paramount importance for road safety. Their detection and localization are based on accident statistics for certain road segments. Following established safety standards today, this approach of “waiting for accidents to happen” in order to prevent future events is unacceptable. Behind any...
Conference Paper
Full-text available
While numerous studies on automatic speech recognition have been published in recent years describing data augmentation strategies based on time or frequency domain signal processing, few works exist on the artificial extensions of training data sets using purely synthetic speech data. In this work, the German KIEL corpus was augmented with synthet...
Conference Paper
Full-text available
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings witho...
Preprint
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the main measure of learning success. Thereby, a novel approach for artificial early vocal learning is presented that utilizes deep neural network-...
Preprint
Full-text available
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the main measure of learning success. Thereby, a novel approach for artificial early vocal learning is presented that utilizes deep neural network-...
Preprint
Full-text available
Speech technology is becoming ever more ubiquitous with the advance of speech-enabled devices and services. The use of speech synthesis in Augmentative and Alternative Communication tools has facilitated the inclusion of individuals with speech impediments, allowing them to communicate with their surroundings using speech. Although there are numerous s...
Preprint
Full-text available
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we used c...
Article
Full-text available
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we u...
Preprint
Full-text available
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings witho...
Article
Full-text available
In this paper we revisited a database with measurements of the dielectric properties of rat muscles. Measurements were performed both in vivo and ex vivo; the latter were performed in tissues with varying levels of hydration. Dielectric property measurements were performed with an open-ended coaxial probe between the frequencies of 500 MHz and 50 G...
Conference Paper
Full-text available
The paper presents initial results in the design and development of a system for automatic conversion of text to sign language in Macedonian. The system will be an essential part of a larger system for the automatic generation of Macedonian sign language based on text. This system will facilitate the digital inclusion and will ease communication wi...
Conference Paper
Full-text available
MDD, or major depressive disorder, is a psychiatric disorder which is present in today's society on a large scale. The complexity of the symptoms may lead to misdiagnosis. Automatic detection of depression is a state-of-the-art problem which would objectify the diagnosis. The electroencephalogram (EEG), as a medical device, records the human brain activ...
Conference Paper
Full-text available
Spoken language is what distinguishes humans from other species. It is the easiest, yet the most complex action we perform. Precise and coordinated movement of multiple articulators is required in order to generate spoken sounds or phonemes. The organization of phonemes into complex semantic structures is currently a system of communication that pe...
Conference Paper
Full-text available
While the acoustic vowel space has been extensively studied in previous research, little is known about the high-dimensional articulatory space of vowels. The articulatory imaging techniques are limited to tracking only a few key articulators, leaving the rest of the articulators unmonitored. In the present study, we attempted to develop a detailed...
Preprint
Full-text available
Music transcription is the process of transcribing music audio into music notation. It is a field in which machines still cannot beat human performance. The main motivation for automatic music transcription is to make it possible for anyone playing a musical instrument to generate the music notes for a piece of music quickly and acc...
Article
When pitch is explicitly modelled for parametric speech synthesis, microprosodic variations of the fundamental frequency f0 are usually disregarded by current intonation models. While there are numerous studies dealing with the nature and the origin of microprosody, little research has been done on its audibility and its effect on the naturalness o...
Preprint
Full-text available
The labelling of speech corpora is a laborious and time-consuming process. The ProsoBeast Annotation Tool seeks to ease and accelerate this process by providing an interactive 2D representation of the prosodic landscape of the data, in which contours are distributed based on their similarity. This interactive map allows the user to inspect and labe...
Conference Paper
Swarming is a natural event that causes economic losses for beekeepers and impacts the economic and ecological balance. The sound emitted by honeybees can be used for detection of swarming. The aim of this study is to find an approach for early detection of swarming using sound analysis. The Short Time Fourier Transform (STFT) was investigated base...
Preprint
Full-text available
Everyone has a right to a voice. This paper describes a tool for smart devices, intended to simplify communication for those for whom speaking is a challenge every single day. The tool is aimed at children but it can be used by everyone who finds it useful. The application is divided into categories of words. Each category contains a different nu...
Preprint
Full-text available
Intelligent Traffic Surveillance systems have helped improve road safety through ensuring timely response to events such as traffic accidents and congestion. Our aim is to devise a robust system capable of traffic audio events detection in a real-life environment. At the core of this system is a deep learning model capable of detecting anomalous ev...
Conference Paper
Full-text available
The statistical representation of phonemes and phoneme sequences in the Macedonian language is interesting both from the linguistic aspect and from the aspect of speech technologies. Linguistically, it can be used to determine the rules of phonotactics in the Macedonian language and its comparative analysis in the continuum of world languages. In s...
Conference Paper
Full-text available
In this study, a state-of-the-art articulatory speech synthesiser was used as the basis for simulating the exploration of CV sounds imitating speech stimuli. By adopting a relevant kinematic model and systematically reducing the search space of consonant articulatory targets, intelligible CV sounds can be found. Derivative-free optimisation strateg...
Preprint
Full-text available
The way infants use auditory cues to learn to speak despite the acoustic mismatch of their vocal apparatus is a hot topic of scientific debate. The simulation of early vocal learning using articulatory speech synthesis offers a way towards gaining a deeper understanding of this process. One of the crucial parameters in these simulations is the choi...
Conference Paper
Full-text available
Prosody in speech is used to communicate a variety of linguistic, paralinguistic and non-linguistic information via multiparametric contours. The Superposition of Functional Contours (SFC) model is capable of extracting the average shape of these elementary contours through iterative analysis-by-synthesis training of neural network contour generato...
Conference Paper
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Preprint
Full-text available
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-...
Preprint
Full-text available
The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signals is still an open issue. The Superposition of Functional Contours (SFC) model proposes to decompose prosody into elementary multiparametric functional contours through the iterative training of neural net...
Conference Paper
Full-text available
The Superposition of Functional Contours (SFC) prosody model decomposes the intonation and duration contours into elementary contours that encode specific linguistic functions. It can be used to extract these functional contours at multiple linguistic levels. The PySFC system, which incorporates the SFC, can thus be used to analyse the significance...
Article
The detection of prosodic events, prosodic stress, and speech segmentation based on prosody have received much attention in the research community in the past decades. Prosody is relevant for both main areas of speech technology, text-to-speech synthesis and automatic speech recognition and understanding, and is exploited increasingly: besides prov...
Article
We propose a physiologically based intonation model using perceptual relevance. Motivated by speech synthesis from a speech-to-speech translation (S2ST) point of view, we aim at a language independent way of modelling intonation. The model presented in this paper can be seen as a generalisation of the command response (CR) model, albeit with the sa...
Article
Full-text available
The paper presents an analysis of the voicing of the phoneme /v/ in modern spoken Macedonian. The phoneme /v/ in the standard Macedonian language is classified as a fricative, but some of its characteristics separate it from the other phonemes in this group. This is due to the fact that this phoneme was once a sonorant. In a part of the Macedonian...
Conference Paper
Full-text available
The Weighted Correlation based Atom Decomposition (WCAD) is a recently proposed physiological intonation model that decomposes the pitch contour into elementary components — atoms. Since these atoms are said to correspond to laryngeal muscle activation, in theory they could be used to infer higher linguistic meaning from the pitch contour. One such...
Article
Full-text available
The decreasing cost per processing power in commercial off-the-shelf components and the decreasing size of the processing units enables Automatic Speech Recognition to be a trending topic in Embedded Systems, Internet of Things and Smart Home Applications. One of the first and most intuitive algorithms used for recognizing spoken words is Dynamic T...
Article
Full-text available
The prosody of the speech signal carries both linguistic and paralinguistic information. As such, there is a necessity of its modelling for the purpose of integrating it in speech technology systems. So far, there has been a multitude of proposed models focusing mainly on intonation, but a few also on energy and duration. The paper proposes an inte...
Conference Paper
Full-text available
This paper presents an overview of automatic speaker recognition with a focus on text-independent speaker identification. Speaker recognition is an important field of speech research that has been actively studied for the last five decades. The overview follows the progress made in speaker recognition since its beginnings up to today's state-of-the...
Conference Paper
Full-text available
The Weighted Correlation based Atom Decomposition (WCAD) algorithm is a technique for intonation modelling that uses a matching pursuit framework to decompose the F0 contour into a set of basic components, called atoms. The atoms attempt to model the physiological activation of the laryngeal muscles responsible for changes in F0. Recently, WCAD has bee...
Conference Paper
Full-text available
Prosody is a phenomenon that is crucial for numerous fields of speech research, accenting the importance of having a robust prosody model. A class of intonation models based on the physiology of pitch production is especially attractive for its inherent multilingual support. These models rely on an accurate model of muscle activation. Traditiona...
Conference Paper
Full-text available
Since the prosody of a spoken utterance carries information about its discourse function, salience, and speaker attitude, prosody models and prosody generation modules have played a crucial part in text-to-speech (TTS) synthesis systems from the beginning, especially those set not only on sounding natural, but also on showing emotion or particular...
Article
Full-text available
One of the most recently proposed techniques for modeling the prosody of an utterance is the decomposition of its pitch, duration and/or energy contour into physiologically motivated units called atoms, based on matching pursuit. Since this model is based on the physiology of the production of sentence intonation, it is essentially language indepen...
Conference Paper
Detection of emphatic words is important in speech research because of its implications in generating appropriate translation and prosody in speech-to-speech translation systems. These systems require the detection of the emphasized word in the input language, so that they can adequately place the emphasis in the output language. One of the most si...
Conference Paper
Full-text available
The modeling of intonation is crucial in the field of Text to Speech synthesis (TTS), where it makes the output sound natural, improving its intelligibility. Recently, we have proposed a generalization of the Command Response model that offers improvements in consistency and physiological plausibility. Our model is based on the decomposition of the...
Conference Paper
Full-text available
Emphatic word detection is important in the field of speech technology. It allows for improvements in human-machine dialogue systems directing their focus on the word emphasized by the user. Our previous research in emphasis detection was focused on the relative lengthening of syllable segments. The final syllable in the phrase has a large relative...
Conference Paper
Full-text available
Prosody is a crucial aspect of the speech signal and its modelling is of great importance for various speech technologies. Intonation models based on physiology rely on an accurate model of muscle activation. Although most of them are based on the spring-damper-mass (SDM) muscle model, the more complex Hill type model offers a more accurate represe...
Article
Full-text available
Emphatic word detection is important in the field of speech research, as it allows human-machine dialogue systems to focus on the word emphasized by the user. This problem has also sparked increasing interest with the advent of speech-to-speech translation systems. This paper presents methods of emphatic word detection based on syllable duration. T...
Article
Full-text available
Emphatic word detection is an area of speech research that has been made popular by the development of speech-to-speech translation systems. These systems require the detection of the emphasized word in the input language, so that they can generate the appropriate translation and prosody in the output. An algorithm for emphatic word detection is de...
Article
Full-text available
Prosody modelling is an important subject of research in the speech recognition field. It has found its application in systems for automatic speech recognition, text-to-speech synthesis and recently speech-to-speech translation. Emphatic expression mostly depends on the dynamics of speaking, expressed through intonation and energy during the time of...
Conference Paper
Full-text available
Intonation modelling has been an integral part of text-to-speech systems from their very beginnings. This has led to the proliferation of various intonation models, each with its own relative strengths and weaknesses. Only a few of these intonation models are based on physiology, despite the advantage that such models are language independent. We propose...
Conference Paper
Full-text available
Speech emotion recognition (SER) is an area gaining increasing interest in the scientific community. Indeed, emphatic speech containing human emotion remains the holy grail in both automatic speech recognition (ASR) and text to speech (TTS) synthesis. Systems that understand human emotion could be of greater help to the user, but also could aid the...
Article
Full-text available
Noise-robustness has become a crucial parameter in Automatic Speech Recognition (ASR) systems today with their increased use in noise-filled real-world environments. One way to address this issue is to develop features that are innately noise-robust. The Kernel Power flow Orientation Coefficients (KPOCs) are a novel feature set based on spectro-tem...
Article
Full-text available
The digitization of the cultural heritage is of paramount importance for its preservation for the generations to come. At the same time, the digitization of audio and video materials, as well as archive photos, allows easy access to this content both for the national research community and its international partners. For proper preservation of the...
Conference Paper
Full-text available
Prosody has been largely neglected in current speech technologies due to its complexity and the lack of an all encompassing ubiquitous interlingual prosody model. Current prosody models used in Text-to-Speech (TTS) systems are proficient enough to give natural sounding prosody to the synthesized speech, but fail to offer a deeper understanding of p...
Conference Paper
Full-text available
Spectro-temporal features have shown great promise with respect to improving the noise-robustness of Automatic Speech Recognition (ASR) systems. The common approach uses a bank of 2D Gabor filters to process the speech signal spectrogram and generate the output feature vector. This approach suffers from generating a large number of coefficients, th...
Conference Paper
Full-text available
This is an overview of a Joint Research Project within the Scientific co-operation between Eastern Europe and Switzerland (SCOPES) Program of the Swiss National Science Foundation (SNSF) and Swiss Agency for Development and Cooperation (SDC). Within the SP2 SCOPES Project on Speech Prosody, in the course of the following two years, the four partner...
Conference Paper
Full-text available
Automatic Speech Recognition (ASR) systems have been built for most of the developed world's languages. There still remains a large body of languages that lack a dedicated ASR system. A speaker independent small-vocabulary ASR system for Macedonian is built for a smart-phone Human-Machine Interface application scenario. The system is constructed wi...
Conference Paper
Full-text available
The significance of using noise-robust features in automatic speech recognition (ASR) systems is emphasized with their increasing deployment in mobile intelligent platforms used in the noisy scenarios of everyday life. The choice of the filter bank (FB) in the feature extraction process can have a significant impact on ASR system performance in noi...
Conference Paper
Full-text available
Firearm gunshot acoustics is an issue of interest to researchers in the field of forensics. Firearm identification based on audio recordings of a criminal event is a topic that has not been thoroughly studied. A methodology is proposed to differentiate between firearms solely based on the acoustic signature of their gunshot. The approach includes a...
Conference Paper
Full-text available
The fractal dimension in all of its definitions is often used to indicate the level of complexity of some curve, figure or surface. In this paper we estimated the Minkowski – Bouligand fractal dimension of the music signals produced by several musical instruments, most of them typical for the Macedonian traditional music. The estimation enabled us...