Niksa Jakovljevic, PhD
University of Novi Sad · Faculty of Technical Sciences
65 Publications · 16,666 Reads · 647 Citations
Publications (65)
Speech technologies such as text-to-speech (TTS) and speech-to-text (STT) are becoming increasingly applicable. Significant improvements in their quality are driven by advancements in deep machine learning. The ability of devices to deeply understand human speech and generate appropriate responses is a hallmark of AI capabilities. Developing speech...
In Cancer Imaging research, data collection, integration, and utilization to generate multicentric data repositories pose a series of challenges such as data harmonization, quality, and suitability. This work presents the INCISIVE project approach towards assessing the quality of cancer imaging data and clinical (meta)data to serve as a map ensurin...
The increasing incidence of breast cancer and the importance of early-stage diagnosis drive advances in the analysis of screening imaging modalities, assisting efficient patient prioritization and decision support. This work addresses the breast lesion segmentation problem in mammograms collected within the INCISIVE project from multiple hospital centers, aiming to...
The research presented in the paper addresses challenges related to the development of more flexible systems for speech communication between humans and machines. Specifically, the paper presents the main results of the speech technology research group at the Faculty of Technical Sciences, University of Novi Sad, Serbia, in the development of a mul...
Emotional speech recognition and synthesis of expressive speech are highly dependent on the availability of emotional speech corpora. In this paper, we present the creation and verification of the Serbian Emotional Amateur Cellphone Speech Corpus (SEAC), which was released by the University of Novi Sad, Faculty of Technical Sciences in 2022, as th...
Finding new ways to cost-effectively facilitate population screening and improve cancer diagnoses at an early stage supported by data-driven AI models provides unprecedented opportunities to reduce cancer related mortality. This work presents the INCISIVE project initiative towards enhancing AI solutions for health imaging by unifying, harmonizing,...
Anatomical and dynamical connectivity are essential to healthy brain function. However, quantifying variations in connectivity across conditions or between patient populations and appraising their functional significance are highly non-trivial tasks. Here we show that link ranking differences induce specific geometries in a convenient auxiliary spa...
This paper presents the project Central Audio Library of the University of Novi Sad (CABUNS), aimed at automated creation of audio editions of textbooks, presentations and other course material using the new technology of text-to-speech synthesis in the Serbian language. The paper describes the architecture and the features of the developed system,...
This paper presents a system for detecting stress in broiler chickens based on analysis of the sound of their vocalizations. The feature set on which the system bases its recognition comprises: energy, power, root mean square, jitter, shimmer, mean pitch, harmonics-to-noise ratio, mel filter bank outputs, and mel-frequency cepstral coefficients....
The paper presents the performance of our broiler stress detection system evaluated on an extended audio database. The extended database allows evaluation at both the 50 ms frame level and the 1-minute segment level. The system is based on the features used in speech signal processing (i.e. speech quality evaluation, emotion recognition, and speaker and speech...
The paper presents a system for stress detection in broiler chickens using audio data. The system consists of four classifiers adapted to four age groups of chickens (one for each week). These classifiers are based on support vector machines, and as input they use features for voice quality evaluation and speech emotion recognition. Feat...
Speech technologies have been developed for decades as a typical signal processing area, while the last decade has brought huge progress based on new machine learning paradigms. Owing not only to their intrinsic complexity but also to their relation with cognitive sciences, speech technologies are now viewed as a prime example of interdisciplinar...
Objective:
Over the last few decades, there has been significant interest in the automatic analysis of respiratory sounds. However, currently there are no publicly available large databases with which new algorithms can be evaluated and compared. Further developments in the field are dependent on the creation of such databases.
Approach:
This pa...
The paper describes a multi-target speaker detection and identification system based on a fusion of probabilistic linear discriminant analysis (PLDA) and a deep neural network (DNN). PLDA is the state-of-the-art approach used in speaker recognition, so we selected it as our baseline. We tried to develop a DNN-based approach that would be more accu...
The paper presents the project Central Audio-Library of the University of Novi Sad (CABUNS), aimed at automated creation of audio-editions of textbooks, presentations and other course material using the new technology of text-to-speech synthesis in the Serbian language. The paper describes the architecture and the features of the developed system, f...
A reliable automatic categorization of respiratory effort is paramount for sleep-disordered breathing characterization from polysomnography. A respiratory effort related arousal (RERA) is a subtle breathing obstruction associated with an arousal. For identification of RERAs we focused on: chest and abdomen EMGs, airflow, and EEG; monitoring changes...
This paper presents a method based on hidden Markov models in combination with Gaussian mixture models for classification of respiratory sounds into normal, wheeze and crackle classes. Input features are mel-frequency cepstral coefficients extracted in the range between 50 Hz and 2000 Hz in combination with their first derivatives. The audio files...
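The "first derivatives" of cepstral features mentioned above are commonly computed with a regression over neighbouring frames. The following is a minimal Python sketch of that common formula, not necessarily the exact variant used in the paper; the function name `delta`, the half-window `n`, and the edge handling (repeating the first and last frame) are assumptions made for illustration.

```python
from typing import List


def delta(features: List[List[float]], n: int = 2) -> List[List[float]]:
    """Regression-based first derivatives of a feature sequence.

    features[t][d] is coefficient d at frame t. The half-window `n` and the
    edge handling (clamping to the first/last frame) are assumptions.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    num_frames = len(features)
    # Normalisation term: 2 * sum of squared offsets 1..n.
    denom = 2 * sum(i * i for i in range(1, n + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for i in range(1, n + 1):
                nxt = features[min(t + i, num_frames - 1)][d]
                prv = features[max(t - i, 0)][d]
                acc += i * (nxt - prv)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(delta(ramp))
```

On a sequence that rises by one unit per frame, the interior frames get a derivative of 1, while the clamped edges give smaller values. The static and delta vectors would then be concatenated per frame before model training.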
This paper presents a solution for voice-based text input in Serbian on a smartphone, based on the Android speech application programming interface. The implemented input method has two modes of operation: a text input mode (by voice) and a mode for correcting misrecognized words (using a QWERTY keyboard). The source code of this application...
The paper describes selected aspects of the specification and design of the Central Audio Library of the University of Novi Sad (CABUNS). The goal of the project is to improve the educational process by creating audio editions of books and lectures. The application is implemented as a web portal that applies automatic speech synthesis for the Serbian language and generates appropriate audio-visual...
The paper presents the results of an evaluation of covariance matrix and i-vector based speaker identification methods on the Serbian S70W100s120 database. An open-set speaker identification evaluation scheme was adopted. The number of target speakers and the number of impostors were 20 and 60, respectively. Additional utterances from 41 speakers were used for...
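In an open-set scheme, a test utterance may come from an impostor, so the system must be able to reject it instead of always returning the closest target. The decision rule can be sketched in Python as follows; the function name `identify_open_set`, the score dictionary, and the single global threshold are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional


def identify_open_set(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring target speaker, or None to reject as an impostor.

    `scores` maps a target speaker ID to a similarity score (higher is better).
    Using one global threshold is an assumption made for illustration.
    """
    if not scores:
        return None
    best = max(scores, key=lambda speaker: scores[speaker])
    return best if scores[best] >= threshold else None


if __name__ == "__main__":
    trial = {"spk01": 0.2, "spk02": 1.5, "spk03": -0.7}
    print(identify_open_set(trial, threshold=1.0))
    print(identify_open_set(trial, threshold=2.0))
```

With the threshold at 1.0 the trial is attributed to `spk02`; raising it to 2.0 rejects the trial. In practice the threshold would be tuned on held-out data to balance false acceptances against false rejections.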
Although the importance of contextual information in speech recognition has been acknowledged for a long time now, it has remained clearly underutilized even in state-of-the-art speech recognition systems. This article introduces a novel, methodologically hybrid approach to the research question of context-dependent speech recognition in human–mach...
The paper reports on the objective evaluation and comparison of two noise estimation algorithms for noisy speech signals. Both algorithms are based on the observation that local minima in a noisy speech spectrogram are close to the power level of the noise signal. The first algorithm directly searches the spectrogram for the local minima and those values...
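The underlying observation, that per-frequency minima of the spectrogram track the noise level, can be illustrated with a simple sliding-window minimum. This is a minimal Python sketch of the general idea only, not either of the two algorithms evaluated in the paper; the function name `estimate_noise_floor` and the `window` parameter are assumptions.

```python
from typing import List


def estimate_noise_floor(power_spec: List[List[float]], window: int = 5) -> List[List[float]]:
    """Per-bin noise estimate as the minimum power over the most recent frames.

    power_spec[t][k] is the power at frame t and frequency bin k. The window
    length (in frames) is a hypothetical parameter chosen for illustration.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    noise = []
    for t in range(len(power_spec)):
        # Frames from max(0, t - window + 1) up to and including t.
        recent = power_spec[max(0, t - window + 1):t + 1]
        num_bins = len(power_spec[t])
        noise.append([min(frame[k] for frame in recent) for k in range(num_bins)])
    return noise


if __name__ == "__main__":
    spec = [[4.0, 1.0], [2.0, 3.0], [5.0, 0.5], [6.0, 2.0]]
    print(estimate_noise_floor(spec, window=2))
```

A longer window gives a smoother estimate but reacts more slowly when the noise level rises, because an old low value keeps dominating the minimum until it leaves the window.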
In this paper, a novel variant of an automatic phonetic segmentation procedure is presented, especially useful if data is scarce. The procedure uses the Kaldi speech recognition toolkit as its basis, and combines and modifies several existing methods and Kaldi recipes. Both the specifics of model training and test data alignment are explained in de...
This paper presents a Voice Assistant, an Android based personal assistant application for mobile phones, allowing voice control for the Serbian language. The native interface is provided for a large vocabulary continuous speech recognition system based on the open-source Kaldi speech recognition toolkit. Several acoustic models were trained using...
This paper presents a deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) system for Serbian, developed using the open-source Kaldi speech recognition toolkit. The DNNs are initialized using stacked restricted Boltzmann machines (RBMs) and trained using cross-entropy as the objective function and the standard erro...
Voice Assistant is a personal assistant mobile phone application for the Serbian language that allows natural communication between the phone and the user [1]. The application provides an array of essential commands for a fast and efficient usage of the device in a number of tasks, e.g., messaging, calling, handling of contacts, changing settings,...
This paper proposes a model which approximates full covariance matrices in Gaussian mixture models (GMM) with a reduced number of parameters and computations required for likelihood evaluations. In the proposed model inverse covariance (precision) matrices are approximated using sparsely represented eigenvectors, i.e. each eigenvector of a covarian...
Regression trees are used in speaker and continuous speech recognition systems for the purpose of adaptation of speaker-independent models according to the acoustic properties of adaptation data for a given speaker. This paper compares the performances of a large vocabulary continuous speech recognition system for the Serbian language using variati...
The paper presents a speaker detection system based on phoneme-specific hidden Markov models in combination with Gaussian mixture models. Our motivation stems from the fact that a phoneme-specific HMM system can model temporal variations and makes it possible to weight the scores of specific phonemes, as well as to prune efficiently. The performance...
This paper considers the research question of developing user-aware and adaptive conversational agents. The conversational agent is a system which is user-aware to the extent that it recognizes the user identity and his/her emotional states that are relevant in a given interaction domain. The conversational agent is user-adaptive to the extent that...
In the automatic speech recognition task, the dominant approach is the statistical framework based on hidden Markov models in combination with Gaussian mixture models. The issues which should be solved are: how to obtain a statistically efficient estimation of model parameters, especially covariance matrix, whose number of parameters is proportiona...
Speech recognition systems are commonly modelled by hidden Markov models with Gaussian mixture models as observation density functions. These models have a significant number of parameters, which usually leads to the problem of data sparsity, especially for under-resourced languages such as Serbian. One of the ways to overcome the problem of data s...
Unlike other new technologies, most speech technologies are heavily language dependent and have to be developed separately for each language. The paper gives a detailed description of speech and language resources for Serbian and kindred South Slavic languages developed during the last decade within joint projects of the Faculty of Technical Scienc...
The chapter presents the results of a comparison of the two existing decision tree based algorithms for modeling of prosodic features of Serbian, namely, intonation contour and phonetic segment durations, within text-to-speech systems. The first approach is used within the AlfaNum text-to-speech system based on concatenation of waveforms selected a...
The paper presents a novel split-and-merge algorithm for hierarchical clustering of Gaussian mixture models, which tends to improve on the local optimal solution determined by the initial constellation. It is initialized by local optimal parameters obtained by using a baseline approach similar to k-means, and it tends to approach more closely to th...
This paper presents a study of speaker recognition accuracy depending on the choice of features, window width and model complexity. The standard features were considered, such as linear and perceptual prediction coefficients (LPC and PLP) and mel-frequency cepstral coefficients (MFCC). Gaussian mixture model (GMM), with the use of HTK tools, was ch...
Technical solution.
In automatic speech recognition (ASR) systems, the ability to evaluate the reliability of a result is extremely important for speech applications. Thus, in many real-world applications with ambient noise, speaker variations, channel distortions, etc., a confidence measure must be computed for any recognition decision made by ASR systems. In this pap...
In this paper, the impact of pitch on the variability of MFCCs, and their influence on the performance of an automatic speech recognition system, is analyzed. When a speaker has a high pitch, the distance between adjacent harmonics in the spectrum of voiced phonemes is larger, which results in a poorer description of the spectral envelope...
The paper presents the module for automatic generation of prosodic features of synthesized speech, namely, f0 targets and phonetic segment durations, within the speech synthesizer AlfaNumTTS, the most sophisticated speech synthesis system for the Serbo-Croatian language to date. The module is based on regression trees trained on a studio recorded sing...
This paper describes a decoder for large vocabulary continuous speech recognition developed at the Faculty of Technical Sciences, University of Novi Sad. The decoder is an open source solution written in the C++ programming language. The structure of the decoder is modular, allowing relatively simple modification and expansion of the code. It can b...
The purpose of Gaussian selection is to increase the speed of a Continuous Speech Recognition (CSR) system without degrading its recognition accuracy. In this paper we present some improvements towards finding the optimal setting for the previously developed Gaussian selection scheme. We have targeted the following: type of clustering in the means...
This paper examines the accuracy of speaker identification on telephone-quality voice signals. The speaker recognizer was implemented using HTK. The influence of the considered telephone channels on the transmitted voice signal is seen through its basic characteristics, the types of codecs applied, and the effects caused by the condition of th...
In this paper a novel algorithm for Gaussian Selection (GS) of mixtures used in a continuous speech recognition system is presented. The system is based on hidden Markov models (HMM), using Gaussian mixtures with full covariance matrices as output distributions. The purpose of Gaussian selection is to increase the speed of a speech recognition syst...
Both ASR and TTS systems described in this chapter have been originally developed for the Serbian language. However, linguistic similarities among South Slavic languages have allowed the adaptation of this system to other South Slavic languages, with various degrees of intervention needed. As for ASR, adaptation to Bosnian and Croatian was very sim...
The number of observations which are the basis for parameter estimation plays an important role in the quality of acoustic models. HMM based automatic speech recognition (ASR) systems generally have to cope with an insufficient number of observations for a good estimate. One way of tackling this problem is a well known procedure of state-tying, whi...
This paper presents the performance of automatic speech recognition systems that use vocal tract length normalization (VTN). Besides the standard procedure for VTN coefficient estimation, several variants based on robust statistical methods are introduced. All systems that use VTN performed better than the reference systems, while the best performance wa...
In this paper a novel method for energy normalization is presented. The objective of this method is to remove unwanted energy variations caused by different microphone gains, various loudness levels across speakers, as well as changes of single speaker loudness level over time. The solution presented here is based on principles used in automatic ga...
This paper gives a brief review of the development of systems for automatic speech recognition and text-to-speech synthesis in the Serbian, Croatian and Macedonian languages at the Faculty of Engineering, University of Novi Sad, Serbia. The systems developed within this project enable two-way communication between humans and machines. These systems are...
This paper presents our initial results in a new approach to vocal tract normalization (VTN). In experiments based on continuous automatic speech recognition (ASR), the VTN procedure is in general carried out in both the training and the test phase. In the training phase it is used to obtain speaker independent acoustic models of phones. In the test phase i...
This paper presents the training procedure for our continuous speech recognition system in Serbian, based on hidden Markov models. We focus on solutions to the problem of insufficient training data and on the improvements obtained by the maximum likelihood training algorithm.
This paper presents our study on different phonetic segmentation methods based on hidden Markov models evaluated against a Hebrew speech corpus. We investigated methods for fully automatic phonetic segmentation using only the corpus which should be segmented and automatically generated phonetic transcriptions. A new method for phonetic boundary cor...