
Zdravko Kacic, Dr.
Head of Department at University of Maribor
About
190 Publications
23,123 Reads
1,494 Citations
Publications (190)
In order to recreate viable and human-like conversational responses, the artificial entity, i.e., an embodied conversational agent, must express correlated speech (verbal) and gesture (non-verbal) responses in spoken social interaction. Most of the existing frameworks focus on intent planning and behavior planning. The realization, however, is lef...
We know how important health is and that it is associated with a healthy lifestyle. That is why we understand the meaning of the Latin proverb: “If you lack a physician, let your physicians be a light-hearted soul, rest and moderate life”. However, nowadays in the urban environment of a developed society, when almost everything is available to us,...
When human-TV interaction is performed via a remote controller and mobile devices only, the interactions tend to be mechanical, dreary and uninformative. To achieve a more advanced, more human-human-like interaction, we introduce virtual agent technology as a feedback interface. Verbal and co-verbal gestures are linked through complex mental pro...
This paper outlines a novel framework that has been designed to create a repository of “gestures” for embodied conversational agents. By utilizing it, the virtual agents can sculpt conversational expressions incorporating both verbal and non-verbal cues. The 3D representations of gestures are captured in EVA Corpus, and then stored as a repository...
This paper presents a method of binocular visual stimulation for brain–computer interfaces (BCIs) based on steady-state visual evoked potentials (SSVEPs) using phase-coded symbols. The proposed method’s emphasis is on a binocular phase-coded visual stimulus, which is based on the phase differences between the left- and right-eye stimuli, and a symb...
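The snippet below is a minimal, illustrative sketch of how a binocular phase-coded flicker pair could be generated: two waveforms at the same frequency whose only difference is a fixed phase offset. The frequency, phase difference, refresh rate and sinusoidal shape are assumptions for illustration, not parameters taken from the paper.

```python
import numpy as np

def phase_coded_stimuli(freq_hz=10.0, phase_diff_rad=np.pi / 2,
                        refresh_hz=60.0, duration_s=2.0):
    """Generate illustrative left/right-eye flicker waveforms whose only
    difference is a fixed phase offset (the 'phase code' of one symbol)."""
    t = np.arange(0, duration_s, 1.0 / refresh_hz)
    left = 0.5 * (1 + np.sin(2 * np.pi * freq_hz * t))                    # left-eye luminance
    right = 0.5 * (1 + np.sin(2 * np.pi * freq_hz * t + phase_diff_rad))  # right-eye luminance
    return t, left, right

t, left, right = phase_coded_stimuli()
```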
In order to engage with a human user on a more personal level, natural HCI is starting to virtualize itself and is utilizing the potential of entities resembling human collocutors in interaction. In particular, through human-likeness, these entities represent multimodal interaction models which are able to adapt to the user’s context and to facili...
Conversation is becoming one of the key interaction modes in HMI. As a result, conversational agents (CAs) have become an important tool in various everyday scenarios. From Apple and Microsoft to Amazon, Google, and Facebook, all have adapted their own variations of CAs. The CAs range from chatbots and 2D, cartoon-like implementations of talking...
The demand for solving large-scale complex problems continues to grow. Many real-world problems are described by a large number of variables that interact with each other in a complex way. The dimensionality of the problem has a direct impact on the computational cost of the optimization. During the last two decades, differential evolution has been...
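As a brief illustration of the differential evolution algorithm mentioned above, here is a compact, generic DE/rand/1/bin loop for box-constrained minimization; it is a textbook sketch, not the specific large-scale variant studied in the paper, and all control parameters are illustrative.

```python
import numpy as np

def differential_evolution(fobj, bounds, pop_size=30, F=0.8, CR=0.9, generations=200):
    """Minimal DE/rand/1/bin loop for a box-constrained minimization problem."""
    rng = np.random.default_rng(0)
    dim = len(bounds)
    lo, hi = np.array(bounds).T
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fitness = np.array([fobj(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # pick three distinct individuals different from i
            idx = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(idx, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)          # mutation
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True                    # ensure at least one gene crosses
            trial = np.where(cross, mutant, pop[i])            # binomial crossover
            f_trial = fobj(trial)
            if f_trial <= fitness[i]:                          # greedy selection
                pop[i], fitness[i] = trial, f_trial
    best = np.argmin(fitness)
    return pop[best], fitness[best]

# Example: 10-dimensional sphere function
best_x, best_f = differential_evolution(lambda x: float(np.sum(x ** 2)), [(-5, 5)] * 10)
```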
WSEAS Transactions on Environment and Development
Embodied conversational agents are virtual entities that tend to imitate as many features of face-to-face dialogs as possible. In order to achieve this goal, the ability to reproduce synchronized verbal and co-verbal signals coupled into conversational behavior becomes essential. Further, signals such as social cues, attitude (emotions), personality,...
Multimodality and multimodal communication form a rapidly evolving research field addressed by scientists working from various perspectives, from psycho-sociological fields, anthropology and linguistics, to communication and multimodal interfaces, companions, smart homes and ambient assisted living, etc. Multimodality in human-machine interaction is not...
This study is part of an ongoing effort to empirically investigate in detail the relations between verbal and co-verbal behavior expressed during multi-speaker, highly spontaneous and affective face-to-face conversations. The main motivation for this study is to be able to create natural co-verbal resources for automatic synthesis of highly n...
In the paper, a speech-based platform for intelligent ambience and/or supportive environment applications is presented. The platform has a distributed architecture, which enables extended connectivity and support for multiple intelligent ambience services. The mobile unit Genesis is an integral part of the distributed platform, enabling interaction...
The incorporation of grammatical information into speech recognition systems is often used to increase performance in morphologically rich languages. However, this introduces demands for sufficiently large training corpora and proper methods of using the additional information. In this paper, we present a method for building factored language model...
Full access: https://authors.elsevier.com/a/1T~NG3OWJ8hFRu
As a result of the convergence of different services delivered over the internet protocol, internet protocol television (IPTV) may be regarded as one of the most widespread user interfaces, accepted by a highly diverse user domain. Every generation, from children to the elderly, can use...
In this chapter basic concepts of speech recognition are presented. Acoustic processing, acoustic modeling and search algorithms are briefly described. A more detailed explanation is given on language modeling. Afterwards some features of inflective languages are described and how these features are important in the process of designing speech reco...
In this chapter we will present established methods to measure the accuracy of a speech recognition system: word error rate, character error rate, and also lemma error rate, as well as their usefulness. We will describe the Levenshtein distance used for sentence alignment and calculating word error rate. Some issues that arise from its use that aff...
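Since the chapter centers on word error rate computed via the Levenshtein distance, a small reference implementation may help; the function below is a generic sketch, not the book's code.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via Levenshtein alignment of two token sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
```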
In the previous chapters we described the use of language models in speech recognition, morphosyntactic description (MSD) tagging, error rates and error analysis based on information from MSD tags. In this chapter we will present aspects of speech recognition applications that can be related to those factors. We will first describe general considerations on...
In this chapter we will present an example of how some of the techniques and ideas presented in the previous chapters can be applied. We will do this in the form of a full example. We will perform LVCSR on a Broadcast News database and search for keywords inside N-best list results. After a first run we will present optimization to improve the performa...
This book covers language modeling and automatic speech recognition for inflective languages (e.g. Slavic languages), which represent roughly half of the languages spoken in Europe. These languages do not perform as well as English in speech recognition systems and it is therefore harder to develop an application with sufficient quality for the end...
Version 1.0.4 released. Version 1.0.4 of the Meettell mobile application features the moderator functionality. By entering the moderator session code, any participant can become a moderator of a discussion and can efficiently manage the discussion with his/her mobile phone. The Meettell mobile application now supports all the available discussion manage...
An advanced new system for managing discussions at meetings, conferences and other events.
In this paper a new approach for recognizing emotional speech from audio recordings is presented. In order to obtain the optimum processing window width for feature extraction and to achieve the highest level of recognition rates, a trade-off between time and frequency resolution must be made. At this point, we define a new procedure that combines...
This paper presents the problems of implementation and adjustment (calibration) of a metrology engine embedded in NXP's EM773 series microcontroller. The metrology engine is used in a smart metering application to collect data about energy utilization and is controlled with the use of metrology engine adjustment (calibration) parameters. The aim of...
This article presents a new method for detecting objects on water surfaces using colour elimination based on image erosion with a morphological variable. The proposed object detection method includes definitions of the target’s colour space and colour deviation using Euclidean distance. It also introduces a procedure for image erosion using a mor...
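A simplified sketch of the colour-elimination idea described above is given below: pixels are flagged by their Euclidean distance from a reference water colour and the resulting mask is cleaned with binary erosion. The fixed structuring element used here stands in for the paper's morphological variable, and all colours and thresholds are illustrative.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def detect_non_water(image_rgb, water_color, max_deviation=30.0, erosion_size=3):
    """Flag pixels whose Euclidean distance from the reference water colour
    exceeds a deviation threshold, then clean the mask with binary erosion."""
    diff = image_rgb.astype(float) - np.asarray(water_color, dtype=float)
    distance = np.sqrt((diff ** 2).sum(axis=-1))          # per-pixel colour distance
    mask = distance > max_deviation                        # True = candidate object pixel
    structure = np.ones((erosion_size, erosion_size), dtype=bool)
    return binary_erosion(mask, structure=structure)       # suppress small speckles

# Usage with a synthetic frame: mostly 'water' blue with one bright patch.
frame = np.full((120, 160, 3), (20, 60, 120), dtype=np.uint8)
frame[50:70, 80:110] = (200, 180, 60)
object_mask = detect_non_water(frame, water_color=(20, 60, 120))
```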
Fingerprint image enhancement is a key step in the Automated Fingerprint Identification System (AFIS). Because of different factors that affect the image, such as skin condition (very dry or moist, damaged or worn down skin, etc.), sensor noise, irregular print on the sensor, etc., the fingerprint image needs to be enhanced so that the structures o...
This paper addresses the problem of statistical machine translation between highly inflected languages. Even when dealing with closely-related language pairs, statistical machine translation encounters problems if the parallel corpus is not big enough. To reduce the problem of data sparsity, we use the approach called factored translation, which ha...
In speech recognition systems language models are used to estimate the probabilities of word sequences. In this paper special emphasis is given to numerals–words that express numbers. One reason for this is the fact that in a practical application a falsely recognized numeral can change important content information inside the sentence more than ot...
Fingerprint enhancement is a key step in the Automated Fingerprint Identification System. Because of poor quality of a fingerprint the algorithm for feature extraction may extract features incorrectly, which affects incorrect fingerprint match and consequently inefficient fingerprint-based identity verification. Fingerprint image enhancement techni...
Fingerprint image enhancement is of key importance for the efficiency of the automated fingerprint identification system. Before we can enhance a fingerprint image with contextual filters, we need to enhance fingerprint image contrast and readability with non-directional filters. This article provides an analysis of the influence of non-directional...
Multimodal interfaces incorporating embodied conversational agents enable the development of novel concepts regarding interaction management tactics within responsive human-machine interfaces. Such interfaces provide several additional non-verbal communication channels, such as: natural visualized speech, facial expression, and different body motio...
The paper presents the novel design of a one-pass large vocabulary continuous-speech recognition decoder engine, named SPREAD. The decoder is based on a time-synchronous beam-search approach, including statically expanded cross-word triphone contexts. An approach using efficient tuple structures is proposed for the construction of the complete sear...
Several systems with multimodal interfaces are already available, and they allow for a more natural and more advanced exchange of information between man and a machine. Nevertheless, the television domain is still undergoing an innovation/development phase within which standard linear television is further enhanced with several novel technologies....
The main goal of using non-verbal modalities together with the general text-to-speech (TTS) system is to better emulate the human-like course of the interaction between users and the UMB-SmartTV platform. Namely, when human-TV interaction is supported by TTS only, the interactions tend to remain less functional and less human-like. In order to achiev...
This paper evaluates the impact of combined transcoding and packet loss degradation on speech as input for the interactive voice response service (IVR) and proposes a method for classification of user input according to speech quality. Careful optimization of a communication system and all of its segments need to be considered, as the quality of th...
Viseme recognition from speech is one of the methods needed to operate a talking head system, which can be used in various areas, such as mobile services and applications, gaming, the entertainment industry, and so on. This paper proposes a novel method for generating acoustic models for viseme recognition from speech. The viseme acoustic models we...
IPTV services are still evolving and try to bring ICT novelties into the IPTV environment. Several initiatives are focused on providing more personalized interactivity to standard TV sets and on developing more personalized interactive applications for STBs. Nevertheless, the personalization and interactivity are usually limited to context-awarene...
The article provides detailed information about the procedures for building Slovenian lexica within the LC-STAR project, as well as about the size of those lexica. The University of Maribor joined the LC-STAR project in order to provide appropriate language resources for developing speech-to-speech translation technology for the Slovenian language....
This paper investigates the influence of hypothesis length in N-best list rescoring. It is theoretically explained why language models prefer shorter hypotheses. This bias impacts on the word insertion penalty used in continuous speech recognition. The theoretical findings are confirmed by experiments. Parameter optimization performed on the Sloven...
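To make the role of the word insertion penalty concrete, the sketch below rescores an N-best list with a per-word bonus added to the weighted acoustic and language-model scores; the weights, the toy hypotheses and the tuple layout are illustrative assumptions, not values from the paper.

```python
def rescore_nbest(nbest, lm_weight=10.0, word_insertion_penalty=0.5):
    """Re-rank N-best hypotheses with a weighted LM score and a per-word bonus
    that counteracts the language model's preference for short hypotheses.

    nbest: list of (words, acoustic_logprob, lm_logprob) tuples.
    """
    def total_score(entry):
        words, am_logprob, lm_logprob = entry
        return am_logprob + lm_weight * lm_logprob + word_insertion_penalty * len(words)

    return sorted(nbest, key=total_score, reverse=True)

nbest = [
    (["danes", "je", "lep", "dan"], -120.0, -8.5),
    (["danes", "lep", "dan"],       -121.0, -7.9),
]
best_words = rescore_nbest(nbest)[0][0]
```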
This paper presents novel features and an architecture for an automatic on-line acoustic classification and segmentation system. The system includes speech/non-speech segmentation (with the emphasis on accurate speech/music segmentation), gender segmentation, and speech bandwidth segmentation. This automatic segmentation system can be easily integr...
Visual perception, speech perception and the understanding of perceived information are linked through complex mental processes. Gestures, as part of visual perception and synchronized with verbal information, are a key concept of human social interaction. Even when there is no physical contact (e.g., a phone conversation), humans still tend to exp...
One of the main problems with speech recognition for robots is noise. In this paper we propose two methods to enhance the robustness of continuous speech recognition in a noisy environment. We show that the accuracy of recognition can be improved by better weighting the language model in the decision process. The second proposed method is based on la...
This paper introduces a nonlinear function into the frequency spectrum that improves the detection of vowels, diphthongs, and semivowels within the speech signal. The lower efficiency of consonant detection was solved by implementing the hangover and hangbefore criteria. This paper presents a procedure for faster definition of those optimal constan...
This paper proposes a new method for calculating acoustic confusability between words for automatic speech recognition. Acoustic confusability is one of the key elements influencing speech recognition accuracy. The proposed method is based on Levenshtein distance, calculated on phonetic transcriptions from the speech recognizer's vocabulary. The me...
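A minimal sketch of the underlying idea, computing a confusability-style score from the normalized phone-level Levenshtein distance between two pronunciations, is shown below; the normalization and the example transcriptions are illustrative, not the paper's exact measure.

```python
def phone_edit_distance(p, q):
    """Levenshtein distance between two phone sequences (lists of phone symbols)."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        curr = [i]
        for j, b in enumerate(q, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]

def acoustic_confusability(p, q):
    """Map normalized phone-level edit distance to a [0, 1] similarity score:
    1.0 means identical pronunciations, 0.0 means maximally dissimilar."""
    return 1.0 - phone_edit_distance(p, q) / max(len(p), len(q), 1)

# e.g. two lexicon entries with SAMPA-like transcriptions
print(acoustic_confusability(["g", "r", "a", "d"], ["b", "r", "a", "t"]))  # 0.5
```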
This paper presents a framework for the efficient development and representation of morphological and phonetic lexicons, to be used in speech technology applications. Solutions that would be the most appropriate for developing speech technologies for a specific language have to be analyzed when developing the lexicons. In the paper issues such as the...
Non-verbal behavior performed by embodied conversational agents still appears “wooden” and sometimes even “unnatural”. Annotated corpora and high resolution annotations capturing the expressive details of movement, may improve the gradualness of synthetic behavior. This paper presents a non-functional, form-oriented annotation scheme based on infor...
General extenders are expressions such as in tako naprej 'and so on', pa to 'and such', pa tako 'and like that', ali pa nekaj takega 'or something like that'. The paper provides a survey of the forms and frequency of these expressions in various types of Slovene discourse and through a qualitative analysis sheds light on discursive roles of the mos...
This paper proposes a gradient-descent based unit selection optimization algorithm for the optimization of unit-cost function weights and for improving the overall performance of the unit-selection algorithm, as used in a corpus-based text-to-speech synthesis system. Complex multidimensional and fuzzy-logic based unit-cost functions are used in the...
This study presents a new online method for speaker segmentation and clustering in real-world environments. It analyses and discusses the difficulties of online speaker diarisation and proposes a new segmentation and clustering method, in which the Bayesian information criterion (BIC) and the normalised cross-likelihood ratio (NCLR) are combined in...
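As background for the BIC part of the method, the snippet below computes the standard ΔBIC between modelling two adjacent feature segments with one full-covariance Gaussian versus one per segment; it is a generic sketch (the NCLR stage and the paper's thresholds are omitted), and the penalty weight and feature dimensions are illustrative.

```python
import numpy as np

def delta_bic(x, y, penalty_lambda=1.0):
    """ΔBIC between modelling two adjacent feature segments with one full-covariance
    Gaussian versus one Gaussian per segment; ΔBIC > 0 suggests a speaker change.

    x, y: (frames, dims) arrays of acoustic features (e.g. MFCCs).
    """
    z = np.vstack([x, y])
    n, n1, n2, d = len(z), len(x), len(y), z.shape[1]

    def logdet_cov(a):
        sign, logdet = np.linalg.slogdet(np.cov(a, rowvar=False))
        return logdet

    data_term = 0.5 * (n * logdet_cov(z) - n1 * logdet_cov(x) - n2 * logdet_cov(y))
    complexity = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return data_term - penalty_lambda * complexity

rng = np.random.default_rng(0)
seg_a = rng.normal(0.0, 1.0, size=(300, 12))
seg_b = rng.normal(3.0, 1.0, size=(300, 12))   # clearly different "speaker"
print(delta_bic(seg_a, seg_b) > 0)             # expected: True
```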
In this paper the segmentation of the Aurora 2 database with three different types of models is presented. The segmentation is based on speech recognition results obtained by tests on the Aurora 2 database. Three types of tests are performed. In the first test the speech units are words (16 state HMMs) and in the second test the speech units are mo...
This paper presents speaker gender classification and segmentation. Such classification is frequently used in the broadcast news domain. Because pitch is a feature that is difficult to calculate reliably in a noisy environment, and because telephone speech is present in broadcast material, we focused on using general acoustic features for gender discrimi...
The ECESS consortium (European Center of Excellence in Speech Synthesis) aims to speed up progress in speech synthesis technology, by providing an appropriate evaluation framework. The key element of the evaluation framework is based on the partition of a text-to-speech synthesis system into distributed TTS modules. A text processing, prosody gener...
The presented paper deals with problems related to automatic recognition of expressive speech states. The analysis was performed using the Slovene part of the emotional speech corpus recorded under the international project Interface. The main focus of the presented emotional multi-level based feature extraction method is not the emotions themselves but rat...
This paper presents a novel feature for online speech/music segmentation based on the variance mean of filter bank energy (VMFBE). The idea that encouraged the feature’s construction is energy variation in a narrow frequency sub-band. The energy varies more rapidly, and to a greater extent for speech than for music. Therefore, an energy variance in...
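The sketch below illustrates the general idea of a variance-of-filter-bank-energy feature: per-band log energies are computed frame by frame, their variance is taken over a short context, and the band variances are averaged. For brevity it uses rectangular sub-bands instead of a mel filter bank, so the frame size, band count and context length are illustrative rather than the paper's settings.

```python
import numpy as np

def vmfbe_like_feature(signal, frame_len=400, hop=160, n_bands=20, context_frames=25):
    """Illustrative variance-of-band-energy feature: higher values are expected
    for speech (rapid energy modulation) than for steady music."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hanning(frame_len)
    energies = np.empty((n_frames, n_bands))
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        bands = np.array_split(power, n_bands)               # rectangular sub-bands
        energies[t] = np.log([b.sum() + 1e-10 for b in bands])
    feature = np.empty(n_frames)
    for t in range(n_frames):
        ctx = energies[max(0, t - context_frames):t + 1]
        feature[t] = ctx.var(axis=0).mean()                  # mean of per-band variances
    return feature

noise = np.random.default_rng(1).normal(size=16000)          # 1 s at 16 kHz, toy input
print(vmfbe_like_feature(noise).mean())
```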
This paper presents a combined pitch frequency (F0) determination and epoch (pitch period) marking procedure CPDMA using merged normalized forward–backward correlation. The algorithm consists of several processing steps: preprocessing of the input speech signal, voicing detection using artificial neural networks, F0 determination stage based on nor...
In this paper the influence of hangover and hangbefore criteria on automatic speech recognition is presented. A voice activity detection (VAD) algorithm is nowadays almost always part of automatic speech recognition systems. Hangover and hangbefore criteria can be integrated into the VAD algorithm after the basic VAD decision. Hangover and hangbefore criteri...
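A minimal sketch of how hangover and hangbefore criteria can be applied on top of frame-level VAD decisions is shown below; the frame counts are illustrative, not the paper's tuned values.

```python
def apply_hangover_hangbefore(vad_frames, hangbefore=5, hangover=8):
    """Extend raw frame-level VAD decisions: mark a few frames before each
    detected speech frame (hangbefore) and after it (hangover) as speech,
    so weak consonants at word boundaries are not clipped."""
    n = len(vad_frames)
    smoothed = list(vad_frames)
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            for j in range(max(0, i - hangbefore), min(n, i + hangover + 1)):
                smoothed[j] = 1
    return smoothed

raw = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(apply_hangover_hangbefore(raw, hangbefore=2, hangover=3))
```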
The paper analyses the influence of speech/non-speech segmentation on on-line and off-line speaker segmentation accuracy. On-line and off-line speaker segmentation approaches, together with speaker diarization, are briefly reviewed, and popular state-of-the-art test systems are presented. Both systems are tested on a given test set with and without...
This paper presents a novel feature group for on-line speech/music segmentation in the broadcast news domain. The features are based on the variance of mel-frequency cepstral coefficients (MFCCV). The idea behind the feature-group construction is the energy variation in a narrow frequency sub-band. The variation is greater for speech than for music. For feature...
This paper presents the results of a study on modeling the highly inflective Slovenian language. We focus on creating a language model for a large vocabulary speech recognition system. A new data-driven method is proposed for the induction of inflectional morphology into language modeling. The research focus is on data sparsity, which results from...
The paper presents a platform for web-based evaluation of TTS modules and systems, named RES (Remote Evaluation System). It is being developed within the European Centre of Excellence for Speech Synthesis (ECESS, www.ecess.eu). The presented platform will be used for web-based online evaluation of various text-to-speech (TTS) modules, and even complete T...
This paper describes the ECESS evaluation campaign of voice activity and voicing detection. Standard VAD classifies a signal into speech and non-speech; we extend it to VAD+ so that it classifies a signal as a sequence of non-speech, voiced and unvoiced segments. The evaluation is performed on a portion of the Spanish SPEECON database with manually l...
This paper addresses the topic of online unsupervised speaker segmentation in a complex audio environment as it is present in the Broadcast News databases. A new two stage speaker change detection algorithm is proposed, which combines the Bayesian Information Criterion with an ABLS-SCD statistical framework where adapted Gaussian mixture models are...
This paper addresses the topic of unsupervised speaker segmentation for automatic speech recognition in a complex real life environment like broadcast news domain. A statistical approach where a universal background model (UBM) is applied for online speaker segmentation was compared with the widely used Bayesian information criterion (BIC) approach...
In this paper we present research work that was carried out on the Slovenian BNSI Broadcast News database regarding speech bandwidth classification. Speech recorded in a studio environment has a frequency bandwidth of 8 kHz, while speech recorded over a telephone channel has a bandwidth of 3.1 kHz. Speech bandwidth classification enables us to use sepa...
The consortium ECESS (European Center of Excellence for Speech Synthesis) has set up a framework for the evaluation of software modules and tools relevant for speech synthesis. Until now, two lines of evaluation campaigns have been established: (1) Evaluation of the ECESS TTS modules (text processing, prosody, acoustic synthesis). (2) Evaluation of ECESS...
The paper presents a new Slovenian language resource, the Slovenian BNSI Broadcast News database. Its main goal is to produce the necessary language resources for Slovenian large vocabulary continuous speech recognition in an unrestricted domain. The BNSI Broadcast News database is a result of cooperation between the Faculty of Electrical Engin...
A language model is a description of language. Although grammar has been the prevalent tool in modelling language for a long time, interest has recently shifted towards statistical modelling. This chapter refers to speech recognition experiments, although statistical language models are applicable over a wide range of applications: machine translat...
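As a concrete, minimal example of a statistical language model, the sketch below builds an add-one smoothed bigram model; real recognizers use higher-order n-grams and more refined smoothing, so this is only a didactic illustration with made-up Slovenian sentences.

```python
from collections import Counter

class BigramModel:
    """A minimal add-one smoothed bigram model, the simplest statistical
    language model; real systems use higher orders and better smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def sentence_prob(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= self.prob(prev, word)
        return p

lm = BigramModel(["danes je lep dan", "jutri je nov dan"])
print(lm.sentence_prob("danes je nov dan"))
```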
This paper concerns the problem of automatic speech recognition in noise-intense and adverse environments. The main goal of the proposed work is the definition, implementation, and evaluation of a novel noise robust speech signal parameterization algorithm. The proposed procedure is based on time-frequency speech signal representation using wavelet...
This article presents a new unified approach to modeling grapheme-to-phoneme conversion for the PLATTOS Slovenian text-to-speech system. A cascaded structure consisting of several successive processing steps is proposed for the aim of grapheme-to-phoneme conversion. Processing foreign words and rules for the post-processing of phonetic transcriptio...
This paper presents a noise robust feature extraction algorithm NRFE using joint wavelet packet decomposition (WPD) and autoregressive (AR) modeling of a speech signal. In contrast to the short-time Fourier transform (STFT)-based time–frequency signal representation, wavelet packet decomposition can lead to a better representation of non-stationary...
In this article, we focus on creating a large vocabulary speech recognition system for the Slovenian language. Currently, state-of-the-art recognition systems are able to use vocabularies with sizes of 20,000 to 100,000 words. These systems have mostly been developed for English, which belongs to a group of uninflectional languages. Slovenian, as a...
This paper proposes a time and space-efficient architecture for a text-to-speech synthesis system (TTS). The proposed architecture can be efficiently used in those applications with unlimited domain, requiring multilingual or polyglot functionality. The integration of a queuing mechanism, heterogeneous graphs and finite-state machines gives a power...
Edge detection plays an important role in image analysis systems. We present a color selective edge detection technique, which consists of two image processing steps. The first step represents pixel-based color detection and the second progressive block-oriented edge detection. The combination of these two steps defines a selective edge detection t...
In this paper, we analyse three statistical models for the machine translation of Slovenian into English. All of them are based on the IBM Model 4, but differ in the type of linguistic knowledge they use. Model 4a uses only basic linguistic units of the text, i.e., words and sentences. In Model 4b, lemmatisation is used as a preprocessing step of t...
Embodied conversational agents employed in multimodal interaction applications have the potential to achieve similar properties as humans in face-to-face conversation. They enable the inclusion of verbal and nonverbal communication. Thus, the degree of personalization of the user interface is much higher than in other human-computer interfaces. Thi...
This paper focuses on acoustic modeling in speech recognition. A novel approach for building grapheme-based acoustic models by conversion from existing phoneme-based acoustic models is proposed. The grapheme-based acoustic models are created as a weighted sum of monophone acoustic models. The influence of a particular monophone is determined with t...
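The toy sketch below illustrates one way single-Gaussian monophone parameters could be combined into a grapheme model via a weighted, moment-matched combination; the weighting scheme, the two-dimensional parameters and the Slovenian grapheme example are hypothetical and do not reproduce the paper's conversion procedure.

```python
import numpy as np

def combine_monophone_gaussians(monophone_params, weights):
    """Toy combination of single-Gaussian monophone state parameters into one
    grapheme state as a convex combination; the weights would, in practice,
    reflect how often the grapheme is realized as each phoneme.

    monophone_params: dict phone -> (mean_vector, variance_vector)
    weights:          dict phone -> weight, summing to 1 for this grapheme
    """
    dim = len(next(iter(monophone_params.values()))[0])
    mean = np.zeros(dim)
    second_moment = np.zeros(dim)
    for phone, w in weights.items():
        mu, var = monophone_params[phone]
        mean += w * np.asarray(mu)
        second_moment += w * (np.asarray(var) + np.asarray(mu) ** 2)
    variance = second_moment - mean ** 2        # moment matching for the mixture
    return mean, variance

# Hypothetical example: a grapheme "e" realized as phones /E/ or /@/.
params = {"E": (np.array([1.0, 2.0]), np.array([0.5, 0.5])),
          "@": (np.array([0.0, 0.5]), np.array([0.4, 0.6]))}
g_mean, g_var = combine_monophone_gaussians(params, {"E": 0.7, "@": 0.3})
```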
This paper presents a rule-based method to determine emotion-dependent features, which are defined from high-level features derived from the statistical measurements of prosodic parameters of speech. Emotion-dependent features are selected from high-level features using extraction rules. The ratio of emotional expression similarity between two spea...
Word based statistical machine translation has emerged as a robust method for building machine translation systems. Inflective languages expose some problems with the approach. Data sparsity is one of them. It can be partly solved by enlarging the training corpus and/or including richer linguistic information: lemmas and morpho-syntactic feature...
Special purpose cameras are generally used in systems for image analysis because of the high quality of their captured images. However, their big drawback is their high price, which makes them inaccessible for low cost image analysis systems. In this paper we present a pixel-based method for color image segmentation, focused on segmentation of color images captured...
This paper deals with statistical machine translation. The quality of translation system strongly depends on characteristics of the training corpus. In this paper we address the problem of very sparse training corpora. In languages with a very rich morphology, learning methods suffer from a significant sparseness problem. We present and compare var...
This paper addresses the topic of defining phonetic broad classes needed during acoustic modeling for speech recognition in the procedure of decision tree based clustering. The usual approach is to use phonetic broad classes which are defined by an expert. This method has some disadvantages, especially in the case of multilingual speech recognition...
This paper describes a large scale experiment in which eight research institutions have tested their audio partitioning and labeling algorithms on the same data, a multi-lingual database of news broadcasts, using the same evaluation tools and protocols. The experiments have provided more insight into the cross-lingual robustness of the methods and t...
This paper focuses on the estimation of the Tilt intonation model (1). Usually, Tilt events are detected using a first estimation which is improved using gradient descent techniques. To speed up the search we propose to use a closed form expression for some of the Tilt parameters. The gradient descent search is used only for the time related para...
This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of t...
This paper presents a comparison between three different types of acoustic basic units in a speech recogniser. If statistical modeling is applied for speech recognition, allophones are usually used as basic units. In such a system, a module for grapheme to phoneme conversion is necessary, due to the inflectional nature of the Slovenian language. Ma...
Statistical language models encapsulate varied information, both grammatical and semantic, present in a language. This paper investigates various techniques for overcoming the difficulties in modelling highly inflected languages. The main problem is a large set of different words. We propose to model the grammatical and semantic information of word...
In multilingual text-to-speech synthesis systems, many external extensive natural language resources are used, especially in the text processing part. Therefore it is very important that representation of these resources is time and space efficient. It is also very important that language resources for new languages can be easily incorporated into...
This paper presents the ongoing work on crosslingual speech recognition in the MASPER initiative. Source acoustic models were transferred to two different target languages - Hungarian and Slovenian. Beside the monolingual source acoustic models, also a semi-multilingual set was defined. An expert-knowledge approach and a data-driven method were app...