
Stanisław Kacprzak
- PhD
- Professor (Assistant) at AGH University of Krakow
About
25 Publications
51,944 Reads
281 Citations
Introduction
Current institution
AGH University of Krakow
Additional affiliations
April 2020 - present
March 2017 - March 2020
Education
October 2012 - January 2020
October 2005 - March 2011
Publications (25)
In this paper we propose applying a new feature, called Minimum Energy Density (MED), to the discrimination of audio signals between speech and music. Our method is based on the analysis of local energy in 1- or 2.5-second audio signals. An elementary analysis of the probability of the power distribution is an effective tool supporting the decision m...
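As an illustration of the local-energy idea behind MED, the sketch below computes the minimum frame energy normalized by the mean frame energy over a short excerpt; the frame length, hop, and normalization are illustrative assumptions, not the paper's exact definition.

# Minimal sketch of a local-energy feature in the spirit of MED.
# Frame/hop sizes and the normalization are assumptions for illustration.
import numpy as np

def min_energy_density(signal, sr, frame_ms=20, hop_ms=10):
    """Minimum frame energy normalized by the mean frame energy."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = np.array([
        np.sum(signal[i:i + frame] ** 2)
        for i in range(0, len(signal) - frame + 1, hop)
    ])
    return energies.min() / (energies.mean() + 1e-12)

# Intuition: speech contains pauses with near-zero local energy, while music
# tends to sustain energy, so a low value suggests speech.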
Hallucinations of deep neural models are amongst the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that a...
Prediction of a speaker's height is of interest for voice forensics, surveillance, and automatic speaker profiling. Until now, TIMIT has been the most popular dataset for training and evaluation of height estimation methods. In this paper, we introduce HeightCeleb, an extension to VoxCeleb, which is the dataset commonly used in speaker recognitio...
The aim of speech enhancement is to improve speech signal quality and intelligibility from a noisy microphone signal. In many applications, it is crucial to enable processing with small computational complexity and minimal requirements regarding access to future signal samples (look-ahead). This paper presents a signal-based causal DCCRN that improve...
GlossoVR is a virtual reality (VR) application that combines training in public speaking in front of a virtual audience and in voice emission in relaxation exercises. It is accompanied by digital signal processing (DSP) and artificial intelligence (AI) modules which provide automatic feedback on the vocal performance as well as the behavior and psy...
In classification tasks, the classification accuracy diminishes when the data is gathered in different domains. To address this problem, in this paper, we investigate several adversarial models for domain adaptation (DA) and their effect on the acoustic scene classification task. The studied models include several types of generative adversarial ne...
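For readers unfamiliar with adversarial domain adaptation, one common building block is the gradient reversal layer from DANN (Ganin & Lempitsky). The paper studies several GAN-based models; this PyTorch sketch shows only the gradient-reversal mechanism as an illustration, not the models evaluated in the paper.

# Gradient reversal layer: identity on the forward pass, negated (scaled)
# gradient on the backward pass, so the feature extractor learns
# domain-invariant features while a domain classifier tries to tell
# domains apart.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)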
A new VR application for voice and speech training has emerged from a problem observable in everyday life: the anxiety of public speaking. In the design process, we incorporated both the domain knowledge of experts and research with end-users in order to explore the needs and the context of the problem. Functionalities of the prototype are the ef...
Phones for 239 non-annotated languages were selected by automatic segmentation based on changes of energy in the time-frequency representation of speech signals. Phone boundaries were set at locations of relatively major changes in the energy distribution between seven frequency bands. A vector of average energies calculated for eleven frequency bands w...
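A minimal sketch of this kind of boundary detection is given below: band energies are computed from an STFT, and candidate boundaries are placed where the between-band energy distribution changes sharply. The band edges, distance measure, and peak-picking threshold are illustrative assumptions, not the published settings.

# Candidate phone boundaries from changes in the between-band energy
# distribution (illustrative parameters throughout).
import numpy as np
from scipy.signal import stft, find_peaks

def band_energy_boundaries(signal, sr, n_bands=7, thresh=0.2):
    f, t, Z = stft(signal, fs=sr, nperseg=512, noverlap=384)
    power = np.abs(Z) ** 2
    # Split the spectrum into n_bands equal-width bands; sum energy per band.
    bands = np.array_split(power, n_bands, axis=0)
    E = np.stack([b.sum(axis=0) for b in bands])        # (n_bands, n_frames)
    P = E / (E.sum(axis=0, keepdims=True) + 1e-12)      # energy distribution
    # L1 distance between distributions of consecutive frames.
    d = np.abs(np.diff(P, axis=1)).sum(axis=0)
    peaks, _ = find_peaks(d, height=thresh)
    return t[peaks + 1]  # candidate boundary times in seconds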
In this paper, we examine the use of i-vectors both for age regression and for age classification. Although i-vectors have previously been used for the age regression task, we extend this approach by applying a fusion of i-vector and acoustic-feature regression to estimate the speaker's age. With this fusion we obtain a relative improvement of 12.6%...
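The abstract does not spell out the fusion scheme in full; as one plausible reading, the sketch below trains separate regressors on i-vectors and on acoustic features and fuses their predictions with a weighted average (late fusion), using scikit-learn's Ridge as a stand-in regressor.

# Late fusion of an i-vector regressor and an acoustic-feature regressor.
# Ridge and the fusion weight w are illustrative choices.
import numpy as np
from sklearn.linear_model import Ridge

def fit_fused_age_regressors(ivectors, acoustic_feats, ages):
    m_iv = Ridge(alpha=1.0).fit(ivectors, ages)
    m_ac = Ridge(alpha=1.0).fit(acoustic_feats, ages)
    return m_iv, m_ac

def predict_age(m_iv, m_ac, ivectors, acoustic_feats, w=0.5):
    # Weighted average of the two predictions; w would be tuned on
    # held-out data.
    return w * m_iv.predict(ivectors) + (1 - w) * m_ac.predict(acoustic_feats)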
Automatic segmentation and parametrization based on frequency analysis were used for comparison with manually annotated phones. Phone boundaries were fixed at places of relatively large changes in the energy distribution between the frequency bands. Frequency parametrization and clustering enabled the division of phones into groups (cluster...
A comparative analysis of multi-language speech samples is conducted using acoustic characteristics of phoneme realisations in spoken languages. Different approaches to the investigation of phonemic diversity in the context of language evolution are compared and discussed. We introduced our approach (materials and methods) and presented preliminary res...
The paper presents the possibility of automatic speech processing to determine the acoustic similarity between phones. Subsequent processing steps on the recorded speech signal result in the segmentation of phones, even without prior knowledge of their boundaries. The use of frequency-domain signal parameterization and clustering algorithms facilitates a d...
The results of an investigation of the differences among the phonemes of 574 languages from all over the world are presented. We attempt to verify the hypothesis of an African origin of all languages and of gradual language diversification across other parts of the globe. The obtained results justify the classification of languages by applying the methods used in evolu...
Questions (6)
I'm interested in good-quality speech with no noise.
From what I found (according to a comparison from 2007), the best ratio is achieved by YULS. Is this still the case?
As I'm new to the topic, I'm looking for information on benchmark corpora that can be obtained (not necessarily free) for audio event classification or computational auditory scene analysis.
I'm especially interested in house/street sounds.
I'm looking for a fast DPGMM (Dirichlet Process Gaussian Mixture Model) implementation intended for a large number of observations.
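As one readily available starting point, scikit-learn ships a variational (truncated) Dirichlet-process GMM as BayesianGaussianMixture; it may not be fast enough for very large datasets, but it is easy to try first.

# Truncated variational DP-GMM in scikit-learn.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.randn(10000, 20)  # placeholder for real observations
dpgmm = BayesianGaussianMixture(
    n_components=50,  # truncation level: an upper bound on used components
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
)
labels = dpgmm.fit_predict(X)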
I'm doing research on clustering speech utterances based on language. It seems to me that the only article dealing with such a problem is:
Reynolds, Douglas A., et al. "Blind clustering of speech utterances based on speaker and language characteristics." ICSLP, 1998.
Maybe someone here is familiar with more recent work or is also working on this problem?