Stanisław Kacprzak

  • PhD
  • Professor (Assistant) at AGH University of Krakow

About

  • 25 Publications
  • 51,944 Reads
  • 281 Citations
Current institution
AGH University of Krakow
Current position
  • Professor (Assistant)
Additional affiliations
April 2020 - present
AGH University of Krakow
Position
  • Professor (Assistant)
March 2017 - March 2020
AGH University of Krakow
Position
  • Research Assistant
Education
October 2012 - January 2020
AGH University of Krakow
Field of study
  • Computer Science
October 2005 - March 2011
Lodz University of Technology
Field of study
  • Computer Science

Publications (25)
Conference Paper
Full-text available
In this paper we propose applying a new feature, called Minimum Energy Density (MED), to discriminate between speech and music in audio signals. Our method is based on the analysis of local energy in 1 or 2.5 second audio signals. An elementary analysis of the probability of the power distribution is an effective tool supporting the decision m...
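The snippet does not define MED precisely; as a rough illustration only, here is a minimal stdlib sketch of a low-energy-frame measure in this spirit (the function names, frame sizes, and the 10% threshold are all assumptions, not the paper's method):

```python
import math

def frame_energies(signal, frame_len=400, hop=200):
    """Short-time energy per frame (frame_len and hop in samples)."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def minimum_energy_density(signal, frame_len=400, hop=200):
    """Fraction of frames whose energy falls below a threshold tied to
    the mean energy (the 0.1 factor is an illustrative assumption)."""
    energies = frame_energies(signal, frame_len, hop)
    mean_e = sum(energies) / len(energies)
    threshold = 0.1 * mean_e
    return sum(1 for e in energies if e < threshold) / len(energies)
```

Speech, with its natural pauses, yields more near-silent frames than continuous music, so a score of this kind can help separate the two classes.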
Preprint
Full-text available
Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that a...
Preprint
Full-text available
Prediction of a speaker's height is of interest for voice forensics, surveillance, and automatic speaker profiling. Until now, TIMIT has been the most popular dataset for training and evaluating height estimation methods. In this paper, we introduce HeightCeleb, an extension of VoxCeleb, which is the dataset commonly used in speaker recognitio...
Preprint
The aim of speech enhancement is to improve speech signal quality and intelligibility from a noisy microphone signal. In many applications, it is crucial to enable processing with low computational complexity and minimal requirements for access to future signal samples (look-ahead). This paper presents a signal-based causal DCCRN that improve...
Conference Paper
Full-text available
GlossoVR is a virtual reality (VR) application that combines training in public speaking in front of a virtual audience and in voice emission in relaxation exercises. It is accompanied by digital signal processing (DSP) and artificial intelligence (AI) modules which provide automatic feedback on the vocal performance as well as the behavior and psy...
Preprint
In classification tasks, the classification accuracy diminishes when the data is gathered in different domains. To address this problem, in this paper, we investigate several adversarial models for domain adaptation (DA) and their effect on the acoustic scene classification task. The studied models include several types of generative adversarial ne...
Conference Paper
Full-text available
A new VR application for voice and speech training has emerged from a problem observable in everyday life: anxiety about public speaking. In the design process, we incorporated both the domain knowledge of experts and research with end-users in order to explore the needs and the context of the problem. Functionalities of the prototype are the ef...
Article
Full-text available
Phones for 239 non-annotated languages were selected by automatic segmentation based on changes of energy in the time-frequency representation of speech signals. Phone boundaries were set at locations of relatively major changes in the energy distribution between seven frequency bands. A vector of average energies calculated for eleven frequency bands w...
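As a hedged sketch of the boundary rule described here (three bands instead of seven, and an L1 distance with an arbitrary threshold, purely for illustration):

```python
def band_distribution(frame_bands):
    """Normalize a frame's per-band energies into a distribution."""
    total = sum(frame_bands) or 1.0  # guard against all-zero frames
    return [e / total for e in frame_bands]

def detect_boundaries(band_energies, threshold=0.5):
    """Mark a boundary between consecutive frames whose band-energy
    distributions differ by more than `threshold` in L1 distance.
    The threshold value is an illustrative assumption."""
    boundaries = []
    prev = band_distribution(band_energies[0])
    for i in range(1, len(band_energies)):
        cur = band_distribution(band_energies[i])
        dist = sum(abs(a - b) for a, b in zip(prev, cur))
        if dist > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```

A large shift in how energy is distributed across bands is taken as a candidate phone boundary, which matches the rule stated in the abstract without claiming its exact parameters.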
Conference Paper
Full-text available
In this paper, we examine the use of i-vectors both for age regression and for age classification. Although i-vectors have previously been used for the age regression task, we extend this approach by applying a fusion of i-vector and acoustic-feature regression to estimate speaker age. With this fusion we obtain a relative improvement of 12.6%...
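The abstract does not specify the fusion scheme; one minimal late-fusion reading, with an assumed weight rather than the paper's value, would be:

```python
def fuse_age_estimates(ivector_pred, acoustic_pred, weight=0.6):
    """Late fusion: weighted average of the two branches' age estimates.
    The 0.6 weight favouring the i-vector branch is an illustrative
    choice, not taken from the paper."""
    return weight * ivector_pred + (1.0 - weight) * acoustic_pred
```

In practice the weight would be tuned on held-out data so the stronger branch dominates the combined estimate.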
Conference Paper
Automatic segmentation and parametrization based on frequency analysis were compared with manually annotated phones. Phone boundaries were set at places of relatively large changes in the energy distribution between frequency bands. Frequency parametrization and clustering enabled the division of phones into groups (cluster...
Article
Full-text available
A comparative analysis of multi-language speech samples is conducted using acoustic characteristics of phoneme realisations in spoken languages. Different approaches to the investigation of phonemic diversity in the context of language evolution are compared and discussed. We introduce our approach (materials and methods) and present preliminary res...
Article
Full-text available
The paper presents the possibility of automatic speech processing to determine the acoustic similarity between phones. Subsequent processing steps on the recorded speech signal result in phone segmentation, even without prior knowledge of phone boundaries. The use of frequency signal parameterization and clustering algorithms facilitates a d...
Conference Paper
Full-text available
The results of an investigation of the differences among the phonemes of 574 languages from all over the world are presented. We attempt to verify the hypothesis of an African origin for all languages and of gradual language diversification across other parts of the globe. The obtained results justify classifying languages by applying the methods used in evolu...

Questions (6)
Question
From what I have found (according to a comparison from 2007), the best ratio is achieved by YULS. Is this still the case?
Question
As I'm new to the topic, I'm looking for information on benchmark corpora that can be obtained (not necessarily free) for audio event classification or computational auditory scene analysis.
I'm especially interested in house/street sounds.
Question
I'm looking for a fast DPGMM (Dirichlet Process Gaussian Mixture Model) implementation suited to a large number of observations.
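For reference, scikit-learn ships a truncated variational approximation of a DP-GMM; a minimal example on toy data (the component cap and the 0.05 weight cutoff are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two well-separated Gaussian clusters as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2))])

# Truncated variational DP mixture: n_components is only an upper
# bound; the Dirichlet-process prior prunes unused components.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

# Components with non-negligible weight approximate the inferred
# number of clusters.
effective = int(np.sum(dpgmm.weights_ > 0.05))
```

The variational approach scales much better than MCMC samplers for large observation counts, at the cost of truncating the nonparametric prior.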
Question
I'm doing research on clustering speech utterances based on language. It seems to me that the only article dealing with such a problem is:
Reynolds, Douglas A., et al. "Blind clustering of speech utterances based on speaker and language characteristics." ICSLP. 1998.
Perhaps someone here is familiar with more recent work or is also working on this problem?
