György Szaszák

Budapest University of Technology and Economics · Department of Telecommunications and Media Informatics

PhD

About

82
Publications
15,844
Reads
546
Citations
Citations since 2017
33 Research Items
345 Citations
Additional affiliations
July 2014 - present
Budapest University of Technology and Economics
Position
  • Researcher
October 2012 - May 2014
Idiap Research Institute
Position
  • PostDoc Position

Publications (82)
Conference Paper
In forensic comparison, document classification techniques are used mainly for authorship classification and author profiling. In the present study, we aim to introduce paragraph vector modelling (by Doc2Vec) into the likelihood-ratio framework paradigm of forensic evidence comparison. Transcriptions of spontaneous speech recordings are used as in...
Article
This paper reviews the applied Deep Learning (DL) practices in the field of Speaker Recognition (SR), both in verification and identification. Speaker Recognition has been a widely used topic of speech technology. Many research works have been carried out and little progress has been achieved in the past 5–6 years. However, as Deep Learning techniq...
Conference Paper
Full-text available
Dysphonia can be caused not only by frequent use of the voice, but also by many other factors, including environmental noise, environmental pollution and a dry environment. Dysphonia can serve as an indicator of several more and less serious diseases. Therefore, a system that models the cognitive decision-making processes of an expert would be of great value...
Conference Paper
Transformer models have become the state of the art in natural language understanding, and their use for language modeling in Automatic Speech Recognition (ASR) is also promising. Although Transformer-based language models have been shown to improve ASR performance, their computational complexity makes their application in real-time systems quite challenging....
Chapter
Advanced neural network models have penetrated Automatic Speech Recognition (ASR) in recent years; however, in language modeling many systems still rely partly or entirely on traditional Back-off N-gram Language Models (BNLM). The reasons for this are the high cost and complexity of training and using neural language models, mostly possible by addin...
Preprint
Full-text available
Recently Deep Transformer models have proven to be particularly powerful in language modeling tasks for ASR. Their high complexity, however, makes them very difficult to apply in the first (single) pass of an online system. Recent studies showed that a considerable part of the knowledge of neural network Language Models (LM) can be transferred to t...
Preprint
Full-text available
Advanced neural network models have penetrated Automatic Speech Recognition (ASR) in recent years; however, in language modeling many systems still rely partly or entirely on traditional Back-off N-gram Language Models (BNLM). The reasons for this are the high cost and complexity of training and using neural language models, mostly possible by addin...
Article
In Automatic Speech Recognition (ASR), inserting the punctuation marks into the word chain hypothesis has long been given low priority, as efforts were concentrated on minimizing word error rates. Punctuation, however, also has a high impact on the transcription quality perceived by the users. Prosody, textual context and their combination have sin...
Conference Paper
Full-text available
Nowadays, artificial intelligence based solutions increasingly aim at machine understanding of spoken language. The preferred approach is to use automatic speech recognition (ASR) systems to create transcripts, which then undergo further text-based analysis. Machine transcripts may also contain word errors; these...
Preprint
This paper summarizes the applied deep learning practices in the field of speaker recognition, both verification and identification. Speaker recognition has been a widely used topic of speech technology. Many research works have been carried out and little progress has been achieved in the past 5-6 years. However, as deep learning techniques...
Conference Paper
Full-text available
Modeling the less constrained grammar and word order of conversational speech poses a great challenge to conventional back-off n-gram language models (BNLMs). Recurrent Neural Network Language Models (RNNLMs) can provide much better predictions; however, in real-time Automatic Speech Recognition (ASR) systems (e.g. speech dictation) the process del...
Conference Paper
Full-text available
Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. A Recurrent Neural Network Language Model (RNNLM) can provide a remedy for the high perplexity of the task; however, two-pass decoding introduces a considerable processing delay. In order to eliminate this delay...
Preprint
Full-text available
Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. A Recurrent Neural Network Language Model (RNNLM) can provide a remedy for the high perplexity of the task; however, two-pass decoding introduces a considerable processing delay. In order to eliminate this delay...
Article
Full-text available
Emerging Artificial Intelligence (AI) technology has brought machines to reach an equal or even superior level compared to human capabilities in several fields; nevertheless, among many other fields, making a computer able to understand human language still remains a challenge. When dealing with speech understanding, Automatic Speech Recognition (A...
Article
Full-text available
In recent years, several techniques to compensate for dialectal variations of the same base language in ASR have been proposed. Conservative retraining, transfer learning, multi-task training, matrix factorization, i-vector based techniques as well as adversarial and teacher-student training have been proposed for the specific purpose of ASR deep...
Conference Paper
Full-text available
Nowadays, sequence modelling based on recurrent neural networks has proven to be an effective tool for solving several problems in the field of natural language processing (NLP). These include the machine restoration of punctuation marks, i.e. automatic punctuation, in which the word and/or acoustic event sequence...
Conference Paper
Full-text available
Punctuation of ASR-produced transcripts has received increasing attention in recent years; RNN-based sequence modelling solutions which exploit textual and/or acoustic features show encouraging performance. Switching the focus from the technical side, qualifying and quantifying the benefits of such punctuation from the end-user perspective have not...
Conference Paper
Full-text available
The sequence-to-sequence modelling paradigm has been successfully used in automatic punctuation of text generated by Automatic Speech Recognizers (ASR), using bidirectional Recurrent Neural Networks (RNN), which map the word and/or acoustic event sequence to the punctuation sequence. The current paper proposes to enhance the word sequence-based sys...
Conference Paper
Full-text available
Automatic document summarization is used to extract or generate short, but rich snippets which represent well the essential meaning and key information contained in documents. Classical approaches of extractive summarization mostly rely on bag-of-words models or a graph representation reflecting word neighbourhood information to obtain a ranking of...
Article
The detection of prosodic events, prosodic stress, and speech segmentation based on prosody have received much attention in the research community in the past decades. Prosody is relevant for both main areas of speech technology, text-to-speech synthesis and automatic speech recognition and understanding, and is exploited increasingly: besides prov...
Conference Paper
Full-text available
The output of automatic speech recognition (ASR) systems usually contains no punctuation marks, although these decisively influence the meaning and interpretability of a text. Recurrent neural networks (RNNs) have recently been applied successfully to the problem of punctuation restoration. In real-time ASR output (e.g. television captions)...
Conference Paper
Full-text available
Automatic Speech Recognition and Understanding (ASRU) systems can generally use temporal and situational context information to improve their performance for a given task. This is typically done by rescoring the ASR hypotheses or by dynamically adapting the ASR models. For some domains such as Air Traffic Control (ATC), this context information can...
Conference Paper
Full-text available
Automatic Speech Recognition (ASR) rarely addresses the punctuation of the obtained transcriptions. Recently, Recurrent Neural Network (RNN) based models were proposed in automatic punctuation exploiting wide word contexts. In real-time ASR tasks such as closed captioning of live TV streams, text based punctuation poses two particular challenges: a...
Conference Paper
Full-text available
Automatic classification methods are frequently used in early diagnosis of different diseases that affect speech production. These methods can also be applied to identify speech samples from patients affected by Parkinson's disease (PD) or depressive disorder (DD). This paper is interested in applying automatic stress detection and prosodic phrasin...
Conference Paper
Full-text available
Closed captioning is a common method to improve accessibility of TV programs for people who are hearing impaired or hard of hearing, while representing an application relevant for cognitive infocommunication. However, live captions provided by automatic speech recognition systems usually lack punctuation, making them hard to follow. In this paper,...
Conference Paper
Full-text available
Automatic Speech Recognition (ASR) can introduce higher levels of automation into Air Traffic Control (ATC), where spoken language is still the predominant form of communication. While ATC uses standard phraseology and a limited vocabulary, we need to adapt the speech recognition systems to local acoustic conditions and vocabularies at each airport...
Conference Paper
Full-text available
Automatic classification methods are frequently used in the early, speech-based diagnosis of diseases. These methods can also be applied to distinguish patients with Parkinson's disease or depression from members of the healthy control group. Pathological speech can be analysed at several levels; in this paper...
Conference Paper
Full-text available
This paper addresses speech summarization of highly spontaneous speech. The audio signal is transcribed using an Automatic Speech Recognizer, which operates at relatively high word error rates due to the complexity of the recognition task and high spontaneity of speech. An analysis is carried out to assess the propagation of speech recognition erro...
Conference Paper
Full-text available
The Weighted Correlation based Atom Decomposition (WCAD) is a recently proposed physiological intonation model that decomposes the pitch contour into elementary components — atoms. Since these atoms are said to correspond to laryngeal muscle activation, in theory they could be used to infer higher linguistic meaning from the pitch contour. One such...
Conference Paper
Full-text available
Weighted Correlation based Atom Decomposition (WCAD) algorithm is a technique for intonation modelling that uses a matching pursuit framework to decompose the F0 contour into a set of basic components, called atoms. The atoms attempt to model the physiological activation of the laryngeal muscles responsible for changes in F0. Recently, WCAD has bee...
Conference Paper
Stress annotations in the training corpus of speech synthesis systems are usually obtained by applying language rules to the transcripts. However, the actual stress patterns seen in the waveform are not guaranteed to be canonical; they can deviate from the locations defined by language rules. This is driven mostly by speaker-dependent factors. Therefor...
Conference Paper
Full-text available
Since the prosody of a spoken utterance carries information about its discourse function, salience, and speaker attitude, prosody models and prosody generation modules have played a crucial part in text-to-speech (TTS) synthesis systems from the beginning, especially those set not only on sounding natural, but also on showing emotion or particular...
Conference Paper
This paper addresses speech summarization of highly spontaneous speech. Speech is converted into text using an ASR, then segmented into tokens. Human made and automatic, prosody based tokenization are compared. The obtained sentence-like units are analysed by a syntactic parser to help automatic sentence selection for the summary. The preprocessed...
Conference Paper
Full-text available
In this paper we present a real-time, low-resource speech-to-text system, developed primarily for captioning televised public-affairs conversational speech. We also compare our solution with the decoder of Kaldi, the most widely used open-source framework in the field. In addition, various database...
Conference Paper
Full-text available
The study outlined in this paper focuses on determining the extent of syntactic analysis that the magyarlánc language analyser can perform on error-laden texts output by a speech recognizer, how closely this analysis resembles the one run on the error-free reference text, and whether a level of the analysis can be identified...
Conference Paper
Full-text available
Several studies use idealized, fluent utterances to comprehend spoken language. Disfluencies are often regarded as mere noise in the speech flow. Other works argue that fragmented structures (disfluencies, silent and filled pauses) are important and can aid understanding. By extending the original concept of speech disfluency, the curr...
Conference Paper
Full-text available
Generating proper and natural sounding prosody is one of the key interests of today’s speech synthesis research. An important factor in this effort is the availability of a precisely labelled speech corpus with adequate prosodic stress marking. Obtaining such a labelling constitutes a huge effort, whereas interannotator agreement scores are usually...
Conference Paper
Full-text available
In this paper, the application of LVCSR (Large Vocabulary Continuous Speech Recognition) technology is investigated for real-time, resource-limited broadcast closed captioning. The work focuses on transcribing live broadcast conversation speech to make such programs accessible to deaf viewers. Due to computational limitations, real time factor (RTF)...
Article
Information extraction from written or spoken archives is a challenging infocommunication task, especially if a deep automatic analysis of the information structure is also targeted. The present research investigates focus detection approaching from an automatic analysis point of view for text (NLP) and speech (prosody) modalities. Deep syntactic a...
Article
Full-text available
This letter assesses the pros and cons of using an overall continuous F0 estimate versus a disrupted F0 estimate that is not defined throughout, and compares formal and informal speech styles in this regard. In an automatic phonological phrasing task, three alternatives of F0 post-processing are compared, ranging from a natural F0 contour disrupted a...
Conference Paper
Full-text available
This is an overview of a Joint Research Project within the Scientific co-operation between Eastern Europe and Switzerland (SCOPES) Program of the Swiss National Science Foundation (SNSF) and Swiss Agency for Development and Cooperation (SDC). Within the SP2 SCOPES Project on Speech Prosody, in the course of the following two years, the four partner...
Article
Full-text available
The present paper investigates automatic prosodic phrasing of spontaneous speech: a two-step segmentation technique is presented, based on unsupervised learning. In the first step, the Intonational Phrases (IP) are detected automatically based on speech energy, spectral centroid and a double-thresholding technique. In the second step, Phonological...
Conference Paper
The aim of this research is to segment spontaneous speech using an unsupervised learning technique. We are especially interested from a machine perception or detection point-of-view, and focus on revealing some structure of prosody in spontaneous speech. The BEA spontaneous speech database is used to develop a speech segmentation system. The sponta...
Conference Paper
In this paper, use of intra- and crosslingual adaptation is addressed in cognitive infocommunication, for an ASR application in a bilingual environment. State-of-the-art linear regression based adaptation approaches are evaluated after a brief theoretical overview of the applied techniques. As expected, these contribute to significant improvement w...
Article
This paper investigates the usage of prosody for the improvement of keyword spotting, focusing on the highly agglutinating Hungarian language, where keyword spotting cannot be effectively performed using LVCSR, as such systems are either unavailable or hard to operate due to high OOV rates and poor N-gram language modelling capabilities. Therefore,...
Article
Full-text available
The relation between syntax and prosody is evident, even if the prosodic structure cannot be directly mapped to the syntactic one and vice versa. Syntax-to-prosody mapping is widely used in text-to-speech applications, but prosody-to-syntax mapping is mostly missing from automatic speech recognition/understanding systems. This paper presents an expe...
Conference Paper
Speech prosody and speech syntax are closely related, and this correspondence - syntax to prosody mapping - is exploited in text-to-speech infocommunication applications. However, in automatic speech recognition and understanding based inter-cognitive infocommunication, the use of prosody to syntax mapping is mostly restricted to minimal pair disam...
Conference Paper
Dealing with spontaneous speech constitutes a big challenge for both linguists and speech technology engineers. For read speech, prosody was assessed in earlier experiments as an automatic decomposition into phonological phrases using a supervised method (HMM). However, when trying to adapt this automatic approach for spontaneous speech, the cluste...
Article
Full-text available
The paper intends to give a brief summary of one of the most recent efforts on building the pan-European language technology infrastructure: META-NET, a Network of Excellence consisting of 54 research centres from 33 countries, and specifically its Central and South-European participating project, CESAR. One of the major activities of the project i...
Article
Full-text available
This study presents a preliminary investigation into the automatic assessment of language-impaired children's (LIC) prosodic skills in one grammatical aspect: sentence modalities. Three types of language impairments were studied: autism disorder (AD), pervasive developmental disorder-not otherwise specified (PDD-NOS), and specific language impairme...
Conference Paper
Full-text available
Automatic recognition of spontaneous speech (speech-to-text transformation) is one of the most challenging tasks today. Whilst the recognition of read or formal speech (e.g. dictation) is possible for several languages with high accuracy rates that also allow commercial exploitation, the automatic recognition of spontaneous speech is a harder task, yield...
Conference Paper
Full-text available
Prosody and syntax are highly related, even if the prosodic structure cannot be directly mapped to the syntactic one and vice versa. This paper presents an experiment exploring to what degree a powerful HMM-based automatic prosodic segmentation tool can recover the syntactic structure of an utterance in speech understanding systems. Results sho...
Article
This paper analyzes the nature of the process involved in optional vowel reduction in Hungarian, and the acoustic structure of schwa variants in spontaneous speech. The study focuses on the acoustic patterns of both the basic realizations of Hungarian vowels and their realizations as neutral vowels (schwas), as well as on the design, implementation...
Article
In this paper acoustic processing and modelling of the supra-segmental characteristics of speech is addressed, with the aim of incorporating advanced syntactic and semantic level processing of spoken language for speech recognition/understanding tasks. The proposed modelling approach is very similar to the one used in standard speech recognition, w...
Conference Paper
Full-text available
been prepared, in which we were searching for the possibility to contribute to the higher linguistic processing levels of ASR – at syntactic, and semantic level – by acoustical pre-processing of the supra-segmental (prosodic) features. The subject of our current article is a semantic level processing, built on supra-segmental parameters. HMM models...
Chapter
Full-text available
In this chapter we examine the usage of prosodic features in speech recognition, with special attention paid to agglutinating and fixed stress languages. Current knowledge in speech prosody exploitation is addressed in the introduction. The used prosodic features, acoustic-prosodic pre-processing, and segmentation in terms of prosodic units are...
Conference Paper
Full-text available
In our paper we examine the usage of prosodic features in speech recognition, with special attention paid to agglutinating and fixed stress languages. The used prosodic features, acoustic-prosodic pre-processing, and segmentation in terms of prosodic units are presented in detail. We use the expression "prosodic unit" in order to make a difference...
Conference Paper
Full-text available
Prosodic Cues for Automatic Phrase Boundary Detection in ASR KLÁRA VICSI AND GYÖRGY SZASZÁK Budapest University for Technology and Economics, Dept. for Telecommunications and Mediainformatics, Budapest, Hungary. ABSTRACT This article presents a cross-lingual study for Hungarian and Finnish about the segmentation of continuous speech on word an...
Article
Full-text available
This article presents a cross-lingual study for Hungarian and Finnish about the segmentation of continuous speech on word and phrasal level by examination of supra-segmental parameters. A word-level segmenter has been developed which can indicate the word boundaries with acceptable precision for both languages. The ultimate aim is to increase...
Conference Paper
Full-text available
Studies of pronunciation variation have two aims: to extend our phonetic and linguistic knowledge, and to add variation models to automatic speech recognisers (ASR) to improve recognition accuracy. By examining pronunciation variation in Hungarian on the corpus of the Hungarian Telephone Speech Database (MTBA), which contains semi-automatical...
Article
Full-text available
This paper presents the ongoing work on crosslingual speech recognition in the MASPER initiative. Source acoustic models were transferred to two different target languages, Hungarian and Slovenian. Besides the monolingual source acoustic models, a semi-multilingual set was also defined. An expert-knowledge approach and a data-driven method were app...
Article
Full-text available
A development tool (MKBF 1.0) for constructing continuous speech recognizers has been created under Windows XP. The system is based on a statistical approach (HMM phoneme models, and bi-gram language models with non-linear smoothing) and works in real time. The tool is able to construct a medium-sized speech recognizer with a vocabulary of 1000-200...
Article
Full-text available
A continuous speech recognition development environment (MKBP 1.0) running under Windows XP and built on statistical principles has been developed at the Laboratory of Speech Acoustics; it is suitable for training on and recognizing various medium-vocabulary texts of 1,000 to 10,000 words. Within a three-year project that started in early 2004, the Lab...
Article
Full-text available
This paper describes sentence, phrase and word boundary detection based on prosodic features, implemented in an HMM-based prosodic segmentation tool. Integrated into a speech recognizer, an N-best rescoring is performed based on the output of the prosodic segmenter, which determines the prosodic structure of the utterance. In an ultrasonography tas...
Article
Full-text available
This paper presents the work on crosslingual speech recognition carried out by the MASPER initiative that was formed as a part of the COST 278 Action. Two different approaches for transferring monolingual source acoustic models to a new language were compared. The first one was expert-driven, based on the IPA scheme. The second was data-driven, base...