Javier Hernando

Universitat Politècnica de Catalunya | UPC · Department of Signal Theory and Communications (TSC)

Ph.D. Telecommunication Engineer

About

260
Publications
27,327
Reads
2,673
Citations
Additional affiliations
January 1989 - present
Universitat Politècnica de Catalunya
Position
  • Professor (Full)

Publications

Publications (260)
Preprint
Full-text available
This report describes the submission from Technical University of Catalonia (UPC) to the VoxCeleb Speaker Recognition Challenge (VoxSRC-20) at Interspeech 2020. The final submission is a combination of three systems. System-1 is an autoencoder based approach which tries to reconstruct similar i-vectors, whereas System-2 and -3 are Convolutional Neu...
Article
In the context of object detection, sliding-window classifiers and single-shot convolutional neural network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-maximum suppression (NMS) is the process of selecting a single representative candid...
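The NMS procedure described above can be illustrated with the textbook greedy, IoU-based variant (a minimal sketch; the function name and threshold are illustrative and not taken from the paper):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of the kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]       # candidates sorted by score, best first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop candidates that overlap the kept box above the threshold.
        order = order[1:][iou < iou_thresh]
    return keep
```

Each iteration keeps the highest-scoring remaining window and suppresses its near-duplicates, which is exactly the "single representative candidate" selection the abstract refers to.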
Preprint
Full-text available
The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their...
Preprint
Full-text available
Most state-of-the-art Deep Learning systems for speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention pooling,...
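The pooling layer mentioned above can be sketched as single-head attentive pooling, a simplified relative of the Double Multi-Head Attention pooling the abstract proposes (the function name, dimensions and parameters below are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(H, W, v):
    """Attentive pooling of frame-level features into a fixed-size vector.

    H: (T, D) sequence of frame embeddings from the front-end.
    W: (D, A) projection and v: (A,) attention vector (learned in practice).
    Returns a (D,) utterance-level speaker embedding.
    """
    scores = np.tanh(H @ W) @ v          # one relevance score per frame
    alphas = softmax(scores)             # attention weights, sum to 1
    return alphas @ H                    # weighted average over time

# Toy usage: a 50-frame utterance with 16-dim features pooled to one vector.
rng = np.random.default_rng(0)
T, D, A = 50, 16, 8
H = rng.standard_normal((T, D))
emb = attentive_pool(H, rng.standard_normal((D, A)), rng.standard_normal(A))
```

The point of the attention weights is that informative frames contribute more to the fixed-length vector than silence or noise, regardless of utterance length.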
Chapter
Speaker Recognition (SR) assumes that everyone has a unique voice, which can be used as an identity instead of, or in addition to, other identifiers such as fingerprint, face, or iris. Although steps toward applying neural networks in SR were taken long ago, recent advances in computing hardware, new deep learning (DL) architectures and t...
Article
The AXIOM project aims at providing an environment for Cyber-Physical Systems. Smart Video Surveillance targets public environments, involving real-time face detection in crowds. Smart Home Living targets home environments and access control. These applications are used as experimental use cases for the AXIOM platform, currently based on the Xilinx...
Conference Paper
Over the last years, i-vectors have been the state-of-the-art approach in speaker recognition. Recent improvements in deep learning have increased the discriminative quality of i-vectors. However, deep learning architectures require a large amount of labeled background data which is difficult in practice. The aim of this paper is to propose an alte...
Article
Full-text available
Restricted Boltzmann Machines (RBMs) have shown success in both the front-end and backend of speaker verification systems. In this paper, we propose applying RBMs to the front-end for the tasks of speaker clustering and speaker tracking in TV broadcast shows. RBMs are trained to transform utterances into a vector based representation. Because of th...
Preprint
Full-text available
Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments and those are averaged to obtain an utterance level speaker representation. In this work we propose the use of an attention mechanism to o...
Article
Full-text available
Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems. The other factors include the techniques employed to perform both segmentation and clustering. While the static mel frequency cepstral coefficient...
Conference Paper
At a dynamic construction site, conversations convey vital information, including construction activities, operation status, and task performance. Although recording the entire conversations of a construction project is currently somewhat restricted for information-security reasons, establishing a framework to capture and analyze construction con...
Conference Paper
Full-text available
This paper presents a new speaker change detection system based on Long Short-Term Memory (LSTM) neural networks using acoustic data and linguistic content. Language modelling is combined with two different Joint Factor Analysis (JFA) acoustic approaches: i-vectors and speaker factors. Both of them are compared with a baseline algorithm that uses c...
Conference Paper
Full-text available
The rapid growth of multimedia databases and the human interest in their peers make indices representing the location and identity of people in audio-visual documents essential for searching archives. Person discovery in the absence of prior identity knowledge requires accurate association of audio-visual cues and detected names. To this end, we pr...
Article
Full-text available
Over the last few years, i-vectors have been the state-of-the-art technique in speaker recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors but the DL techniques in use are computationally expensive and need phonetically labeled background data. The aim of this work is to develop an efficient alternat...
Article
The lack of labeled background data makes a big performance gap between cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring baseline techniques for i-vectors in speaker recognition. Although there are some unsupervised clustering techniques to estimate the labels, they cannot accurately predict the true labels and they also assume...
Conference Paper
Full-text available
The UPC system works by extracting monomodal signal segments (face tracks, speech segments) that overlap with the person names overlaid in the video signal. These segments are assigned directly with the name of the person and used as a reference to compare against the non-overlapping (unassigned) signal segments. This process is performed independe...
Conference Paper
Full-text available
This paper is focused on the application of the Language Identification (LID) technology for intelligent vehicles. We cope with short sentences or words spoken in moving cars in four languages: English, Spanish, German, and Finnish. As the response time of the LID system is crucial for user acceptance in this particular task, speech signals of diff...
Article
Full-text available
People and objects will soon share the same digital network for information exchange, in what has been called the age of cyber-physical systems. The general expectation is that people and systems will interact in real time. This puts pressure on system design to support increasing demands on computational power, while keeping a low power envelop...
Conference Paper
Restricted Boltzmann Machines (RBMs) have shown success in different stages of speaker recognition systems. In this paper, we propose a novel framework to produce a vector-based representation for each speaker, which will be referred to as RBM-vector. This new approach maps the speaker spectral features to a single fixed-dimensional vector carrying...
Conference Paper
Full-text available
i-vectors have been successfully applied over the last years in speaker recognition tasks. This work aims at assessing the suitability of i-vector modeling within the frame of speaker diarization task. In such context, a weighted cosine-distance between two different sets of i-vectors is proposed for speaker clustering. Speech clusters generated by...
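A cosine-based distance between clusters of i-vectors, as used for the speaker clustering above, can be sketched as follows. This is a simplified, unweighted variant in which each cluster is summarized by the mean of its length-normalized i-vectors; the paper proposes a weighted cosine distance, and the function names here are illustrative:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_distance(X, Y):
    """Cosine distance between two clusters of i-vectors.

    X: (n, d) and Y: (m, d) arrays, one i-vector per row. Each cluster
    is reduced to the mean of its length-normalized i-vectors, and the
    distance is 1 minus the cosine similarity of the two means.
    """
    cx = (X / np.linalg.norm(X, axis=1, keepdims=True)).mean(axis=0)
    cy = (Y / np.linalg.norm(Y, axis=1, keepdims=True)).mean(axis=0)
    return 1.0 - cosine(cx, cy)
```

In an agglomerative loop, the two clusters with the smallest such distance would be merged until a stopping criterion is met.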
Conference Paper
Full-text available
Jitter and shimmer voice-quality measurements have been successfully used to detect voice pathologies and classify different speaking styles. In this paper, we investigate the usefulness of jitter and shimmer voice measurements in the framework of the speaker diarization task. The combination of jitter and shimmer voice-quality features with the lo...
Conference Paper
Full-text available
This paper describes a system to identify people in broadcast TV shows in a purely unsupervised manner. The system outputs the identity of people that appear, talk and can be identified by using information appearing in the show (in our case, text with person names). Three types of monomodal technologies are used: speech diarization, video diarizat...
Conference Paper
Full-text available
In this paper, we propose to discriminatively model target and impostor spectral features using Deep Belief Networks (DBNs) for speaker recognition. In the feature level, the number of impostor samples is considerably large compared to previous works based on i-vectors. Therefore, those i-vector based impostor selection algorithms are not computati...
Article
The automatic estimation of age from face images is increasingly gaining attention, as it facilitates applications including advanced video surveillance, demographic statistics collection, customer profiling, or search optimization in large databases. Nevertheless, it becomes challenging to estimate age in uncontrolled environments, with insuff...
Conference Paper
Full-text available
The use of Restricted Boltzmann Machines (RBM) is proposed in this paper as a non-linear transformation of GMM supervectors for speaker recognition. It will be shown that the RBM transformation will increase the discrimination power of raw GMM supervectors for speaker recognition. The experimental results on the core test condition of the NIST SRE...
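For context, a Bernoulli RBM trained with one step of contrastive divergence (CD-1) can be sketched as below. This is the generic textbook training procedure, not the paper's specific supervector transformation; the class layout, learning rate and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli Restricted Boltzmann Machine trained with CD-1."""

    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b_v = np.zeros(n_vis)   # visible biases
        self.b_h = np.zeros(n_hid)   # hidden biases
        self.lr = lr

    def hidden(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1(self, v0):
        """One contrastive-divergence step on a (batch, n_vis) data matrix."""
        h0 = self.hidden(v0)                              # positive phase
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible(h_sample)                       # reconstruction
        h1 = self.hidden(v1)                              # negative phase
        # Approximate gradient: data statistics minus model statistics.
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return ((v0 - v1) ** 2).mean()                    # reconstruction error
```

Once trained, the hidden activations `hidden(v)` act as a non-linear transformation of the input vector, which is the general idea behind using an RBM on top of GMM supervectors.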
Chapter
Full-text available
An effective global impostor selection method is proposed in this paper for discriminative Deep Belief Networks (DBN) in the context of multi-session i-vector based speaker recognition. The proposed method is an iterative process in which, at each iteration, the whole impostor i-vector dataset is divided randomly into two subsets. The impostors in...
Conference Paper
Full-text available
Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of a speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we i...
Conference Paper
Full-text available
In this paper we propose an impostor selection method for a Deep Belief Network (DBN) based system which models i-vectors in a multi-session speaker verification task. In the proposed method, instead of choosing a fixed number of most informative impostors, a threshold is defined according to the frequencies of impostors. The selected impostors are...
Conference Paper
Full-text available
The use of Deep Belief Networks (DBNs) is proposed in this paper to model discriminatively target and impostor i-vectors in a speaker verification task. The authors propose to adapt the network parameters of each speaker from a background model, which will be referred to as Universal DBN (UDBN). It is also suggested to backpropagate class errors up...
Article
Full-text available
This paper addresses the problem of three-dimensional speaker orientation estimation in a smart-room environment equipped with microphone arrays. A Bayesian approach is proposed to jointly track the location and orientation of an active speaker. The main motivation is that the knowledge of the speaker orientation may yield an increased localization...
Article
Full-text available
Iris segmentation is the most determining factor in iris biometrics, which has traditionally assumed rigid constrained environments. In this work, a novel method that covers the localization of the pupillary and limbic iris boundaries is proposed. The algorithm consists of an energy minimization procedure posed as a multilabel one-directional graph...
Article
Full-text available
The handling of overlapping speech in the context of speaker diarisation has attracted the interest of the scientific community in recent years, since speaker overlap was identified as one of the factors degrading the performance of conventional diarisation systems. In this study, the authors are discussing the possibility of using long-term prosodic f...
Conference Paper
Full-text available
The goal of face detection is to determine the presence of faces in arbitrary images, along with their locations and dimensions. As with any graphics workload, these algorithms benefit from data-level parallelism. Existing parallelization efforts strictly focus on mapping different divide-and-conquer strategies onto multicore CPUs and G...
Article
Full-text available
In this article, we present the evaluation results for the task of speaker diarization of broadcast news, which was part of the Albayzin 2010 evaluation campaign of language and speech technologies. The evaluation data consists of a subset of the Catalan broadcast news database recorded from the 3/24 TV channel. The description of five submitted sy...
Conference Paper
Full-text available
In this paper, we present a clustering algorithm for speaker diarization based on spectral clustering. State-of-the-art diarization systems are based on agglomerative hierarchical clustering using the Bayesian Information Criterion and other statistical metrics among clusters, which results in a high computational cost and in a time demanding approach...
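A minimal two-way spectral split over a segment affinity matrix can be sketched as follows. This is the textbook Fiedler-vector bipartition, shown only to illustrate the family of methods; it is not the clustering algorithm proposed in the paper:

```python
import numpy as np

def spectral_bipartition(A):
    """Two-way spectral clustering from a symmetric affinity matrix.

    A: (N, N) symmetric matrix of pairwise segment affinities.
    Segments are split by the sign of the Fiedler vector, i.e. the
    eigenvector of the unnormalized graph Laplacian associated with
    the second-smallest eigenvalue.
    """
    D = np.diag(A.sum(axis=1))             # degree matrix
    L = D - A                              # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)       # 0/1 cluster labels
```

Compared with agglomerative BIC clustering, a single eigendecomposition of the affinity matrix replaces repeated pairwise model comparisons, which is where the computational savings come from.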
Conference Paper
Full-text available
Iris recognition systems are strongly dependent on their segmentation processes, which have traditionally assumed rigid experimental constraints to achieve good performance, but now move towards less constrained environments. This work presents a novel method on iris segmentation that covers the localization of the pupillary and limbic iris boundar...
Article
Full-text available
Simultaneous speech poses a challenging problem for conventional speaker diarization systems. In meeting data, a substantial amount of missed speech error is due to speaker overlaps, since usually only one speaker label per segment is assigned. Furthermore, simultaneous speech included in training data can lead to corrupt speaker models and thus wo...
Article
Full-text available
This work presents a novel two-step algorithm to estimate the orientation of speakers in a smart-room environment equipped with microphone arrays. First the position of the speaker is estimated by the SRP-PHAT algorithm, and the time delay of arrival for each microphone pair with respect to the detected position is computed. In the second step, the...
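The time delay of arrival for a microphone pair, used in the first step above, is commonly estimated with GCC-PHAT, the pairwise building block of SRP-PHAT. A minimal numpy sketch (not the paper's exact implementation; the sampling rate and function name are illustrative):

```python
import numpy as np

def gcc_phat(x, y, fs=16000):
    """Estimate the delay of signal y relative to signal x via GCC-PHAT.

    Returns the delay in seconds (positive when y lags x).
    """
    n = len(x) + len(y)                    # zero-pad for linear correlation
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    # Reorder so index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift
    return lag / fs
```

The phase-transform weighting flattens the cross-power spectrum magnitude, sharpening the correlation peak, which makes the delay estimate robust to reverberation.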
Article
Full-text available
Acoustic event detection (AED) aims at determining the identity of sounds and their temporal position in audio signals. When applied to spontaneously generated acoustic events, AED based only on audio information shows a large amount of errors, which are mostly due to temporal overlaps. Actually, temporal overlaps accounted for more than 70% of erro...
Conference Paper
Full-text available
Modern GPUs have evolved into fully programmable parallel stream multiprocessors. Due to the nature of the graphic workloads, computer vision algorithms are in good position to leverage the computing power of these devices. An interesting problem that greatly benefits from parallelism is face detection. This paper presents a highly optimized Haar-b...
Conference Paper
Full-text available
Overlapping speech is responsible for a certain amount of errors produced by standard speaker diarization systems in meeting environment. We are investigating a set of prosody-based long-term features as a potential complement to our overlap detection system relying on short-term spectral parameters. The most relevant features are selected in a two...
Article
Full-text available
Real-time processing is a requirement for many practical signal processing applications. In this work we implemented online 2-source acoustic event detection and localization algorithms in a Smart-room, a closed space equipped with multiple microphones. Acoustic event detection is based on HMMs that enable processing the input audio signal with ver...
Method
Full-text available
The Pitch-Scaled Harmonic Filter (PSHF) is designed to take as input a recorded speech wave file and a file containing an estimated f0 (pitch) track to produce two output wave files, corresponding to the periodic and aperiodic sources. These typically are related to the voiced (periodic) and unvoiced (aperiodic) parts of the speech signal. The algo...
Conference Paper
A substantial portion of the errors of conventional speaker diarization systems on meeting data can be attributed to overlapped speech. This paper proposes the use of several spatial features to improve speech overlap detection on distant channel microphones. These spatial features are integrated into a spectral-based system by using principal compo...
Article
Full-text available
Voices can be deliberately disguised by means of human imitation or voice conversion. The question arises to what extent they can be modified by using either method. In the current paper, a set of speaker identification experiments are conducted; first, analysing some prosodic features extracted from voices of professional impersonators attempting...
Article
Full-text available
This document describes the joint submission of the INESC-ID Spoken Language Systems Laboratory (L2F) and the TALP Research Center from the Technical University of Catalonia (UPC) to the 2010 NIST Speaker Recognition evaluation. The L2F-UPC primary system is composed of the fusion of five individual sub-systems. Speaker recognition results have...
Article
Full-text available
Simultaneous speech in meeting environment is responsible for a certain amount of errors caused by standard speaker diarization systems. We are presenting an overlap detection system for far-field data based on spectral and spatial features, where the spatial features obtained on different microphone pairs are fused by means of principal component...
Conference Paper
Eigenfaces are the classical features used in face recognition and have been commonly used with classification techniques based on Euclidean distance and, more recently, with Support Vector Machines. In speaker verification, GMM has been widely used for the recognition task. Lately, the combination of the GMM supervector, formed by the means of the...
Article
Full-text available
Jitter and shimmer are measures of the fundamental frequency and amplitude cycle-to-cycle variations, respectively. Both features have been largely used for the description of pathological voices, and since they characterise some aspects concerning particular voices, they are expected to have a certain degree of speaker specificity. In the current...
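The standard "local" definitions of the two measures described above can be sketched directly from a sequence of pitch periods and peak amplitudes (a Praat-style formulation given as an assumption; the exact measures used in the paper may differ):

```python
import numpy as np

def jitter_local(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, relative to the mean period, as a percentage."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.abs(np.diff(periods)).mean() / periods.mean()

def shimmer_local(amplitudes):
    """Local shimmer: mean absolute difference between consecutive peak
    amplitudes, relative to the mean amplitude, as a percentage."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.abs(np.diff(amplitudes)).mean() / amplitudes.mean()
```

A perfectly periodic voice yields zero jitter and shimmer; cycle-to-cycle irregularity raises both, which is why the features carry speaker-specific (and pathology-related) information.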
Article
Multimodal biometric fusion at score level can be performed by means of combinatory or classificatory techniques. In the first case, it is straightforward that the normalisation of the scores is a very important issue for the success of the fusion process. In the classificatory approach as, for instance, in support vector machine (SVM)-based system...
Article
Full-text available
Acoustic events produced in meeting environments may contain useful information for perceptually aware interfaces and multimodal behavior analysis. In this paper, a system to detect and recognize these events from a multimodal perspective is presented combining information from multiple cameras and microphones. First, spectral and temporal featu...
Conference Paper
Full-text available
The detection of the acoustic events (AEs) that are naturally produced in a meeting room may help to describe the human and social activity that takes place in it. When applied to spontaneous recordings, the detection of AEs from only audio information shows a large amount of errors, which are mostly due to temporal overlapping of sounds. In this p...