Itshak Lapidot

Ben-Gurion University of the Negev, Be'er Sheva, Southern District, Israel

Publications (23) · 6.3 Total Impact

  • Oshry Ben-Harush, Itshak Lapidot, Hugo Guterman
    ABSTRACT: Speaker diarization systems attempt to assign temporal segments from a conversation between $R$ speakers to the appropriate speaker $r$. This task is generally performed with no prior information about the speakers. The number of speakers is usually unknown and must be estimated, although in some applications it is known in advance. The diarization process generally consists of change detection, clustering, and labeling of a given audio stream. Speaker diarization can be performed using an iterative approach whose performance depends strongly on the choice of initial conditions. This study examines the influence of several common initialization algorithms, including two variants of a recently proposed K-means-based initialization algorithm, on the performance of an iterative speaker diarization system applied to two-speaker telephone conversations. The suggested system models the speakers and the non-speech in each conversation with either self-organizing maps or Gaussian mixture models. The diarization system and initialization algorithms are tuned on a development set of 108 telephone conversations taken from the LDC CallHome corpus; the evaluation subset consists of 2048 telephone conversations extracted from the NIST 2005 Rich Transcription corpus. The results show that initializing the speaker diarization system with the K-means-based algorithms provides a relative improvement of 10.4% on the LDC development set and 12.2% on the NIST evaluation subset compared to random initialization, measured after the 12 iterations the randomly initialized diarization process needs to converge. With the K-means-based initialization, only five iterations are required for convergence. Thus, the new initialization improves performance both in diarization error rate and in speed of convergence.
    IEEE Transactions on Audio Speech and Language Processing 01/2012; 20:414-425. · 1.68 Impact Factor
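The K-means-based initialization idea in the abstract above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' implementation: the function name `segmental_kmeans_init`, the 50-frame segment length, and the use of scikit-learn's `KMeans` are all my assumptions. The point it demonstrates is that segments, not individual frames, are clustered, so segment-level statistics drive the initial speaker assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

def segmental_kmeans_init(features, n_speakers, seg_len=50):
    """Cluster fixed-length segments (not single frames) to seed diarization.

    features : (n_frames, n_dims) array of e.g. MFCC vectors
    seg_len  : frames per segment (50 frames is roughly 0.5 s at a 10 ms hop)
    Returns a per-frame label array in {0, ..., n_speakers - 1}.
    """
    n_frames = len(features)
    seg_starts = np.arange(0, n_frames, seg_len)
    # Each segment is represented by its mean vector.
    seg_means = np.array([features[s:s + seg_len].mean(axis=0) for s in seg_starts])
    seg_labels = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(seg_means)
    # Expand segment labels back to frame resolution.
    return np.repeat(seg_labels, seg_len)[:n_frames]
```

The frame labels returned here would then seed the iterative SOM- or GMM-based refinement described in the abstract.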
  • I. Lapidot
    ABSTRACT: We examine different initializations and their influence on the performance of an iterative speaker diarization system. Six initialization methods were examined, ranging from naive frame-based random initialization, through uniform division of the conversation among the clusters, to weighted segmental K-means. The initialization methods were tested on two telephone conversation databases: LDC America CallHome and NIST SRE-05. In contrast to most work on meetings and broadcast shows, where speaker turns are infrequent and minimum duration constraints of 2.5 s or more can be applied to capture speaker statistics, in telephone conversations speaker turns are much more frequent and the minimum duration must be set to several hundred milliseconds. In such cases, good cluster initialization is very important. We show that initialization with weighted segmental K-means outperforms all the other methods, that the fixed or minimum duration constraints can then be relaxed, and that even without any constraint on segment duration the results are significantly better than with the other initializations.
    Electrical & Electronics Engineers in Israel (IEEEI), 2012 IEEE 27th Convention of; 01/2012
  • Source
    I. Lapidot, J.-F. Bonastre
    ABSTRACT: In this work we examine whether linear discriminant analysis (LDA) can improve diarization performance when used as an additional phase in a telephone conversation diarization system. We first apply a classical diarization system. Using the system's output to define the classes of interest, an LDA transformation of the mel-cepstrum features is performed, and the final diarization pass is then applied to the transformed features. A relative improvement of 14.8% was obtained on the LDC America CallHome database. LDA appears sensitive to both segment duration and the amount of data available for training, as shown by the results on the NIST SRE-05 database, where no significant improvement was observed.
    Electrical & Electronics Engineers in Israel (IEEEI), 2012 IEEE 27th Convention of; 01/2012
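The LDA phase described above can be sketched in a few lines. This is an illustrative approximation, not the paper's code: the function name `lda_refine_features` and the use of scikit-learn's `LinearDiscriminantAnalysis` are my assumptions; the idea shown is that the first pass's output labels define the LDA classes, and the projected features feed a second diarization pass.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_refine_features(features, first_pass_labels):
    """Fit LDA on first-pass diarization labels (e.g. speaker 1, speaker 2,
    non-speech) and project the cepstral features for a second pass."""
    lda = LinearDiscriminantAnalysis()
    return lda.fit(features, first_pass_labels).transform(features)
```

With three classes, the projection has at most two dimensions (n_classes - 1), which is one reason the method is sensitive to the amount of data per class.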
  • U. Ben Simon, I. Lapidot, H. Guterman
    ABSTRACT: This paper compares several feature normalization methods and several types of Gaussian mixture model (GMM) based supervector normalizations for robust speaker verification. The normalization methods were implemented as part of a speaker verification system using a support vector machine (SVM) classifier and GMM-based supervectors, with Mel-frequency cepstral coefficient (MFCC) feature extraction. A valid question is which feature normalization to use, if any. We examine the most common feature normalization methods, such as feature warping and cepstral mean subtraction (CMS) with and without variance normalization, and compare them to features without any normalization and to a basic [-1, 1] normalization. In addition, we applied several normalizations to the GMM-mean supervectors in order to improve the performance of the SVM classifier. All comparisons were made in terms of the DET curve, the equal error rate (EER), and the minimum detection cost function (min. DCF). The best results were achieved with combined supervector normalization: universal background model (UBM) standard deviation (STD) normalization followed by [-1, 1] normalization. The type of MFCC normalization had little influence on verification performance. The best results were an EER of about 5.0% and a min. DCF of 0.02.
    Electrical and Electronics Engineers in Israel (IEEEI), 2010 IEEE 26th Convention of; 12/2010
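The winning supervector normalization (UBM STD followed by [-1, 1] scaling) can be sketched as below. This is a minimal illustration under my own assumptions: the function name, the per-vector peak scaling used to reach [-1, 1], and the input shapes are not taken from the paper.

```python
import numpy as np

def normalize_supervector(supervector, ubm_std):
    """UBM-STD normalization followed by [-1, 1] scaling of a GMM-mean
    supervector, before it is fed to an SVM classifier."""
    sv = supervector / ubm_std            # divide by the UBM's per-dimension STD
    peak = np.abs(sv).max()
    return sv / peak if peak > 0 else sv  # map into [-1, 1]
```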
  • Oshry Ben-Harush, Itshak Lapidot, Hugo Guterman
    INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010; 01/2010
  • Wafi Abo-Gannemhy, Itshak Lapidot, H. Guterman
    ABSTRACT: In this paper we employ backward Viterbi search for speech recognition. Contrary to forward Viterbi search, which runs from the beginning of the utterance to the end and where each word depends on the preceding words, backward Viterbi search runs from the end to the beginning and each word depends on the following words. Since the forward and backward searches make different errors, an improvement can be achieved by combining them. The fusion is attained by an expert system based on the ROVER algorithm, using word confidence measures and an optimal confidence value for null arcs that depends on their position in the word transition network (WTN). Experimental results show that the combined system significantly improves over both the forward and the backward Viterbi decoders on the Numbers 95 database.
    01/2010;
  • Source
    O. Ben-Harush, H. Guterman, I. Lapidot
    ABSTRACT: Speaker diarization systems attempt to assign temporal speech segments in a conversation to the appropriate speaker, and non-speech segments to non-speech; they essentially answer the question "Who spoke when?". One inherent deficiency of most current systems is their inability to handle co-channel (overlapped) speech. Over the past few years several studies have addressed the detection and separation of overlapped or co-channel speech; however, most of the suggested algorithms work only under specific conditions, have high computational complexity, and require analysis of the audio in both the time and frequency domains. In this study, frame-based entropy analysis of the audio in the time domain serves as the single feature for an overlapped speech detection algorithm. Overlapped speech segments are identified using Gaussian mixture modeling (GMM) together with well-known classification algorithms applied to two-speaker conversations. This methodology eliminates the need to set a hard threshold for each conversation or database. The LDC CALLHOME American English corpus is used to evaluate the suggested algorithm. The proposed method successfully detects 60.0% of the frames labeled as overlapped speech in the baseline (ground-truth) segmentation, while keeping a 5% false-alarm rate.
    Machine Learning for Signal Processing, 2009. MLSP 2009. IEEE International Workshop on; 10/2009
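One plausible reading of "frame-based entropy analysis in the time domain" is the Shannon entropy of each frame's sample-amplitude histogram. The sketch below is my own interpretation, not the paper's feature definition; the frame length, bin count, and function name are assumptions. The intuition it illustrates: a frame with two active speakers tends to have a more spread-out amplitude distribution, hence higher entropy.

```python
import numpy as np

def frame_entropy(signal, frame_len=400, n_bins=32):
    """Shannon entropy of each frame's sample-amplitude histogram,
    computed entirely in the time domain."""
    n_frames = len(signal) // frame_len
    entropy = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        counts, _ = np.histogram(frame, bins=n_bins)
        p = counts[counts > 0] / frame_len   # empirical bin probabilities
        entropy[i] = -(p * np.log2(p)).sum()
    return entropy
```

In the paper, a GMM-based classifier on this per-frame feature replaces a hard, per-conversation entropy threshold.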
  • Oshry Ben-Harush, Itshak Lapidot, Hugo Guterman
    INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
  • O. Ben-Harush, I. Lapidot, H. Guterman
    ABSTRACT: A new approach for the initial assignment of data in a speaker clustering application is presented. The approach applies the segmental K-means clustering algorithm prior to competitive-based learning. The clustering system relies on self-organizing maps (SOM) for speaker modeling and as a likelihood estimator. Performance is evaluated on 108 two-speaker conversations taken from the LDC CALLHOME American English Speech corpus using the NIST criterion, and shows a 20%-30% improvement in cluster error rate (CER) relative to the randomly initialized clustering system. The number of iterations is also reduced significantly, which contributes to both the speed and the efficiency of the clustering system.
    ELMAR, 2008. 50th International Symposium; 10/2008
  • Source
    Oshry Ben-Harush, Hugo Guterman, Itshak Lapidot
    ABSTRACT: Audio diarization is the process of assigning temporal segments of an audio channel to the appropriate generating source according to specific acoustic properties; sources can be speech, music, background noise, etc. Speaker diarization systems confront the problem of segmenting and labeling a conversation when no prior knowledge about the speakers is available. Since segmentation by a human expert is time- and money-consuming, it is worthwhile to develop an automatic diarization system to replace it for speaker recognition applications. However, diarization systems produce more falsely detected segments than can be tolerated for speaker model training. This work focuses on reducing the falsely detected segments and on selecting "pure" segments that contain only the required speaker's data. For this purpose, a measure of "purity" and a methodology for extracting the "pure" segments are required. In this paper a pure-segment selection algorithm employing an expert-system decision is presented. The proposed system is based on a majority vote and on the normalized maximum likelihood of the segments. The selection algorithm relies on the accuracy of the underlying diarization system, which uses self-organizing maps (SOM) as speaker models. One hundred and eight conversations from the LDC America CallHome database are used for evaluation. The proposed approach shows a 29% relative DER improvement over the DER achieved by the original diarization system.
    01/2008;
  • Oshry Ben-Harush, Itshak Lapidot, Hugo Guterman
    INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, September 22-26, 2008; 01/2008
  • Source
    I. Lapidot, H. Guterman
    ABSTRACT: In many time series, such as speech, biosignals, protein chains, etc., there is a dependency between consecutive vectors. As the dependency is limited in duration, such data can be referred to as piecewise-dependent data (PDD). In clustering, it is frequently necessary to minimize a given distance function. In this letter, we show that in PDD clustering there is a contradiction between the desire for high resolution (short segments and low distance) and high accuracy (long segments and high distance), i.e., meaningful clustering.
    IEEE Signal Processing Letters 05/2003; · 1.67 Impact Factor
  • Itshak Lapidot
    8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003; 01/2003
  • Article: PDD Clustering
    Itshak Lapidot, Hugo Guterman
    ABSTRACT: In many signals, such as speech, bio-signals, protein chains, etc., there is a dependency between consecutive vectors. As the dependency is limited in duration, such data can be called piecewise-dependent data (PDD). In clustering, it is frequently necessary to minimize a given distance function. In this paper we show that in PDD clustering there is a contradiction between the desire for high resolution (short segments and low distance) and high accuracy (long segments and high distortion), i.e., meaningful clustering.
    12/2002;
  • Jitendra Ajmera, Itshak Lapidot
    ABSTRACT: In this report, we build upon our previous work on automatic speaker clustering, in which we presented an HMM-based clustering framework where both the number of speakers and the segmentation boundaries are unknown a priori. Starting from over-clustering, we converge to a final clustering through an iterative merging and retraining process: a Gaussian mixture model (GMM) is trained for each hypothesized speaker cluster, the 'closest' pair of clusters is selected for merging, and the GMM of the merged cluster is retrained. The main contribution of this paper is a new similarity measure between two probability density functions estimated by GMMs. It is shown that this similarity measure can be used without any threshold or penalty term of the kind often required by information-theoretic measures such as the Bayesian information criterion (BIC) and minimum description length (MDL). Merging and retraining are repeated until no candidate pair of clusters remains. The system is evaluated on the 1996 Hub-4 evaluation set and shows significant improvements over our previous results. In particular, the system often converges to the correct number of clusters (that is, the correct number of speakers) and, consequently, a high average speaker purity is observed.
    10/2002;
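A penalty-free merge test in the spirit of the abstract above can be sketched as follows. This is a simplified reconstruction under my own assumptions (function name, component counts, use of scikit-learn's `GaussianMixture`), not the paper's exact similarity measure: two clusters are merged when a pooled GMM with the summed component count models the pooled data at least as well as the two separate GMMs model their own data, so no BIC/MDL-style penalty is needed because the parameter count is unchanged.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def should_merge(data_a, data_b, n_comp=2, seed=0):
    """Compare total log-likelihood of two clusters under their own GMMs
    against the pooled data under one GMM with the summed component count."""
    gmm_a = GaussianMixture(n_components=n_comp, random_state=seed).fit(data_a)
    gmm_b = GaussianMixture(n_components=n_comp, random_state=seed).fit(data_b)
    pooled = np.vstack([data_a, data_b])
    gmm_ab = GaussianMixture(n_components=2 * n_comp, random_state=seed).fit(pooled)
    # score() returns the mean per-sample log-likelihood; scale to totals.
    ll_separate = gmm_a.score(data_a) * len(data_a) + gmm_b.score(data_b) * len(data_b)
    ll_merged = gmm_ab.score(pooled) * len(pooled)
    return bool(ll_merged > ll_separate)
```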
  • Jitendra Ajmera, Itshak Lapidot, Iain Mccowan
    ABSTRACT: An HMM-based speaker clustering framework is presented in which the number of speakers and the segmentation boundaries are unknown a priori. Ideally, the system creates one pure cluster for each speaker. The HMM is ergodic with a minimum-duration topology. The final number of clusters is determined automatically by merging the closest clusters and retraining the merged cluster until a decrease in likelihood is observed. Within the same framework, we also examine the effect of using only the features from highly voiced frames as a means of improving the robustness and computational complexity of the algorithm. The proposed system is assessed on the 1996 HUB-4 evaluation test set in terms of both cluster and speaker purity. It is shown that the number of clusters found often corresponds to the actual number of speakers.
    05/2002;
  • Source
    I Lapidot, H Guterman, A Cohen
    ABSTRACT: We present a method for clustering the speakers in an unlabeled and unsegmented conversation with a known number of speakers, when no a priori knowledge about the identity of the participants is given. Each speaker is modeled by a self-organizing map (SOM). The SOMs are randomly initialized, and an iterative algorithm lets the data move from one model to another while the SOMs adjust. The restriction that the data can move only in small groups, rather than one feature vector at a time, forces the SOMs to adjust to speakers (instead of phonemes or other vocal events). This method was applied to high-quality conversations with two to five participants and to two-speaker telephone-quality conversations. The results for two speakers (both high and telephone quality) and for three speakers were over 80% correct segmentation. The problem becomes even harder when the number of participants is also unknown; based on the iterative clustering algorithm, a validity criterion was therefore developed to estimate the number of speakers. In 16 out of 17 high-quality conversations with two or three participants, the estimated number of participants was correct. On telephone quality, the results were poorer.
    IEEE Transactions on Neural Networks 02/2002; 13(4):877-87. · 2.95 Impact Factor
  • Source
    7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002; 01/2002
  • I. Lapidot
    ABSTRACT: A new approach is presented for clustering the speakers in an unlabeled and unsegmented conversation when the number of speakers is unknown. In this approach, each speaker is modeled by a self-organizing map (SOM), and the Bayesian information criterion (BIC) is applied to estimate the number of clusters. The approach was tested on the NIST 1996 HUB-4 evaluation set in terms of speaker and cluster purities. Results indicate that the combined SOM-BIC approach can lead to better clustering results than the baseline system.
    01/2002;
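BIC-based model selection of the speaker count, as used in the abstract above, can be sketched as follows. This is a deliberately simplified illustration under my own assumptions: the paper models speakers with SOMs, whereas this sketch substitutes one GMM per candidate count and uses scikit-learn's built-in `bic()`; the function name and candidate range are also assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_n_speakers(features, max_speakers=5, seed=0):
    """Fit one GMM per candidate speaker count and return the count whose
    model minimizes the Bayesian information criterion (BIC)."""
    bic = [GaussianMixture(n_components=k, random_state=seed).fit(features).bic(features)
           for k in range(1, max_speakers + 1)]
    return int(np.argmin(bic)) + 1
```

BIC trades goodness of fit against a parameter-count penalty, so adding a component beyond the true number of well-separated sources raises the score rather than lowering it.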
  • Source
    Itshak Lapidot, Hugo Guterman
    ABSTRACT: For an unlabeled and unsegmented conversation, i.e. one with no a priori knowledge about the speakers' identities or the segment boundaries, it is very important to cluster the conversation (perform segmentation and labeling) at the best possible resolution. At low resolution, i.e. when segments are long, a segment might contain data from several speakers; on the other hand, when short segments are used (high resolution), there are not enough statistics to allow a correct decision about the identity of the speaker. In this work the performance of a system employing different segment lengths is presented. We assume that the number of speakers, R, is known, and high-quality conversations were used. Each speaker is modeled by a self-organizing map (SOM). An iterative algorithm lets the data move from one model to another while the SOMs adjust. The restriction that the data can move only in small groups, rather than one feature vector at a time, forces the SOMs to adjust to speakers (instead of phonemes or other vocal events). We found that the optimal segment duration was half a second. The system achieved a clustering performance of about 90% for two-speaker conversations and over 80% for three-speaker conversations.