Adam Janin’s research while affiliated with Collateral Analytics and other places


Publications (44)


Toward Zero Oracle Word Error Rate on the Switchboard Benchmark
  • Conference Paper
September 2022 · 4 Reads · 1 Citation
Adam Janin · Sidhi Adkoli · [...]

Toward Zero Oracle Word Error Rate on the Switchboard Benchmark
June 2022 · 8 Reads

The "Switchboard benchmark" is a very well-known test set in automatic speech recognition (ASR) research, establishing record-setting performance for systems that claim human-level transcription accuracy. This work highlights lesser-known practical considerations of this evaluation, demonstrating major improvements in word error rate (WER) by correcting the reference transcriptions and deviating from the official scoring methodology. In this more detailed and reproducible scheme, even commercial ASR systems can score below 5% WER and the established record for a research system is lowered to 2.3%. An alternative metric of transcript precision is proposed, which does not penalize deletions and appears to be more discriminating for human vs. machine performance. While commercial ASR systems are still below this threshold, a research system is shown to clearly surpass the accuracy of commercial human speech recognition. This work also explores using standardized scoring tools to compute oracle WER by selecting the best among a list of alternatives. A phrase alternatives representation is compared to utterance-level N-best lists and word-level data structures; using dense lattices and adding out-of-vocabulary words, this achieves an oracle WER of 0.18%.
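The oracle selection idea in this abstract is simple to state: score the hypothesis against every admissible reference alternative and keep the best match. Below is a minimal Python sketch of that idea with toy data; the paper's actual scoring uses standardized tools such as sclite and much richer alternative representations, so this is illustrative only.

```python
# Minimal sketch of oracle WER over alternative references: for each
# utterance, score the hypothesis against every allowed reference
# variant and keep the lowest word error count. Data and names are
# illustrative, not from the paper's actual tooling.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def oracle_wer(utterances):
    """utterances: list of (list-of-reference-variants, hypothesis)."""
    errors = words = 0
    for refs, hyp in utterances:
        best_ref, best_err = min(
            ((r, edit_distance(r.split(), hyp.split())) for r in refs),
            key=lambda t: t[1])
        errors += best_err
        words += len(best_ref.split())
    return errors / words

utts = [(["uh huh", "uh-huh", "mhm"], "mhm"),
        (["you know it is", "you know it's"], "you know it's")]
print(f"oracle WER: {oracle_wer(utts):.2%}")  # 0.00%
```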


DCAR: A Discriminative and Compact Audio Representation for Audio Processing
May 2017 · 37 Reads · 12 Citations
IEEE Transactions on Multimedia
Liping Jing · [...]

This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events and scenes in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes into account both global structure and local structure. In this phase, the components are rendered more discriminative and compact by formulating an optimization problem on a Grassmannian manifold. The learned components can effectively represent the structure of audio. Our experiments used the YLI-MED and DCASE Acoustic Scenes datasets. The results show that variants on the proposed DCAR representation consistently outperform four popular audio representations (mv-vector, i-vector, GMM, and HEM-GMM). The advantage is significant for both easier and harder discrimination tasks; we discuss how these performance differences across tasks follow from how each type of model leverages (or doesn’t leverage) the intrinsic structure of the data.
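As a concrete illustration of the first phase described above, a per-track GMM can be fit with off-the-shelf tools. The feature type, dimensionality, and component count in the sketch below are assumptions for illustration, not the paper's configuration.

```python
# Sketch of DCAR's first phase as described above: fit a small GMM per
# audio track so its components capture within-track variability.
import numpy as np
from sklearn.mixture import GaussianMixture

def track_components(features, n_components=8):
    """features: (n_frames, n_dims) array of per-frame audio features
    (e.g., MFCCs). Returns the fitted means, covariances, and weights
    that serve as the track-level representation."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=0).fit(features)
    return gmm.means_, gmm.covariances_, gmm.weights_

# Toy usage: 500 frames of 20-dimensional features.
frames = np.random.randn(500, 20)
means, covs, weights = track_components(frames)
print(means.shape, weights.shape)  # (8, 20) (8,)
```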


A Discriminative and Compact Audio Representation for Event Detection
October 2016 · 26 Reads · 6 Citations

This paper presents a novel two-phase method for audio representation: Discriminative and Compact Audio Representation (DCAR). In the first phase, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes into account both global structure and local structure. In this phase, the components are rendered more discriminative and compact by formulating an optimization problem on a Grassmannian manifold; we found that the resulting components represent the structure of the audio effectively. Experimental results on the YLI-MED dataset show that the proposed DCAR representation consistently outperforms state-of-the-art audio representations: i-vector, mv-vector, and GMM.
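The second phase operates on a Grassmannian manifold, i.e., the space of linear subspaces. The sketch below shows only the underlying geometric primitive, the principal-angle distance between two subspaces; it is not the paper's discriminative optimization.

```python
# Geodesic distance between two k-dimensional subspaces of R^n,
# computed from their principal angles. A building block of Grassmannian
# methods, shown here for illustration only.
import numpy as np

def grassmann_distance(A, B):
    """A, B: (n, k) matrices whose columns span two k-dim subspaces.
    Returns the geodesic distance, the 2-norm of the principal angles."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    angles = np.arccos(np.clip(sigma, -1.0, 1.0))
    return np.linalg.norm(angles)

rng = np.random.default_rng(0)
A, B = rng.standard_normal((10, 3)), rng.standard_normal((10, 3))
print(grassmann_distance(A, A))  # ~0: a subspace is at distance 0 from itself
print(grassmann_distance(A, B))  # > 0 for generic random subspaces
```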


Figure 1. CNN feature-map-level fusion (fCNN).
Figure 2. CNN decision-level fusion (pCNN).
Fusion Strategies for Robust Speech Recognition and Keyword Spotting for Channel- and Noise-Degraded Speech
  • Conference Paper
  • Full-text available
September 2016 · 1,141 Reads · 7 Citations

Figure 4: Per-event percentages of test tracks that are correctly classified by how many representations. Acc is the proportion of representation types that correctly classify a given audio track.
Table: Comparison of detection performance for GMM representations with and without pre-training feature reduction.
DCAR: A Discriminative and Compact Audio Representation to Improve Event Detection
July 2016 · 69 Reads · 2 Citations

This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes into account both global structure and local structure. In this phase, the components are rendered more discriminative and compact by formulating an optimization problem on a Grassmannian manifold; we found that the resulting components represent the structure of the audio effectively. Our experiments used the YLI-MED dataset (an open TRECVID-style video corpus based on YFCC100M), which includes ten events. The results show that the proposed DCAR representation consistently outperforms state-of-the-art audio representations. DCAR's advantage over i-vector, mv-vector, and GMM representations is significant for both easier and harder discrimination tasks. We discuss how these performance differences across easy and hard cases follow from how each type of model leverages (or doesn't leverage) the intrinsic structure of the data. Furthermore, DCAR shows a particularly notable accuracy advantage on events where humans have more difficulty classifying the videos, i.e., events with lower mean annotator confidence.


Content-Based Privacy for Consumer-Produced Multimedia
April 2015 · 20 Reads · 1 Citation

We contend that current and future advances in Internet-scale multimedia analytics, global inference, and linking can circumvent traditional security and privacy barriers. We are therefore in dire need of a new research field to address this issue and come up with new solutions. We present the privacy risks and attack vectors, give details of a preliminary experiment on account linking, and describe mitigation and educational techniques that will help address these issues.


Figure 1: A snapshot of the YLI-MED index and annotations.
Figure 2: Comparison of the proportional geographical distributions of geotagged positive-example videos in YLI-MED vs. all videos in YFCC100M, in terms of grid placement. X-axis labels identify the latitude and longitude of the northeast corner of each grid unit.
Figure 3: Comparison of the proportional temporal distributions of positive-example videos in YLI-MED vs. all videos in YFCC100M, (a) by upload year and (b) by upload month (for all years combined).
Table 2: The final numbers of videos for each event category in the released corpus (including positives, near misses, and related videos). There is a fair amount of variation in the number of videos for each event, from 139 positive examples for Ev106 Person Grooming an Animal to 237 positive examples for Ev101 Birthday Party.
The YLI-MED Corpus: Characteristics, Procedures, and Plans
March 2015 · 247 Reads · 17 Citations

The YLI Multimedia Event Detection corpus is a public-domain index of videos with annotations and computed features, specialized for research in multimedia event detection (MED), i.e., automatically identifying what's happening in a video by analyzing the audio and visual content. The videos indexed in the YLI-MED corpus are a subset of the larger YLI feature corpus, which is being developed by the International Computer Science Institute and Lawrence Livermore National Laboratory based on the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. The videos in YLI-MED are categorized as depicting one of ten target events, or no target event, and are annotated for additional attributes like language spoken and whether the video has a musical score. The annotations also include degree of annotator agreement and average annotator confidence scores for the event categorization of each video. Version 1.0 of YLI-MED includes 1823 "positive" videos that depict the target events and 48,138 "negative" videos, as well as 177 supplementary videos that are similar to event videos but are not positive examples. Our goal in producing YLI-MED is to be as open about our data and procedures as possible. This report describes the procedures used to collect the corpus; gives detailed descriptive statistics about the corpus makeup (and how video attributes affected annotators' judgments); discusses possible biases in the corpus introduced by our procedural choices and compares it with the most similar existing dataset, TRECVID MED's HAVIC corpus; and gives an overview of our future plans for expanding the annotation effort.
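Since the index pairs each video with an event label and annotator-confidence scores, summaries like per-event counts and mean confidence are one-liners. The sketch below is purely illustrative; the column names are hypothetical, not the corpus's actual schema.

```python
# Hypothetical mini-version of an index like the one described above:
# one row per video, with an event label and a mean annotator confidence.
import pandas as pd

index = pd.DataFrame({
    "video_id":   ["v1", "v2", "v3", "v4"],
    "event":      ["Ev101", "Ev101", "Ev106", "none"],
    "confidence": [0.9, 0.7, 0.5, 0.8],   # mean annotator confidence
})

# Positive videos per event and their average annotator confidence.
positives = index[index["event"] != "none"]
print(positives.groupby("event")["confidence"].agg(["count", "mean"]))
```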


TANDEM-Bottleneck Feature Combination using Hierarchical Deep Neural Networks
September 2014 · 302 Reads · 10 Citations

To improve speech recognition performance, we investigate a combination of TANDEM and bottleneck Deep Neural Networks (DNNs). In particular, using feature combination performed by means of multi-stream hierarchical processing, we show a performance improvement from combining the same input features processed by different neural networks. The experiments are based on the spontaneous telephone recordings of the Cantonese IARPA Babel corpus, using both standard MFCCs and Gabor features as input.
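A hedged sketch of the bottleneck idea follows: train a DNN with a narrow hidden layer, use that layer's activations as features, and combine two such streams hierarchically. Layer sizes, nonlinearities, and input dimensions are illustrative assumptions, not the paper's topology.

```python
# A DNN with a narrow "bottleneck" hidden layer whose activations serve
# as features for a downstream stage. Two streams (e.g., MFCC and Gabor)
# are combined by concatenating their bottleneck features.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_in, n_hidden=1024, n_bottleneck=42, n_out=1000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck))  # narrow layer
        self.back = nn.Sequential(
            nn.Sigmoid(), nn.Linear(n_bottleneck, n_out))

    def forward(self, x):
        return self.back(self.front(x))

    def bottleneck(self, x):
        return self.front(x)  # features for the next stage

# Hierarchical combination: concatenate bottleneck features from two
# streams and feed a second-stage network (input dims are illustrative).
mfcc_net, gabor_net = BottleneckDNN(39), BottleneckDNN(311)
mfcc, gabor = torch.randn(8, 39), torch.randn(8, 311)
combined = torch.cat([mfcc_net.bottleneck(mfcc),
                      gabor_net.bottleneck(gabor)], dim=1)  # (8, 84)
print(combined.shape)
```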


The TAO of ATWV: Probing the mysteries of keyword search performance
December 2013 · 99 Reads · 51 Citations

In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, “actual term weighted value” (ATWV), and our recognition and keyword search systems. Our analysis uses two new oracle ATWV measures and a bootstrap-based ATWV confidence interval, and includes a study of the underpinnings of the large ATWV gains due to system combination. This analysis quantifies the potential ATWV gains from improving the number of true hits and the overall quality of the detection scores in our system's posting lists. It also shows that system combination improves our systems' ATWV via a small increase in the number of true hits in the posting lists.
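The metric itself is compact: term-weighted value averages, over keywords, one minus the miss probability minus a heavily weighted false-alarm probability, and ATWV evaluates it at the system's actual decision threshold. The sketch below uses the standard NIST/Babel weight beta = 999.9; the per-keyword counts are illustrative.

```python
# Sketch of ATWV: average over keywords of
#   1 - P_miss(kw) - beta * P_FA(kw),
# with counts taken at the system's chosen threshold.
BETA = 999.9  # standard NIST/Babel false-alarm weight

def atwv(keywords, speech_seconds):
    """keywords: dict mapping keyword -> (n_true, n_correct, n_false_alarm)."""
    twv_terms = []
    for n_true, n_corr, n_fa in keywords.values():
        if n_true == 0:
            continue  # keywords with no true occurrences are excluded
        p_miss = 1.0 - n_corr / n_true
        p_fa = n_fa / (speech_seconds - n_true)  # non-target trials
        twv_terms.append(1.0 - p_miss - BETA * p_fa)
    return sum(twv_terms) / len(twv_terms)

# Illustrative counts for a 10-hour (36,000 s) test set.
kws = {"hello": (10, 8, 1), "telephone": (3, 3, 0)}
print(f"ATWV = {atwv(kws, speech_seconds=36000):.3f}")
```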


Citations (37)


... While we believe this to be a valid method when lacking additional resources, it does add a lot of noise to any error analysis efforts and masks the true accuracy of underlying systems with higher error rates. (Faria et al., 2022) is a similar effort to show that WER on a standard English task, Switchboard, is actually lower than has been understood if one accounts for certain common alternate hypotheses. Their approach uses some of the various mechanisms that the sclite tool provides for expressing alternates and synonyms. ...

Reference: Style-agnostic evaluation of ASR using multiple reference transcripts
Toward Zero Oracle Word Error Rate on the Switchboard Benchmark
  • Citing Conference Paper
  • September 2022

... Different approaches have been proposed to combine systems to utilize the advantages of each system for better performance. Representative of these are ROVER [4], Confusion Network Combination (CNC) [5], and Multi-Stream Combination [6,7,8,9]. Recently, the Minimum Bayes Risk (MBR) combination method proposed by [10] is reported to outperform the more traditional ROVER and CNC. However, most of these system combination techniques perform multipass decoding, which makes the decoding process more complex and time consuming. ...

Multi-stream speech recognition: ready for prime time?
  • Citing Conference Paper
  • September 1999

... Traditionally, SAD is formulated as a statistical hypothesis test employing probabilistic models, such as Gaussians, mixtures of Gaussians, or Laplacian distributions [6,10,11,12]. During the last decade, however, deep neural networks (DNNs) have achieved impressive results on some of the more taxing SAD tasks, outperforming the traditional approaches [8,13,14]. ...

All for one: feature combination for highly channel-degraded speech activity detection
  • Citing Conference Paper
  • August 2013

... These works show a tremendous potential for pattern identification on audio; we would like to explore the information relevance and complementarity of these new features and spectral features in our future work. Jing et al. [17] present a novel two-phase method for audio representation; they take into account both global structure and local structure to learn the representation of audio. Ren et al. [32] argue that an image-like spectrogram cannot capture the complex texture details of the spectrogram well, so they propose a multichannel LBP feature to improve robustness to audio noise. ...

DCAR: A Discriminative and Compact Audio Representation for Audio Processing
  • Citing Article
  • May 2017

IEEE Transactions on Multimedia

... While most previous works consider visual information, some works utilize the audio information of videos to detect events. In [31], discriminative and compact audio representation was proposed to respect the structure of audio signals for event detection. ...

A Discriminative and Compact Audio Representation for Event Detection
  • Citing Conference Paper
  • October 2016

... We observed that time-frequency convolution (using TFCNN [11,17,18,19,20,21]) performed better than 1-D frequency convolution, and hence we have focused on the TFCNN acoustic models for our experiments presented in this paper. The TFCNN architecture is the same as in [11,22], where two parallel convolutional layers are used at the input, one performing convolution across time, and the other across frequency on the input filterbank features. The TFCNNs had 75 filters to perform time convolution and 200 filters to perform frequency convolution. ...

Fusion Strategies for Robust Speech Recognition and Keyword Spotting for Channel- and Noise-Degraded Speech
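The excerpt above pins the TFCNN front end down concretely: two parallel convolutional layers over the input filterbank features, one convolving across time (75 filters) and one across frequency (200 filters). A minimal sketch follows; kernel sizes, pooling, and the merge step are assumptions for illustration.

```python
# Two parallel convolutions over a (time, frequency) feature map, one
# across time and one across frequency, with outputs concatenated.
import torch
import torch.nn as nn

class TFCNNFront(nn.Module):
    def __init__(self):
        super().__init__()
        # input: (batch, 1, time, freq)
        self.time_conv = nn.Conv2d(1, 75, kernel_size=(9, 1))   # across time
        self.freq_conv = nn.Conv2d(1, 200, kernel_size=(1, 9))  # across frequency
        self.pool = nn.AdaptiveMaxPool2d((1, 1))

    def forward(self, x):
        t = self.pool(torch.relu(self.time_conv(x))).flatten(1)  # (batch, 75)
        f = self.pool(torch.relu(self.freq_conv(x))).flatten(1)  # (batch, 200)
        return torch.cat([t, f], dim=1)                          # (batch, 275)

x = torch.randn(4, 1, 11, 40)  # 11 frames of 40 filterbank features
print(TFCNNFront()(x).shape)   # torch.Size([4, 275])
```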

... A good VAD model needs to work accurately in challenging environments, including noisy conditions, reverberant environments and environments with competing speech. Significant research has been devoted to finding the optimal VAD features and models [1,2,3,4,5,6,7,8,9]. In the literature, LSTM-based VAD is a popular architecture for sequential modeling of the VAD task, showing state-of-the-art performance [7,8,9]. ...

All for one: Feature combination for highly channel-degraded speech activity detection

... Of course, several other AI datasets coexist. For instance, Zhixiang et al., 2018 points to three of them: CCV (Columbia Consumer Video) Jiang et al., 2011, YLI-MED (YLI Multimedia Event Detection) Bernd et al., 2015, Thomee et al., 2016, and ActivityNet Heilbron et al., 2015. Note that unlike the TRECVID datasets, the datasets mentioned in this paragraph are not specifically designed for fingerprinting applications but for general video tracking applications (including indexing): hence, the near-duplicated content is expected to be created by the experimenter, according to the application requirements and the principles above. ...

The YLI-MED Corpus: Characteristics, Procedures, and Plans

... Another modification in the design of the Tandem solution of longterm speech modulations is the use of separate hierarchical bottleneck approaches (Plahl et al. 2010) based on the perception of the MLP. Recently, Ravanelli and Janin (2014) have achieved a non-linear way of reducing features with a combination of different NN structures that leads to a relative improvement in the performance of ASR systems. ...

TANDEM-Bottleneck Feature Combination using Hierarchical Deep Neural Networks