Preeti S Rao

  • Doctor of Philosophy
  • Professor at Indian Institute of Technology Bombay

About

Publications: 160
Reads: 29,173
Citations: 1,455
Introduction
Speech and music signal processing
Current institution
Indian Institute of Technology Bombay
Current position
  • Professor

Publications

Publications (160)
Conference Paper
Typical sounds we hear, including speech and music, are a complex combination of tones. Yet, owing to the challenges involved in understanding the processing of sounds in the cochlea, complex sound inputs are seldom explored. Among a few exceptions, three tones are used in acquiring the Distortion Product Otoacoustic Emissions (DPOAE) suppression t...
Preprint
Full-text available
Reading fluency assessment is a critical component of literacy programmes, serving to guide and monitor early education interventions. Given the resource intensive nature of the exercise when conducted by teachers, the development of automatic tools that can operate on audio recordings of oral reading is attractive as an objective and highly scalab...
Preprint
Full-text available
Creating a complex work of art like music necessitates profound creativity. With recent advancements in deep learning and powerful models such as transformers, there has been huge progress in automatic music generation. In an accompaniment generation context, creating a coherent drum pattern with apposite fills and improvisations at proper location...
Preprint
Full-text available
Literacy assessment is an important activity for education administrators across the globe. Typically achieved in a school setting by testing a child's oral reading, it is intensive in human resources. While automatic speech recognition (ASR) is a potential solution to the problem, it tends to be computationally expensive for hand-held devices apar...
Preprint
Full-text available
The detection of perceived prominence in speech has attracted approaches ranging from the design of linguistic knowledge-based acoustic features to the automatic feature learning from suprasegmental attributes such as pitch and intensity contours. We present here, in contrast, a system that operates directly on segmented speech waveforms to learn f...
Preprint
Full-text available
The tabla is a unique percussion instrument due to the combined harmonic and percussive nature of its timbre, and the contrasting harmonic frequency ranges of its two drums. This allows a tabla player to uniquely emphasize parts of the rhythmic cycle (theka) in order to mark the salient positions. An analysis of the loudness dynamics and timing dev...
Preprint
Full-text available
Expressive reading, considered the defining attribute of oral reading fluency, comprises the prosodic realization of phrasing and prominence. In the context of evaluating oral reading, it helps to establish the speaker's comprehension of the text. We consider a labeled dataset of children's reading recordings for the speaker-independent detection o...
Preprint
Full-text available
Syllable detection is an important speech analysis task with applications in speech rate estimation, word segmentation, and automatic prosody detection. Based on the well understood acoustic correlates of speech articulation, it has been realized by local peak picking on a frequency-weighted energy contour that represents vowel sonority. While seve...
Article
Full-text available
An interesting aspect of Indian art music is the prominent place of improvisation in performance. We explore the influence of the structural constraints of the genre on raga motifs in the course of improvisation. Audio recordings of North Indian vocal concerts are analysed to extract measurements of the defining parameters of the recurrent melodic...
Article
Prosody is the supra-segmental aspect of speech that helps to convey the structure and intended meaning of lexical content unambiguously. The automatic detection of prosodic events, such as phrase boundary and word prominence, has a number of applications in discourse analysis, where a combination of syntactic and acoustic-prosodic features is typi...
Article
Full-text available
Dhrupad vocal concerts exhibit a temporal evolution through a sequence of homogeneous sections marked by shared rhythmic characteristics. In this work, we address the segmentation of a concert audio’s unmetered improvisatory section into musically meaningful segments at the highest time scale. Motivated by the distinct musical properties of the sec...
Preprint
Full-text available
Generative Models for Audio Synthesis have been gaining momentum in the last few years. More recently, parametric representations of the audio signal have been incorporated to facilitate better musical control of the synthesized output. In this work, we investigate a parametric model for violin tones, in particular the generative modeling of the re...
Preprint
Full-text available
A Dhrupad vocal concert comprises a composition section that is interspersed with improvised episodes of increased rhythmic activity involving the interaction between the vocals and the percussion. Tracking the changing rhythmic density, in relation to the underlying metric tempo of the piece, thus facilitates the detection and labeling of the impr...
Preprint
With the advent of data-driven statistical modeling and abundant computing power, researchers are turning increasingly to deep learning for audio synthesis. These methods try to model audio signals directly in the time or frequency domain. In the interest of more flexible control over the generated sound, it could be more useful to work with a para...
Preprint
Full-text available
In this paper our goal is to convert a set of spoken lines into sung ones. Unlike previous signal processing based methods, we take a learning based approach to the problem. This allows us to automatically model various aspects of this transformation, thus overcoming dependence on specific inputs such as high quality singing templates or phoneme-sc...
Conference Paper
Full-text available
With the advent of data-driven statistical modeling and abundant computing power, researchers are turning increasingly to deep learning for audio synthesis. These methods try to model audio signals directly in the time or frequency domain. In the interest of more flexible control over the generated sound, it could be more useful to work with a para...
Preprint
Full-text available
Use a parametric representation of audio to train a generative model in the interest of obtaining more flexible control over the generated sound.
Preprint
Full-text available
We present a melody-based classification of musical styles by exploiting the pitch- and energy-based characteristics derived from the audio signal. Three prominent musical styles were chosen that have improvisation as an integral part, with similar melodic principles, theme, and structure of concerts, namely Hindustani, Carnatic and Turkish music. List...
Article
A prominent aspect of the notion of musical similarity across the music of various cultures is related to the local matching of melodic motifs. This holds for Indian art music, a highly structured form with raga playing a critical role in the melodic organization. Apart from the tonal material, a raga is characterized by a set of melodic phrases th...
Article
Full-text available
Raga grammar provides a theoretical framework that supports creativity and flexibility in improvisation while carefully maintaining the distinctiveness of each raga in the ears of a listener. A computational model for raga grammar can serve as a powerful tool to characterize grammaticality in performance. Like in other forms of tonal music, a distr...
Conference Paper
We present a Computer-Aided Language Learning (CALL) system that assesses a child's oral reading skill including the prosodic aspects. With children who have otherwise achieved word decoding automaticity, prosodic fluency is a reliable predictor of comprehension. Prosody includes attributes such as pace, phrasing and expression. Based on the acoust...
Conference Paper
Full-text available
In the Indian classical drumming tradition, the different strokes on the tabla are named by spoken syllables (bols) in a case of onomatopoeia. The recitation of a tabla composition using vocalic syllables (bols) plays an important role in the oral tradition of pedagogy in North Indian classical music. Previous studies have considered the phonetic fea...
Preprint
Full-text available
We investigate methods for the automatic labeling of the taan section, a prominent structural component of the Hindustani Khayal vocal concert. The taan contains improvised raga-based melody rendered in the highly distinctive style of rapid pitch and energy modulations of the voice. We propose computational features that capture these specific high...
Article
This work targets building an oral reading “tutor” that provides automatic and reliable feedback to children learning to read. The work uses state-of-the-art speech recognition technology coupled with prosody modeling. The system is tested on available datasets of children’s readings in English as a second language. The expected challenges relate...
Conference Paper
The evaluation of oral reading skills is considered an important component of language education in school. Compared with word decoding skill, prosodic fluency typically takes children much longer to achieve. Prosodic fluency, however, is linked to comprehension making its evaluation very useful in an automatic reading assessment system. We conside...
Article
Focus or prominence is an important linguistic function of prosody. The acoustic realisation of prominence in an utterance, in most languages, involves one or more acoustic dimensions while affecting one or more words in the utterance. It is of interest to identify the acoustic correlates as well as their possible interaction in the production and...
Conference Paper
Full-text available
A tool for automatic pronunciation evaluation of singing is desirable for those learning a second language. However, efforts to obtain pronunciation rules for such a tool have been hindered by a lack of data; while many spoken-word datasets exist that can be used in developing the tool, there are relatively few sung-lyrics datasets for such a purp...
Conference Paper
This project targets using state-of-the-art automatic speech recognition technology, coupled with new work in predicting the relevant prosody ratings, to build an oral reading assessment tool. A reliable automatic system can prove invaluable in helping children acquire basic reading skills apart from facilitating the monitoring of literacy p...
Conference Paper
Text-to-speech synthesizers present an attractive alternative to reading in hands-free communication scenarios. Speech intelligibility and naturalness are key to the user acceptability of synthesized speech. The accurate modeling of prosody plays an important role in both dimensions. While prosody is language dependent, it is also strongly dependen...
Conference Paper
Full-text available
We study the effect of focus shift on prosodic features for Marathi, a major Indian language. In our analysis, we consider different focus locations and different focus widths. We report observations of fundamental frequency, intensity, and syllabic durations of constituent words of the utterance. F0 is studied via the accent commands of the Fujisa...
Article
The computer-assisted learning of spoken language is closely tied to automatic speech recognition (ASR) technology which, as is well known, is challenging with non-native speech. By focusing on specific phonological differences between the target and source languages of non-native speakers, pronunciation assessment can be made more reliable. The fo...
Conference Paper
Full-text available
Time-series pattern matching methods that incorporate time warping have recently been used with varying degrees of success on tasks of search and discovery of melodic phrases from audio for Indian classical vocal music. While these methods perform effectively due to the minimal assumptions they place on the nature of the sampled pitch temporal traj...
Article
In degraded listening conditions, speakers are known to adapt their speech via the Lombard reflex to make it more comprehensible. This characteristic has been used in previous work to modify speech recorded in quiet before it is rendered in a noisy environment. The spectral modifications used have been found to be effective in low-pass noise such a...
Article
Full-text available
The melodic phrases of a raga are an important cue to its identity. Artists, however, incorporate considerable creative variation within a raga phrase during performance while still preserving its identity in the ears of the listeners. It is of interest therefore to explore the boundaries of this categorization of phrase identity, given the space o...
Article
Full-text available
With its origin in the Samveda, composed between 1500–900 BC, the art music of India has evolved through the ages and come to be regarded as one of the oldest surviving music systems in the world today. This paper aims to provide an overview of the fundamentals governing Hindustani music (also known as North Indian music) as practiced today. The delibe...
Article
Full-text available
Ragas are characterized by their melodic motifs or catch phrases that constitute strong cues to the raga identity for both the performer and the listener, and therefore are of great interest in music retrieval and automatic transcription. While the characteristic phrases, or pakads, appear in written notation as a sequence of notes, musicological r...
Conference Paper
Voice-based querying for information is a powerful technology that can hugely enhance the scope of information retrieval systems by enabling their remote access via the ubiquitous mobile phone. Information retrieval based on automatic speech recognition however is challenging due to the environment noise and speaker idiosyncrasies typical of real-w...
Conference Paper
With mobile phone penetration high and growing rapidly, speech based access to information is an attractive proposition. However, automatic speech recognition (ASR) performance is seriously compromised in real-world scenarios where background acoustic noise is omnipresent. Speech enhancement methods can help to improve the signal quality presented...
Article
Motivated by the potential of speech modification for the enhancement of intelligibility in noisy environments, we study the acoustic characteristics of speech produced in the context of critical announcements made in noisy listening situations. A corpus of 3 speakers producing 20 Marathi train station announcements is analysed for articulatory-aco...
Conference Paper
Full-text available
In this work, we study the different playing styles of Indian classical instrumental music with respect to signal characteristics from flute solo concert performances by prominent artists. Like other Hindustani classical instrumentalists, flautists have evolved Gayaki (vocal) and Tantrakari (plucked string) playing styles. The production and acoust...
Conference Paper
We consider the pronunciation assessment of vowels of Indian English uttered by speakers with Gujarati L1 using confidence measures obtained by automatic speech recognition. The goodness-of-pronunciation measure as captured by the acoustic likelihood scores can be effective only when the acoustic models used are appropriate for the task i.e. detect...
Article
Plosives in Indo-Aryan languages such as Hindi and Marathi display a 4-way contrast involving the two dimensions of voicing and aspiration. While many studies are available on the acoustics of aspiration in unvoiced stops due to their more universal presence in the world's languages, voiced aspirated plosives have been less studied. Rather than the...
Article
Full-text available
The classification of unvoiced stops in consonant-vowel (CV) syllables, segmented from continuous speech, is investigated by features related to speech production. As burst and vocalic transitions contribute to identification of stops in the CV context, features are computed from both regions. Although formants are the truly discriminating articula...
Conference Paper
Full-text available
Errors of speech recognition systems occur due to a variety of reasons. It is desirable to have a confidence measure that gives an idea of the accuracy of the decoder output, so that appropriate remedial measures can be taken. In this paper, we compare two approaches to detect incorrect output of a speech recognition system. The first approach empl...
Conference Paper
Full-text available
Computer-aided spoken language learning has been an important area of research. The assessment of a learner's pronunciation with respect to native pronunciation lends itself to automation using speech recognition technology. However, phone recognition accuracies achievable in state-of-the-art automatic speech recognition systems make their direct ap...
Article
Unvoiced stops are rapidly varying sounds with acoustic cues to place identity linked to the temporal dynamics. Neurophysiological studies have indicated the importance of joint spectro-temporal processing in the human perception of stops. In this study, two distinct approaches to modeling the spectro-temporal envelope of unvoiced stop phone segmen...
Article
Full-text available
Rāga forms the melodic framework for most of the music of the Indian subcontinent. Thus automatic rāga recognition is a fundamental step in the computational modelling of the Indian art-music traditions. In this work, we investigate the properties of rāga and the natural processes by which people identify it. We bring together and discuss the previ...
Conference Paper
To make the vast digital archives of music more easily accessible, it is necessary to have searchable music descriptors, or metadata, that are meaningful and robust. While metadata conventionally covers factual information that accompanies the music on a CD such as genre, composer, artist, it could also include community-contributed semantic labels...
Article
Several researchers have established the superior performance of spatial audio over monaural audio in multi-talker environments. The advantage provided by spatial audio can be incorporated in an audio teleconferencing system for an enhanced user experience. In our earlier work, a novel scheme for the spatial rendition of audio in an audio teleconfe...
Conference Paper
The proper pronunciation of lyrics is an important component of vocal music. While automatic vowel classification has been widely studied for speech, a separate investigation of the methods is needed for singing due to the differences in acoustic properties between sung and spoken vowels. Acoustic features combining spectrum envelope and pitch are...
Article
Audio processing applications that use short-time signal analysis techniques typically utilize fixed window duration single- or multi-resolution analyses. However, different real-world signal conditions such as polyphony and non-stationarity, manifested as musical accompaniment and pitch-modulations, respectively, in the context of music content...
Article
Full-text available
An instrumental accompaniment system for Indian classical vocal music is designed and implemented on a Texas Instruments Digital Signal Processor TMS320C6713. This will act as a virtual accompanist following the main artist, possibly a vocalist. The melodic pitch information drives an instrument synthesis system, which allows us to play any pitched...
Article
Full-text available
Melodic motifs form essential building blocks in Indian Classical music. The motifs, or key phrases, provide strong cues to the identity of the underlying raga in both Hindustani and Carnatic styles of Indian music. Thus the automatic detection of such recurring basic melodic shapes from audio is of relevance in music information retrieval. The ext...
Conference Paper
Full-text available
The effectiveness of audio content analysis for music retrieval may be enhanced by the use of available metadata. In the present work, observed differences in singing style and instrumentation across genres are used to adapt acoustic features for the singing voice detection task. Timbral descriptors traditionally used to discriminate singing voice...
Conference Paper
The meter of a musical excerpt provides high-level rhythmic information and is valuable in many music information retrieval tasks. We investigate the use of a computationally efficient approach to metrical analysis based on psycho-acoustically motivated decomposition of the audio signal. A two-stage comb filter-based approach, originally proposed f...
Conference Paper
Important aspects of singing ability include musical accuracy and voice quality. In the context of Indian classical music, not only is the correct sequence of notes important to musical accuracy but also the nature of pitch transitions between notes. These transitions are essentially related to gamakas (ornaments) that are important to the aestheti...
Conference Paper
Full-text available
The automatic classification of musical genre from audio signals has been a topic of active research in recent years. Although the identification of genre is a subjective task that likely involves high-level musical attributes such as instrumentation, style, rhythm and melody, low-level acoustic features have been widely applied to the automatic ta...
Conference Paper
Full-text available
Aspiration is an important phonemic feature in several Indian languages. Unlike English, languages such as Marathi have lexicons in which words with different meanings differ only in the aspiration feature of the initial voiced or unvoiced stop. Thus the reliable discrimination of aspirated stops from their unaspirated counterparts is important in...
Conference Paper
This paper proposes a novel scheme for the spatial rendition of audio in a teleconferencing situation. Several researchers have examined and established that there is a substantial improvement in intelligibility by the use of spatialization in such a multitalker environment, over monaural rendition. We provide experimentally obtained values for the...
Article
An investigation of acoustic features relating to vehicular traffic on roadways is reported. Computable features that relate to the type of vehicle and state of motion can be useful in monitoring traffic congestion. In the present work, different vehicles, broadly classified into two-wheelers, three-wheelers and heavy vehicles, are studied for their acoustic...
Article
Full-text available
Raaga is the spine of Indian classical music. It is the single most crucial element of the melodic framework on which the music of the subcontinent thrives. Naturally, automatic raaga recognition is an important step in computational musicology as far as Indian music is concerned. It has several applications like indexing Indian music, automatic n...
Article
Digital audio broadcast monitoring is the task of detecting and locating occurrences of specific audio content in broadcast streams. It has important applications in the media industry such as automatic monitoring of the airing of advertisements or commercials. The monitoring is accomplished by searching the streaming audio to detect regions wher...
Conference Paper
Feedback on pronunciation or articulation is an important component of spoken language teaching. Automating this aspect with speech recognition technology has been an active area of research in the context of computer-aided language-learning systems. Well-known limitations in the accuracy of automatic speech recognition (ASR) systems pose challeng...
Article
Full-text available
Detection of perceived tempo of music is an important aspect of music information retrieval. Perceived tempo depends in a complex manner on the rhythm structure of the audio signal. Machine learning approaches, proposed recently, avoid peak picking and use rhythm pattern matching with stored tempo annotated songs in the database. We investigate dif...
Article
Full-text available
Melody extraction algorithms for single-channel polyphonic music typically rely on the salience of the lead melodic instrument, considered here to be the singing voice. However, the simultaneous presence of one or more pitched instruments in the polyphony can cause such a predominant-F0 tracker to switch between tracking the pitch of the voice and t...
Conference Paper
We propose a new time-adaptive windowing technique to obtain a sparse time-frequency representation for audio signals. This transformation helps in providing better source separation from stereo mixtures for improved subsequent spatial rendering over headphones. We start with standard stereo audio recordings, transform them to a sparse representati...
Conference Paper
Full-text available
The automatic extraction of the melody of the music from polyphonic recordings is a challenging problem for which no general solutions currently exist. We present a novel interface for semi-automatic melody extraction with the goal of providing highly accurate pitch tracks of the lead voice with minimal user intervention. Audio-visual feedback facili...
Conference Paper
Full-text available
Quantitative evaluation of the quality of a speaker's pronunciation of the vowels of a language can contribute to the important task of speaker accent detection. Our aim is to qualitatively and quantitatively distinguish between native and non-native speakers of a language on the basis of a comparative study of two analysis methods. One deals with...
Conference Paper
In low rate speech coders, frame-based speech spectral parameters, represented by line spectral frequencies (LSF), are typically encoded by vector quantization without exploiting explicitly the temporal correlation between frames. Recently, however, methods have been proposed that model the temporal trajectory of each LSF over a speech segment by a...
Conference Paper
Full-text available
Correct and temporally accurate phonetic segmentation of speech utterances is important in applications ranging from transcription alignment to pronunciation error detection. Automatic speech recognizers used in these tasks provide insufficient temporal alignment accuracy apart from a recognition performance that is sensitive to accent and style va...
Article
Full-text available
In this paper we investigate information pertaining to the intonation of swaras (scale-degrees) in Hindustani Classical Music for automatically identifying ragas. We briefly explain why raga identification is an interesting problem and the various attributes that characterize a raga. We look at two approaches by other authors that exploit some of t...
Article
Music information retrieval is currently an active research area that addresses the extraction of musically important information from audio signals, and the applications of such information. The extracted information can be used for search and retrieval of music in recommendation systems, or to aid musicological studies or even in music learning....
Conference Paper
Full-text available
Landmark based recognition of unvoiced word-initial stops is investigated. The relative effectiveness of acoustic-phonetic attributes versus more global spectral shape features is experimentally evaluated for four-way place classification of unvoiced, unaspirated stops. Various feature sets derived from the burst and vocalic transition regions of w...
