Article

Separation of Music+Effects sound track from several international versions of the same movie

... Guided source separation methods cover a number of techniques that make it possible to take into account information about the sources [111,173]. Different types of external information can be incorporated, for example symbolic information in the form of a MIDI (Musical Instrument Digital Interface) file [60] or text [105], information provided by a user [44,131,157], or information in the form of a signal [113,157]. We focus in particular on this last case, also called separation guided by a reference signal, or exemplar-based separation as defined by Liutkus et al. [111]. These guided approaches should not be confused with informed source separation [110,137], whose goal is the coding and transmission of the sources. ...
... A second situation is the one where the source signal present in the reference and the source signal to be estimated are the same [109,113]. The problem can then reduce to estimating the different mixing processes (that of the mixture to be separated and that of the reference), since the reference is itself a mixture in this case. ...
... Separation of common signals. One speaks of a common signal or of "separation of common components" [116] when the same source appears identically in two different mixtures. For example, in the case of films in different languages, the same music will appear in the different versions, but mixed with different dialogs [29,109,113]. The same situation arises when the references are repeated elements coming from the same film [160,161]. ...
Thesis
When manipulating an audio signal, it is generally useful to isolate the sound element(s) one wishes to process. This step is commonly called audio source separation. Many techniques exist to estimate these sources, and the more information about them is taken into account, the better the chances of a successful separation. One way to incorporate information about a source is to use a reference signal that provides a first approximation of that source. This thesis explores the theoretical and applied aspects of audio source separation guided by a reference signal. The proposed approach, called SPOtted REference based Separation (SPORES), examines the particular case where the references are obtained automatically by motif spotting, that is, by a search for similar content. For such an approach to be useful, the processed content must exhibit some redundancy, or a large database must be available. Fortunately, the current context often places us in one of these two situations, so that similar motifs can be found elsewhere. The first objective of this work is to provide a broad theoretical framework that, once established, will facilitate the efficient development of tools for processing varied audio content. The second objective is the specific application of this approach to the processing of film soundtracks, for example their conversion to the 5.1 surround format used by home cinema systems.
... In this paper, we focus on methods that guide the separation process by a reference signal that is similar to one of the target sources [10], [14]-[16], [23]-[27]. Such a framework can be referred to as reference-guided source separation, and it has recently been used in several scenarios: the restoration of music pieces guided by isolated piano sounds [10], the separation of music and sound effects from speech guided by several versions of the same movie in different languages [23], the separation of musical instruments guided by a multitrack cover version of a song [24], [25], and the denoising of speech guided by the same sentence pronounced by the same speaker [26] or by a different speaker [16]. Symbolic information such as a text [16] or a musical score [12] can also be used to generate reference signals. ...
... The proposed framework generalizes the state-of-the-art approaches in [10], [16], [27] as they exploit similar models. Our framework can also model the same kind of signals as used in [10], [14], [23]-[26] even if the models can be quite different. Finally, it makes it possible to investigate some new scenarios that have been put forward in [27], like music source separation for a verse guided by another verse. ...
Article
Full-text available
We present a general multi-channel source separation framework where additional audio references are available for one (or more) source(s) of a given mixture. Each audio reference is another mixture which is supposed to contain at least one source similar to one of the target sources. Deformations between the sources of interest and their references are modeled in a linear manner using a generic formulation. This is done by adding transformation matrices to an excitation-filter model, hence affecting different axes, namely frequency, dictionary component or time. A nonnegative matrix co-factorization algorithm and a generalized expectation-maximization algorithm are used to estimate the parameters of the model. Different model parameterizations and different combinations of algorithms are tested on music plus voice mixtures guided by music and/or voice references and on professionally-produced music recordings guided by cover references. Our algorithms improve the signal-to-distortion ratio (SDR) of the sources with the lowest intensity by 9 to 15 decibels (dB) with respect to original mixtures.
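As an illustration of the co-factorization idea described above, the sketch below jointly factorizes a mixture spectrogram and a reference spectrogram that share one dictionary. It uses plain multiplicative updates under the Euclidean cost rather than the paper's excitation-filter model with transformation matrices and GEM estimation, so it is a minimal sketch of the shared-dictionary principle, not the authors' algorithm; function names and rank choices are illustrative.

```python
import numpy as np

def cofactorize(V_mix, V_ref, K_shared=20, K_res=20, n_iter=200, eps=1e-9):
    """Joint NMF of a mixture and a reference magnitude spectrogram.

    V_mix ~ W Hm + Wr Hr   (shared part + mixture-specific residual)
    V_ref ~ W Hf           (the reference constrains the shared dictionary W)
    Multiplicative updates under the Euclidean cost, for illustration only.
    """
    F, Tm = V_mix.shape
    _, Tf = V_ref.shape
    rng = np.random.default_rng(0)
    W  = rng.random((F, K_shared)) + eps
    Wr = rng.random((F, K_res)) + eps
    Hm = rng.random((K_shared, Tm)) + eps
    Hr = rng.random((K_res, Tm)) + eps
    Hf = rng.random((K_shared, Tf)) + eps
    for _ in range(n_iter):
        Vm_hat = W @ Hm + Wr @ Hr
        Vf_hat = W @ Hf
        # W gathers evidence from both the mixture and the reference
        W  *= (V_mix @ Hm.T + V_ref @ Hf.T) / (Vm_hat @ Hm.T + Vf_hat @ Hf.T + eps)
        Wr *= (V_mix @ Hr.T) / (Vm_hat @ Hr.T + eps)
        Vm_hat = W @ Hm + Wr @ Hr
        Hm *= (W.T @ V_mix) / (W.T @ Vm_hat + eps)
        Hr *= (Wr.T @ V_mix) / (Wr.T @ Vm_hat + eps)
        Hf *= (W.T @ V_ref) / (W.T @ (W @ Hf) + eps)
    # Wiener-style mask for the shared source in the mixture
    S_shared = (W @ Hm) / (W @ Hm + Wr @ Hr + eps) * V_mix
    return S_shared, (W, Wr, Hm, Hr, Hf)
```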
... The problem is formulated here as a multichannel (more than two input channels), underdetermined (more sources than mixtures) source separation problem. Apart from our previous work on the subject [1], we believe this is a new application of multichannel source separation. ...
... It is thus supposed that a preprocessing has been applied to compensate these issues and to approximate the linearity of the mixing. A preprocessing block aimed at aligning and equalizing input tracks was presented in our previous work [1]. ...
... In this work we present the comparative evaluation of 5 methods for extracting the common signal b(t), and in some cases the dialog signals vi(t), from the linear mixture model of Eq. 2. Two of the methods were already presented in a previous work [1] and will be summarized in the following section. The three other approaches are novel: the first (cone) is based on an extension of the DUET algorithm [2] and the two others (N-SP, N-SP-SUB) on an extension of the SP algorithm [3]. ...
Conference Paper
We address the task of separation of music and effects from dialogs in film or television soundtracks. This is of interest for film studios wanting to release films in new, previously unavailable languages when the original separated music and effects track is not available. For this purpose, we propose several methods for common signal extraction from a set of soundtracks in different languages, which are multichannel extensions of previous methods for center signal extraction from stereo signals. The proposed methods are simple, effective, and have an intuitive geometrical interpretation. Experiments show that the proposed methods improve the results provided by our previously proposed methods based on basic filtering techniques.
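The abstract does not spell out the extraction methods, so the following is only a crude baseline in the spirit of center-signal extraction generalized to N versions: per time-frequency bin, the common music and effects signal can be no louder than the quietest observation, so the bin of minimum magnitude is kept. The function name is hypothetical and the tracks are assumed mono, time-aligned, and of equal length.

```python
import numpy as np
from scipy.signal import stft, istft

def extract_common(tracks, fs, nperseg=2048):
    """Crude common-signal estimate from N aligned mono versions.

    In each time-frequency bin, keep the observation of minimum
    magnitude (phase taken from that same version): the common signal
    cannot exceed the smallest mixture bin. An illustrative baseline,
    not the authors' method.
    """
    specs = np.array([stft(x, fs=fs, nperseg=nperseg)[2] for x in tracks])
    mags = np.abs(specs)                      # shape (N, F, T)
    idx = np.argmin(mags, axis=0)             # quietest version per bin
    common = np.take_along_axis(specs, idx[None, :, :], axis=0)[0]
    _, b_est = istft(common, fs=fs, nperseg=nperseg)
    return b_est
```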
... A priori known non-vocal segments are used to train an accompaniment model based on Probabilistic Latent Component Analysis (PLCA) [9]. Vocal/non-vocal segmentation is performed using MFCCs and Gaussian Mixture Models (GMMs) [8][9]. Bayesian models [10] are trained to adapt a background model learned from the non-vocal sections. Predominant pitch and melody estimators are applied to the vocal segments to extract the pitch contour [11], which is finally used to separate the speech components via binary TF masking. ...
... The autocorrelation matrix 'B' [10,11] is given by: ...
Conference Paper
Full-text available
Distinguishing the varying lead vocals from the background music in an audio recording is an extremely demanding task. Speech-separation research usually relies on time-frequency masking, a technique that also informs hearing-aid design. The core principle in music that is exploited to discriminate the underlying non-vocals from the vocals (speech) is repetition. The repetition property holds especially for pop songs, where the singer often overlays frequently changing vocals on a periodically repeating background. The approach of this paper is to recognize periodically repeating segments in audio excerpts, compare them with a repeating model, and finally extract the repeating musical patterns via time-frequency masking. A TF mask is derived from the TF representation of the signal. In the proposed algorithm, the quality of the foreground vocals and the accompanying background is assessed in terms of the Signal-to-Interference Ratio (SIR) using the ANOVA (Analysis of Variance) method on audio excerpts from six different musical genres.
... Common signal separation is a problem where, from a set of mixtures, one tries to recover a source that is shared among all of them. Practical applications range from film music extraction [1] to multichannel denoising [2,3]. Repeating pattern separation [4] focuses on separating a varying component (e.g. the singing voice) from a repeating background (e.g. ...
... In the common signal separation problem, the individual sources are often considered as noise and the shared component is the signal of interest. Redundancy in this case is the result of a multisensor acquisition [3] or the existence of multiple versions [1]. In the repeating pattern separation framework, the shared source is the musical background (e.g ...
Conference Paper
This paper describes a greedy algorithm for audio source separation of repeated musical patterns. The problem is understood as retrieving from a set of mixtures the part that is redundant among them and the parts that are specific to only one mixture. The key assumption is the sparsity of all the sources in the same multiscale dictionary. Synthetic and real-life examples of source separation of hand-cut repeated musical patterns are presented. Results show that the proposed method succeeds in simultaneously providing a sparse approximant of the mixtures and a separation of the sources.
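A toy rendering of the greedy idea, under the assumption of a shared dictionary of unit-norm atoms: at each step, the atom correlating most with the residuals is assigned either to the common part (when its correlations with both mixtures are similar) or to a mixture-specific part. This is a simplified reading of the approach, not the published algorithm; all names and thresholds are illustrative.

```python
import numpy as np

def joint_mp(x1, x2, dictionary, n_atoms=100, common_thresh=0.8):
    """Greedy pursuit splitting two mixtures into common and specific parts.

    `dictionary` is an (n_samples, n_total_atoms) matrix of unit-norm atoms.
    """
    r1, r2 = x1.astype(float), x2.astype(float)
    common = np.zeros_like(r1)
    spec1, spec2 = np.zeros_like(r1), np.zeros_like(r2)
    for _ in range(n_atoms):
        c1 = dictionary.T @ r1            # correlations with residual 1
        c2 = dictionary.T @ r2            # correlations with residual 2
        k = np.argmax(np.abs(c1) + np.abs(c2))
        a = dictionary[:, k]
        ratio = min(abs(c1[k]), abs(c2[k])) / (max(abs(c1[k]), abs(c2[k])) + 1e-12)
        if ratio > common_thresh and np.sign(c1[k]) == np.sign(c2[k]):
            amp = 0.5 * (c1[k] + c2[k])   # shared amplitude: atom is common
            common += amp * a
            r1 -= amp * a
            r2 -= amp * a
        elif abs(c1[k]) >= abs(c2[k]):    # atom specific to mixture 1
            spec1 += c1[k] * a
            r1 -= c1[k] * a
        else:                             # atom specific to mixture 2
            spec2 += c2[k] * a
            r2 -= c2[k] * a
    return common, spec1, spec2
```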
... In this case, the MNE track is common to the several versions, apart from various treatments (equalization, processing) and defects (mis-synchronization or drift, noise, clicks, etc.). Following previous work [1,2], the goal is to achieve realistic applications of this extraction process. In our previous approach [2], the MNE tracks are assumed to be identical (up to a gain factor) among the versions, the dialog tracks being signals specific to the versions. ...
... Indeed, a different filtering is often applied to the common signal by the local mix engineer, which is often the case when the MNE track is mixed with the language-specific signal. This effect can be somewhat compensated [1] using pre-computation of filters on signal parts where only the common signal is active, but such a method can lack robustness, especially when this type of data is missing. In this study, we integrate the filtering of the common signal as part of the signal model, so that it can be computed in the global estimation process. ...
Conference Paper
Full-text available
This paper addresses the extraction of a common signal among several mono audio tracks when this common signal undergoes a track-specific filtering. This problem arises in the extraction of a common music and effects track from a set of soundtracks in different languages. To this aim, a novel approach is proposed. The method is based on the dictionary modeling of track-specific and common signals, and is compared to a previous one proposed by the authors based on geometric considerations. The approach is integrated into a Non-Negative Matrix Factorization framework using the Itakura-Saito divergence. The method is evaluated on a synthetic database composed of filtered music and effects tracks, the filters being track-specific, and track-specific dialogs. The results show that this task becomes tractable, while the previously introduced method could not handle track-specific filtering.
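A minimal sketch of the signal model described above: each track spectrogram is a track-filtered common part plus a track-specific part, estimated jointly by NMF-style multiplicative updates. For readability the sketch uses the Euclidean cost rather than the paper's Itakura-Saito divergence; ranks and names are illustrative, and the tracks are assumed aligned and of equal length.

```python
import numpy as np

def common_filtered_nmf(V, Kc=20, Ks=20, n_iter=200, eps=1e-9):
    """Shared-source NMF with a per-track frequency filter.

    Each track spectrogram in the list V is modeled as
        V_i ~ g_i * (Wc Hc) + W_i H_i,
    where Wc Hc is the common music+effects part, g_i a track-specific
    frequency response, and W_i H_i the track-specific dialog.
    """
    N = len(V)
    F, T = V[0].shape
    rng = np.random.default_rng(0)
    Wc = rng.random((F, Kc)) + eps
    Hc = rng.random((Kc, T)) + eps
    g  = [np.ones((F, 1)) for _ in range(N)]
    Ws = [rng.random((F, Ks)) + eps for _ in range(N)]
    Hs = [rng.random((Ks, T)) + eps for _ in range(N)]
    for _ in range(n_iter):
        C = Wc @ Hc
        Vh = [g[i] * C + Ws[i] @ Hs[i] for i in range(N)]
        # track-specific filter and dialog updates
        for i in range(N):
            g[i] *= (V[i] * C).sum(1, keepdims=True) / ((Vh[i] * C).sum(1, keepdims=True) + eps)
            Ws[i] *= (V[i] @ Hs[i].T) / (Vh[i] @ Hs[i].T + eps)
            Hs[i] *= (Ws[i].T @ V[i]) / (Ws[i].T @ Vh[i] + eps)
        # the common part aggregates evidence from all tracks
        Vh = [g[i] * C + Ws[i] @ Hs[i] for i in range(N)]
        num = sum((g[i] * V[i]) @ Hc.T for i in range(N))
        den = sum((g[i] * Vh[i]) @ Hc.T for i in range(N)) + eps
        Wc *= num / den
        num = sum(Wc.T @ (g[i] * V[i]) for i in range(N))
        den = sum(Wc.T @ (g[i] * Vh[i]) for i in range(N)) + eps
        Hc *= num / den
    return Wc, Hc, g, Ws, Hs
```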
... 16 Furthermore, an additional audio reference can be used as side information, such as a multitrack cover version of the same song [17][18][19][20] or several international versions of the same movie. 21 Additionally, text can be used as side information to mimic the targeted speech signal. 22 The use of an exemplar gives control over which source in the mixture is to be separated. ...
Article
In this paper, a method is proposed to tackle the problem of single-channel audio separation. The proposed method leverages an exemplar source to emulate the targeted speech signal. A multicomponent nonnegative matrix factor 2D deconvolution (NMF2D) is proposed to model the temporal and spectral changes and the number of spectral bases of the audio signals. The paper proposes an artificial auxiliary channel to imitate a pair of stereo mixture signals, termed "artificial-stereophonic mixtures." The artificial-stereophonic mixtures and the exemplar source are jointly used to guide the factorization process of the NMF2D. The factorization is adapted under a hybrid framework that combines the generalized expectation-maximization algorithm with multiplicative update adaptation. The proposed algorithm leads to fast and stable convergence and ensures that the nonnegativity constraints of the solution are satisfied. Adaptive sparsity has also been introduced on each sparse parameter in the multicomponent NMF2D model for when the exemplar deviates from the target signal. Experimental results have shown the competence of the proposed algorithms in comparison with other algorithms.
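The NMF2D model of the paper is involved; the sketch below keeps only the exemplar-guidance ingredient using plain NMF: a basis learned from the exemplar is held fixed while the mixture is explained by that basis plus free atoms, and a soft mask recovers the exemplar-like source. A much simpler stand-in for the paper's method, with hypothetical names and ranks.

```python
import numpy as np

def exemplar_guided_nmf(V_mix, V_ex, Kt=30, Ko=30, n_iter=200, eps=1e-9):
    """Exemplar-guided separation with plain Euclidean NMF."""
    F, T = V_mix.shape
    rng = np.random.default_rng(0)
    # 1) learn the target basis Wt from the exemplar alone
    Wt = rng.random((F, Kt)) + eps
    He = rng.random((Kt, V_ex.shape[1])) + eps
    for _ in range(n_iter):
        Wt *= (V_ex @ He.T) / ((Wt @ He) @ He.T + eps)
        He *= (Wt.T @ V_ex) / (Wt.T @ (Wt @ He) + eps)
    # 2) explain the mixture with Wt fixed plus free atoms Wo
    Wo = rng.random((F, Ko)) + eps
    H = rng.random((Kt + Ko, T)) + eps
    for _ in range(n_iter):
        Vh = np.hstack([Wt, Wo]) @ H
        Wo *= (V_mix @ H[Kt:].T) / (Vh @ H[Kt:].T + eps)
        W = np.hstack([Wt, Wo])
        H *= (W.T @ V_mix) / (W.T @ (W @ H) + eps)
    # 3) soft mask for the exemplar-like source
    V_t = Wt @ H[:Kt]
    return V_t / (np.hstack([Wt, Wo]) @ H + eps) * V_mix
```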
... One could imagine building time-frequency masks from the sparse representation. The relevance of these approaches rests on the proper alignment of the mixtures. In a common-signal separation problem such as the extraction of the musical background from a collection of film soundtracks [LL10], this alignment is often problematic. The same holds for separating the voice in a song by detection of recurring motifs. ...
Article
The main goal of this work is automated processing of large volumes of audio data. More specifically, one is interested in archiving, a process that encompasses at least two distinct problems: data compression and data indexing. Jointly addressing these problems is a difficult task since many of their objectives may be concurrent. Building a consistent framework for audio archival is therefore the matter of this thesis. Sparse representations of signals in redundant dictionaries have recently been found of interest for many sub-problems of the archival task. Sparsity is a desirable property both for compression and for indexing. Methods and algorithms to build such representations are the first topic of this thesis. Given the dimensionality of the considered data, greedy algorithms are particularly studied. A first contribution of this thesis is a variant of the famous Matching Pursuit algorithm that exploits randomness and sub-sampling of very large time-frequency dictionaries. We show that audio compression (especially at low bit rates) can be improved using this method. This new algorithm comes with an original modeling of asymptotic pursuit behaviors, using order statistics and tools from extreme value theory. Other contributions deal with the second member of the archival problem: indexing. The same framework is applied to different layers of signal structure. First, redundancy and musical repetition detection is addressed. At a larger scale, we investigate audio fingerprinting schemes and apply them to on-line segmentation of radio broadcasts. Performances were evaluated during an international campaign within the QUAERO project. Finally, the same framework is used to perform source separation informed by redundancy. All these elements validate the proposed framework for the audio archiving task. The layered structures of audio data are accessed hierarchically by greedy decomposition algorithms, allowing the different objectives of archival to be addressed at different steps within the same framework.
... Coding extra cues through informed source separation has been introduced in [3,4]. A novel approach introduced in [5,6] incorporates user assistance into separating music and speech mixtures; [7][8][9][10] have also used the idea of [5] with various approaches. State-of-the-art approaches that use spectral harmonics are introduced in [11,12]. ...
... Whereas MIDI files may provide symbolic data that can help separation, there are cases where additional audio recordings are available, which are known to be related to the mixture to separate. For example, common signal separation [17,15] was proposed as a way to separate the music+effect track from surrounding music in a movie soundtrack. To this purpose, the music is assumed to be the same in several international versions of the same movie. ...
Conference Paper
Full-text available
Audio source separation consists in recovering different unknown signals called sources by filtering their observed mixtures. In music processing, most mixtures are stereophonic songs and the sources are the individual signals played by the instruments, e.g. bass, vocals, guitar, etc. Source separation is often achieved through a classical generalized Wiener filtering, which is controlled by parameters such as the power spectrograms and the spatial locations of the sources. For an efficient filtering, those parameters need to be available and their estimation is the main challenge faced by separation algorithms. In the blind scenario, only the mixtures are available and performance strongly depends on the mixtures considered. In recent years, much research has focused on informed separation, which consists in using additional available information about the sources to improve the separation quality. In this paper, we review some recent trends in this direction.
Conference Paper
Full-text available
Distinguishing the varying lead vocals from the background music in an audio recording is an extremely demanding task. Speech-separation research usually relies on time-frequency masking, a technique that also informs hearing-aid design. The core principle in music that is exploited to discriminate the underlying non-vocals from the vocals (speech) is repetition. The repetition property holds especially for pop songs, where the singer often overlays frequently changing vocals on a periodically repeating background. The approach of this paper is to recognize periodically repeating segments in audio excerpts, compare them with a repeating model, and finally extract the repeating musical patterns via time-frequency masking. A TF mask is derived from the TF representation of the signal. In the proposed algorithm, the quality of the foreground vocals and the accompanying background is assessed in terms of the Signal-to-Interference Ratio (SIR) using the ANOVA (Analysis of Variance) method on audio excerpts from six different musical genres, with a full comparison of SIR values using Hamming, Hann, and Blackman windows; the separation of monaural vocal and non-vocal components using the Blackman window shows better SIR values than with the Hann and Hamming windows.
Conference Paper
The enormous applications and future necessities of the image processing area open new paths for researchers. The analysis and recognition of numerous documents in the form of images is a challenging task. Classification is one of the vital phases of image processing, and classification methods must be consistent and accurate. This research analyzes different classification techniques, and a performance analysis is carried out by testing image classification principles. In this paper, several classification techniques for Devanagari script are considered and evaluated in MATLAB R2014a.
Conference Paper
A method for removing impulse noise from audio signals by fusing multiple copies of the same recording is introduced in this paper. The proposed algorithm exploits the fact that while multiple copies of a given recording are generally available, all sharing the same master, most degradations in audio signals are record-dependent. Our method first seeks the optimal non-rigid alignment of the signals that is robust to the presence of sparse outliers of arbitrary magnitude. Unlike previous approaches, we simultaneously find the optimal alignment of the signals and the impulsive degradation. This is obtained via continuous dynamic time warping computed by solving an Eikonal equation. We propose to use our approach in the derivative domain, reconstructing the signal by solving an inverse problem that resembles the Poisson image editing technique. The proposed framework is illustrated and tested on the restoration of old gramophone recordings, showing promising results; however, it can be used in other applications where different copies of the signal of interest are available and the degradations are copy-dependent.
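The paper solves a continuous dynamic time warping via an Eikonal equation; as a rough stand-in, the classic discrete DTW below shows the alignment step that must precede fusing the copies. Inputs are assumed to be 1-D feature sequences (e.g. envelopes) of the two copies; the function name is hypothetical.

```python
import numpy as np

def dtw_align(a, b):
    """Plain discrete DTW between two 1-D feature sequences.

    Returns the warping path pairing indices of `a` and `b` plus the
    total alignment cost. A stand-in for the paper's continuous
    warping, for illustration only.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack the optimal path from the end
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1], D[n, m]
```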
Article
Repetition is a core principle in music. Many musical pieces are characterized by an underlying repeating structure over which varying elements are superimposed. This is especially true for pop songs where a singer often overlays varying vocals on a repeating accompaniment. On this basis, we present the REpeating Pattern Extraction Technique (REPET), a novel and simple approach for separating the repeating “background” from the non-repeating “foreground” in a mixture. The basic idea is to identify the periodically repeating segments in the audio, compare them to a repeating segment model derived from them, and extract the repeating patterns via time-frequency masking. Experiments on data sets of 1,000 song clips and 14 full-track real-world songs showed that this method can be successfully applied for music/voice separation, competing with two recent state-of-the-art approaches. Further experiments showed that REPET can also be used as a preprocessor to pitch detection algorithms to improve melody extraction.
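The core REPET computation is compact enough to sketch directly from the description above: segments one period long are compared to their element-wise median, and a soft mask keeps the bins explained by the repeating model. The repeating period is assumed given here (the full method finds it from a beat spectrum) and assumed shorter than the spectrogram.

```python
import numpy as np

def repet_mask(V, period):
    """REPET-style soft mask for the repeating background.

    V is a magnitude spectrogram, `period` the repeating period in
    frames. Apply the returned mask to the complex STFT to recover the
    background; 1 - mask gives the non-repeating foreground.
    """
    F, T = V.shape
    n_seg = T // period
    segs = V[:, :n_seg * period].reshape(F, n_seg, period)
    model = np.median(segs, axis=1)         # repeating segment model
    tiled = np.tile(model, (1, n_seg))
    Vc = V[:, :n_seg * period]
    W = np.minimum(tiled, Vc)               # repeating part cannot exceed V
    mask = np.ones_like(V)                  # tail frames past the last full
    mask[:, :n_seg * period] = W / (Vc + 1e-9)  # period are left unmasked
    return mask
```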
Conference Paper
Full-text available
Existing perceptual models of audio quality, such as PEAQ, were designed to measure audio codec performance and are not well suited to evaluation of audio source separation algorithms. The relationship of many other signal quality measures to human perception is not well established. We collected subjective human assessments of distortions encountered when separating audio sources from mixtures of two to four harmonic sources. We then correlated these assessments to 18 machine-measurable parameters. Results show a strong correlation (r=0.96) between a linear combination of a subset of four of these parameters and mean human assessments. This correlation is stronger than that between human assessments and several measures currently in use.
Article
Full-text available
Extracting the main melody from a polyphonic music recording seems natural even to untrained human listeners. To a certain extent it is related to the concept of source separation, with the human ability of focusing on a specific source in order to extract relevant information. In this paper, we propose a new approach for the estimation and extraction of the main melody (and in particular the leading vocal part) from polyphonic audio signals. To that aim, we propose a new signal model where the leading vocal part is explicitly represented by a specific source/filter model. The proposed representation is investigated in the framework of two statistical models: a Gaussian Scaled Mixture Model (GSMM) and an extended Instantaneous Mixture Model (IMM). For both models, the estimation of the different parameters is done within a maximum-likelihood framework adapted from single-channel source separation techniques. The desired sequence of fundamental frequencies is then inferred from the estimated parameters. The results obtained in a recent evaluation campaign (MIREX08) show that the proposed approaches are very promising and reach state-of-the-art performances on all test sets.
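The GSMM/IMM estimation of the paper is too long to sketch here, but the underlying melody-salience idea can be made concrete with a simple harmonic-summation function, a far simpler stand-in: for each candidate f0, the magnitudes at its first harmonics are summed, and the per-frame salience peak traces the melody. All names are illustrative.

```python
import numpy as np

def f0_salience(mag_frame, freqs, f0_grid, n_harm=10):
    """Harmonic-summation salience for one magnitude-spectrum frame.

    `freqs` are the bin frequencies (Hz), `f0_grid` the candidate
    fundamentals (Hz). The argmax of the returned salience per frame
    gives a crude melody estimate.
    """
    sal = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        harm = np.arange(1, n_harm + 1) * f0
        idx = np.searchsorted(freqs, harm)   # nearest bins of the harmonics
        idx = idx[idx < len(freqs)]
        sal[i] = mag_frame[idx].sum()
    return sal
```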
Article
Full-text available
In this paper, we discuss the evaluation of blind audio source separation (BASS) algorithms. Depending on the exact application, different distortions can be allowed between an estimated source and the wanted true source. We consider four different sets of such allowed distortions, from time-invariant gains to time-varying filters. In each case, we decompose the estimated source into a true source part plus error terms corresponding to interferences, additive noise, and algorithmic artifacts. Then, we derive a global performance measure using an energy ratio, plus a separate performance measure for each error term. These measures are computed and discussed on the results of several BASS problems with various difficulty levels
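For the simplest of the four distortion sets (time-invariant gains), the decomposition into target, interference, and artifact terms reduces to a few projections, as sketched below; SDR, SIR, and SAR then follow as energy ratios in dB. A minimal single-channel sketch consistent with the definitions above, not the reference implementation.

```python
import numpy as np

def bss_eval_gain(est, src, others):
    """Energy-ratio measures for the time-invariant-gain distortion set.

    `est` is the estimated source, `src` the matching true source,
    `others` the remaining true sources (all 1-D arrays of equal length).
    """
    def proj(y, basis):
        # least-squares projection of y onto the span of the basis rows
        B = np.atleast_2d(basis)
        coef, *_ = np.linalg.lstsq(B.T, y, rcond=None)
        return B.T @ coef
    s_target = np.dot(est, src) / np.dot(src, src) * src
    e_interf = proj(est, np.vstack([src] + list(others))) - s_target
    e_artif = est - s_target - e_interf
    db = lambda num, den: 10 * np.log10(np.sum(num**2) / (np.sum(den**2) + 1e-12))
    sdr = db(s_target, e_interf + e_artif)
    sir = db(s_target, e_interf)
    sar = db(s_target + e_interf, e_artif)
    return sdr, sir, sar
```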
Article
Full-text available
Binary time-frequency masks are powerful tools for the separation of sources from a single mixture. Perfect demixing via binary time-frequency masks is possible provided the time-frequency representations of the sources do not overlap: a condition we call W-disjoint orthogonality. We introduce here the concept of approximate W-disjoint orthogonality and present experimental results demonstrating the level of approximate W-disjoint orthogonality of speech in mixtures of various orders. The results demonstrate that there exist ideal binary time-frequency masks that can separate several speech signals from one mixture. While determining these masks blindly from just one mixture is an open problem, we show that we can approximate the ideal masks in the case where two anechoic mixtures are provided. Motivated by the maximum likelihood mixing parameter estimators, we define a power weighted two-dimensional (2-D) histogram constructed from the ratio of the time-frequency representations of the mixtures that is shown to have one peak for each source with peak location corresponding to the relative attenuation and delay mixing parameters. The histogram is used to create time-frequency masks that partition one of the mixtures into the original sources. Experimental results on speech mixtures verify the technique. Example demixing results can be found online at http://alum.mit.edu/www/rickard/bss.html.
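A condensed sketch of the DUET masking step under the stated anechoic assumptions: per time-frequency bin, relative attenuation and delay are read off the ratio of the two mixture STFTs, and each bin is assigned to the nearest cluster center. The full method finds the centers as peaks of the power-weighted 2-D histogram; here they are assumed given, and the function name is hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft

def duet_separate(x1, x2, fs, centers, nperseg=1024):
    """Binary-mask separation from two anechoic mixtures, DUET-style.

    `centers` is a list of (attenuation, delay) pairs, one per source,
    assumed already estimated from the mixing-parameter histogram.
    """
    f, t, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    eps = 1e-12
    R = (X2 + eps) / (X1 + eps)
    a = np.abs(R)                            # relative attenuation per bin
    w = 2 * np.pi * f[:, None]               # bin frequencies in rad/s
    w[0] = eps                               # avoid divide-by-zero at DC
    d = -np.angle(R) / w                     # relative delay per bin (s)
    dist = np.stack([(a - ac) ** 2 + (d - dc) ** 2 for ac, dc in centers])
    label = np.argmin(dist, axis=0)          # nearest center per bin
    sources = []
    for k in range(len(centers)):
        _, s = istft((label == k) * X1, fs=fs, nperseg=nperseg)
        sources.append(s)
    return sources
```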
A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. These parameters are estimated from the short-time Fourier transform using a simple peak-picking algorithm. Rapid changes in the highly resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. For a given frequency track a cubic function is used to unwrap and interpolate the phase such that the phase track is maximally smooth. This phase function is applied to a sine-wave generator, which is amplitude modulated and added to the other sine waves to give the final speech output. The resulting synthetic waveform preserves the general waveform shape and is essentially perceptually indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech as well as the noise are maintained. In addition, it was found that the representation was sufficiently general that high-quality reproduction was obtained for a larger class of inputs including: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. Finally, the analysis/synthesis system forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and midrate speech coding [8], [9].
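A minimal sketch of the per-frame analysis/synthesis described above: peak picking on the short-time spectrum yields (amplitude, frequency, phase) triplets, which a bank of sine-wave generators resynthesizes. Frame-to-frame track birth/death and cubic phase interpolation, central to the full system, are omitted; names are illustrative.

```python
import numpy as np

def sinusoidal_frame(frame, fs, n_peaks=20):
    """Per-frame sinusoidal analysis by simple peak picking.

    Returns (amplitudes, frequencies, phases) of the largest spectral
    peaks of one frame; the Hann-window amplitude correction is a
    rough approximation.
    """
    N = len(frame)
    win = np.hanning(N)
    X = np.fft.rfft(frame * win)
    mag = np.abs(X)
    # local maxima of the magnitude spectrum, strongest first
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
    peaks = sorted(peaks, key=lambda k: mag[k], reverse=True)[:n_peaks]
    freqs = np.array(peaks) * fs / N
    amps = 2 * mag[peaks] / win.sum()
    phases = np.angle(X)[peaks]
    return amps, freqs, phases

def synth_frame(amps, freqs, phases, fs, N):
    """Resynthesize one frame as a sum of sine waves."""
    t = np.arange(N) / fs
    return sum(a * np.cos(2 * np.pi * f * t + p)
               for a, f, p in zip(amps, freqs, phases))
```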