Multi Head Attention Based Normalization in proposed FSM-SS architecture

Source publication
Preprint
The style of speech varies from person to person: every speaker exhibits an individual style of speaking that is determined by language, geography, culture and other factors. Style is best captured by the prosody of a signal. High-quality multi-speaker speech synthesis that accounts for prosody in a few-shot manner is an area of active r...

Context in source publication

Context 1
... three audio-related features: the speaker embedding, fundamental frequency and energy of the reference speech sample, which are important for capturing the prosody of the reference speech. These three features are passed into the convolution layer to generate the affine parameters (Fig. 3). The parameter ρ is used to combine these parameters (Equation (1)). As shown in Fig. 4, we concatenate the speaker embedding (a 256-dimensional vector) with the fundamental frequency and energy of the reference speech sample to generate a tensor of size (audio frames × 258 × batches). This tensor is then fed into the multi-head attention network to generate the affine parameters. These affine parameters (Equation (3)) are used to bias β ...
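
The concatenation and attention step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the excerpt does not give the number of heads, the weight shapes, or how the attention output is mapped to affine parameters. The sketch assumes the speaker embedding is tiled across frames, uses 2 heads (258 is divisible by 2), and uses hypothetical projection matrices `w_gamma` and `w_beta` with a hypothetical channel count of 8. All function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_prosody_tensor(speaker_emb, f0, energy):
    """Concatenate a 256-d speaker embedding with per-frame F0 and energy.

    speaker_emb: (batch, 256); f0, energy: (batch, frames).
    Returns a tensor of shape (frames, 258, batch), the layout named in the text.
    Assumption: the embedding is repeated for every audio frame.
    """
    batch, frames = f0.shape
    spk = np.repeat(speaker_emb[:, None, :], frames, axis=1)   # (batch, frames, 256)
    feats = np.concatenate([spk, f0[..., None], energy[..., None]], axis=-1)
    return feats.transpose(1, 2, 0)                             # (frames, 258, batch)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over frames. x: (batch, frames, dim)."""
    batch, frames, dim = x.shape
    head_dim = dim // num_heads

    def split(t):
        # (batch, frames, dim) -> (batch, heads, frames, head_dim)
        return t.reshape(batch, frames, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)   # (batch, heads, frames, frames)
    ctx = softmax(scores, axis=-1) @ v                          # (batch, heads, frames, head_dim)
    ctx = ctx.transpose(0, 2, 1, 3).reshape(batch, frames, dim)
    return ctx @ w_o

# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
batch, frames, dim, num_heads, channels = 2, 5, 258, 2, 8

speaker_emb = rng.normal(size=(batch, 256))
f0 = rng.normal(size=(batch, frames))
energy = rng.normal(size=(batch, frames))

prosody = build_prosody_tensor(speaker_emb, f0, energy)        # (frames, 258, batch)
x = prosody.transpose(2, 0, 1)                                  # (batch, frames, 258)

w_q, w_k, w_v, w_o = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(4))
attended = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)

# Assumed final step: linear projections to per-frame scale and bias parameters.
w_gamma = rng.normal(scale=dim ** -0.5, size=(dim, channels))
w_beta = rng.normal(scale=dim ** -0.5, size=(dim, channels))
gamma, beta = attended @ w_gamma, attended @ w_beta

print(prosody.shape, attended.shape, gamma.shape, beta.shape)
```

The 258 in the tensor size is simply 256 embedding dimensions plus one fundamental-frequency value and one energy value per frame; any head count used with this width must divide 258 evenly.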