Figure 4 - available via license: Creative Commons Attribution-ShareAlike 4.0 International
Source publication
Speaking style varies from person to person: every speaker has an individual style that is shaped by language, geography, culture and other factors. Style is best captured by the prosody of a signal. High-quality, few-shot multi-speaker speech synthesis that accounts for prosody is an area of active r...
Context in source publication
Context 1
... three audio-related features: the speaker embedding, fundamental frequency and energy of the reference speech sample, which are important for capturing the prosody of the reference speech. These three features are passed into the convolution layer to generate the affine parameters (Fig. 3). The parameter ρ is used to combine these parameters (Equation (1)). As shown in Fig. 4, we concatenate the speaker embedding (a 256-dimensional vector) with the fundamental frequency and energy of the reference speech sample to generate a tensor of size (audio-frames × 258 × batches). This tensor is then fed into the multi-head attention network to generate the affine parameters. These affine parameters (Equation (3)) are used to bias β ...
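The concatenation step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_prosody_features` and `multi_head_self_attention`, the random projection weights, the choice of 2 attention heads, and the toy batch and frame sizes are all hypothetical. How the attention output is mapped to the affine parameters is not specified in the excerpt, so that step is left out.

```python
import numpy as np


def build_prosody_features(speaker_emb, f0, energy):
    """Concatenate speaker embedding, F0 and energy per frame.

    speaker_emb: (batch, 256) utterance-level embedding
    f0, energy:  (batch, frames) per-frame values
    Returns an array of shape (frames, 258, batch).
    """
    batch, frames = f0.shape
    # Repeat the utterance-level speaker embedding for every frame.
    spk = np.broadcast_to(
        speaker_emb[:, None, :], (batch, frames, speaker_emb.shape[1])
    )
    # Append F0 and energy as two extra channels: 256 + 1 + 1 = 258.
    feats = np.concatenate([spk, f0[..., None], energy[..., None]], axis=-1)
    # Reorder to the (frames, 258, batch) layout quoted in the text.
    return feats.transpose(1, 2, 0)


def multi_head_self_attention(x, num_heads, rng):
    """Plain multi-head self-attention over x of shape (batch, frames, dim).

    Projection weights are random here (assumption, illustration only);
    a trained model would learn them.
    """
    batch, frames, dim = x.shape
    head_dim = dim // num_heads
    wq, wk, wv = (
        rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
    )

    def split(t):
        # (batch, frames, dim) -> (batch, heads, frames, head_dim)
        return t.reshape(batch, frames, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v
    # Merge heads back: (batch, frames, dim)
    return out.transpose(0, 2, 1, 3).reshape(batch, frames, dim)


rng = np.random.default_rng(0)
batch, frames = 2, 5                                # toy sizes (assumption)
speaker_emb = rng.standard_normal((batch, 256))
f0 = rng.uniform(80.0, 300.0, size=(batch, frames))  # Hz, synthetic values
energy = rng.uniform(0.0, 1.0, size=(batch, frames))

features = build_prosody_features(speaker_emb, f0, energy)
# 258 is divisible by 2, so 2 heads of size 129 are used here (assumption).
attended = multi_head_self_attention(features.transpose(2, 0, 1), 2, rng)
print(features.shape, attended.shape)
```

The speaker embedding is constant across frames while the fundamental frequency and energy vary per frame, which is why the embedding is broadcast along the frame axis before concatenation.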