September 2024
In contrast to Variational Autoencoders, Dynamical Variational Autoencoders (DVAEs) learn a sequence of latent states for a time series. Initially, they were implemented using recurrent neural networks (RNNs), which are known for challenging training dynamics and difficulties with long-term dependencies. This led to the recent adoption of Transformers in architectures that stay close to the RNN-based implementations. These implementations still use RNNs as part of the architecture, even though the Transformer can solve the task as the sole building block. Hence, we improve the LigHT-DVAE architecture by removing its dependence on RNNs and cross-attention. Furthermore, we show that a trained LigHT-DVAE ignores its output-to-hidden connections, which allows us to simplify the overall architecture by removing them. We demonstrate the capabilities of the resulting T-DVAE on LibriSpeech and VoiceBank, with improvements in training time, memory consumption, and generative performance.
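As a rough illustration of the structural point (not the paper's implementation, and with all names chosen here for exposition), removing output-to-hidden connections means the decoder conditions each output x_t only on the latent sequence z_{1:t}, with no feedback of previously generated outputs. A minimal NumPy sketch using single-head causal self-attention over the latents:

```python
import numpy as np

def causal_self_attention(h, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence h of shape (T, d)."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    T = h.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # forbid attending to the future
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 5, 4
z = rng.normal(size=(T, d))                     # latent sequence z_{1:T}
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wout = rng.normal(size=(d, d))

# Without output-to-hidden connections, x_t is a function of z_{1:t} only:
h = causal_self_attention(z, Wq, Wk, Wv)
x = h @ Wout                                    # decoder output; no feedback of past x

# Causality check: truncating the latents leaves earlier outputs unchanged.
h3 = causal_self_attention(z[:3], Wq, Wk, Wv)
assert np.allclose(h[:3], h3)
```

The assertion illustrates why this structure is convenient: each time step depends only on past latents, so self-attention with a causal mask suffices and no recurrent state or cross-attention between output and hidden streams is needed.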