Training models end-to-end
In this line of research, we were interested in constructing audio models that operate directly on raw waveforms. The working hypothesis was that we could sidestep the difficulties of processing audio in the Fourier domain by learning a custom transform domain instead.
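One way to picture a learned front-end like this is as a framed projection of the waveform onto a learnable basis, i.e. a strided 1-D convolution standing in for the STFT. The sketch below is illustrative only (the filter count, window, and hop sizes are hypothetical, and the basis is random rather than trained):

```python
import numpy as np

rng = np.random.default_rng(0)
n_filters, win, hop = 8, 16, 8  # hypothetical front-end sizes

# A learnable analysis "basis" that replaces the fixed Fourier basis.
W = rng.normal(size=(n_filters, win))

def analyze(x):
    """Frame the waveform and project each frame onto the learned basis."""
    frames = np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])
    return frames @ W.T  # (n_frames, n_filters), analogous to a spectrogram

x = rng.normal(size=128)   # stand-in for a raw waveform
Z = analyze(x)
print(Z.shape)             # -> (15, 8)
```

During training, `W` would be a parameter of the network, so the "spectrogram" axes themselves are learned rather than fixed.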
Our initial work in this direction was led by Shrikant Venkataramani.
This approach led to a variety of exciting results and was significantly improved thanks to the development of better network architectures (Conv-TasNet, U-Net, etc.), better training schemes (permutation-invariant training), and more. This work also revealed some interesting connections to more classic work on non-negative matrix factorization (NMF). This line of NMF-based research culminated in Shrikant's thesis, where he explored a variety of interesting processing concepts.
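Permutation-invariant training, mentioned above, addresses the fact that a separation network's outputs have no canonical ordering: the loss is computed for every pairing of estimates to references and only the best one is kept. A minimal sketch with an MSE loss (this is the general idea, not the exact loss used in any of the papers):

```python
from itertools import permutations

import numpy as np

def pit_mse(est, ref):
    """Permutation-invariant MSE.

    est, ref: arrays of shape (n_sources, n_samples). Scores every
    ordering of the estimated sources and keeps the best one.
    """
    best = np.inf
    for p in permutations(range(ref.shape[0])):
        best = min(best, np.mean((est[list(p)] - ref) ** 2))
    return best

# If the network emits the sources in the "wrong" order, the loss is still zero.
ref = np.arange(8.0).reshape(2, 4)
print(pit_mse(ref[[1, 0]], ref))  # -> 0.0
```

The exhaustive search is factorial in the number of sources, which is fine for the two- or three-source settings where PIT is typically used.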
Later on, Shrikant, Umut Isik, and I explored more efficient end-to-end formulations by reinterpreting the convolutional weights as a sparse factorization of the DFT. The intuition was that the FFT can be viewed as a factorization of the DFT matrix into a stack of sparse matrices; multiplying by this chain of sparse factors is what yields the well-known \(O(n \log(n))\) runtime. We realized that this stack of matrices could be made learnable, allowing us to learn fast transforms! We demonstrated this approach on speech enhancement, focusing on very small models.
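The underlying identity is the radix-2 Cooley-Tukey factorization: the \(n\)-point DFT matrix equals a product of \(O(\log n)\) sparse butterfly and permutation factors, each with \(O(n)\) nonzeros. The sketch below verifies this numerically; making the nonzero entries of the factors trainable parameters is what turns the construction into a learnable fast transform (this is a generic illustration, not the code from the paper):

```python
import numpy as np

def butterfly(n):
    """Butterfly factor B_n = [[I, W], [I, -W]] with twiddle diagonal W."""
    m = n // 2
    twiddle = np.diag(np.exp(-2j * np.pi * np.arange(m) / n))
    eye = np.eye(m)
    return np.block([[eye, twiddle], [eye, -twiddle]])

def dft_factors(n):
    """Sparse factors whose product is the n-point DFT matrix (n a power of 2)."""
    if n == 1:
        return [np.eye(1, dtype=complex)]
    # Even-odd split permutation: F_n = B_n * (I_2 kron F_{n/2}) * P_n.
    perm = np.concatenate([np.arange(0, n, 2), np.arange(1, n, 2)])
    factors = [butterfly(n)]
    for f in dft_factors(n // 2):
        factors.append(np.kron(np.eye(2), f))  # block-diagonal sub-factors
    factors.append(np.eye(n)[perm])
    return factors

n = 8
F = np.linalg.multi_dot(dft_factors(n))
print(np.allclose(F, np.fft.fft(np.eye(n))))  # -> True
```

Each factor has at most \(2n\) nonzeros, so applying them in sequence costs \(O(n \log n)\); replacing the fixed twiddle values with learned parameters keeps that cost while freeing the transform from the Fourier basis.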
I revisited end-to-end learning with Umut later on. Here, we were again interested in speech enhancement, but this time with discrete latent-space models. Our idea was that you could both enhance speech and compress it at the same time! This work was inspired by VQ-VAE and led to an interesting kind of model which we called a compressor-enhancer.
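The discrete bottleneck at the heart of a VQ-VAE-style model is a nearest-neighbor lookup into a learned codebook: the decoder only ever sees code vectors, and the code indices themselves form the compressed bitstream. A minimal forward-pass sketch (codebook size and dimension are hypothetical, and the straight-through gradient trick used in training is only noted in the comments):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical codebook: 16 learned code vectors of dimension 4.
codebook = rng.normal(size=(16, 4))

def quantize(z):
    """Snap each latent vector to its nearest codebook entry (L2 distance).

    In training, gradients bypass the non-differentiable argmin via the
    straight-through estimator, as in VQ-VAE; only the forward pass is shown.
    """
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(-1)
    return codebook[idx], idx

z = rng.normal(size=(3, 4))  # stand-in for encoder outputs
z_q, idx = quantize(z)       # z_q feeds the decoder; idx is the bitstream
```

Because `idx` is a handful of small integers per frame, transmitting it in place of the waveform is what makes the same model act as both a compressor and an enhancer.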