Adaptive Front-ends for End-to-end Source Separation

Source separation and other audio applications have traditionally relied on short-time Fourier transforms as a front-end, frequency-domain representation step. The lack of a neural-network equivalent of these forward and inverse transforms hinders the implementation of end-to-end learning systems for such applications. We present an auto-encoder neural network that can act as an equivalent to short-time front-end transforms. We demonstrate that the network can learn optimal, real-valued basis functions directly from the raw waveform of a signal, and we further show how it can be used as an adaptive front-end for supervised source separation. In terms of separation performance, these learned transforms significantly outperform their Fourier counterparts. Finally, we propose a novel cost function for end-to-end source separation based on the source-to-distortion ratio.
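To make the idea concrete, here is a minimal sketch of such a learned front-end, assuming PyTorch. The class name `AdaptiveFrontEnd` and the parameters `n_bases`, `window`, and `hop` are illustrative choices, not the authors' exact architecture; the full model in the paper involves additional processing between analysis and synthesis.

```python
import torch
import torch.nn as nn

class AdaptiveFrontEnd(nn.Module):
    """Auto-encoder standing in for a forward/inverse short-time transform.

    Hypothetical sketch: layer sizes and names are illustrative only.
    """
    def __init__(self, n_bases=1024, window=1024, hop=16):
        super().__init__()
        # Analysis: a strided 1-D convolution whose kernels play the role of
        # learned, real-valued basis functions (the "forward transform").
        self.analysis = nn.Conv1d(1, n_bases, kernel_size=window,
                                  stride=hop, bias=False)
        # Synthesis: a transposed convolution acting as the inverse transform,
        # resynthesizing a waveform from the latent coefficients.
        self.synthesis = nn.ConvTranspose1d(n_bases, 1, kernel_size=window,
                                            stride=hop, bias=False)

    def forward(self, x):
        # x: (batch, 1, samples) raw waveform
        coeffs = self.analysis(x)   # time-frequency-like coefficients
        return self.synthesis(coeffs), coeffs
```

For the tied (ADFEO) variant referenced in the samples below, the synthesis kernels can simply reuse the analysis kernels rather than being learned separately.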

Work by Shrikant Venkataramani, Jonah Casebeer and Paris Smaragdis.

Check out our NIPS workshop paper here for more details.
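The source-to-distortion-ratio-based cost mentioned above can be written as a differentiable loss, again assuming PyTorch. This is a generic negative-SDR formulation for illustration and may differ in detail from the one in the paper:

```python
import torch

def neg_sdr_loss(estimate, target, eps=1e-8):
    """Negative SDR in dB; minimizing this maximizes separation quality.

    Generic sketch, not necessarily the paper's exact formulation.
    """
    distortion = target - estimate
    sdr = 10 * torch.log10(
        (target.pow(2).sum(dim=-1) + eps) /
        (distortion.pow(2).sum(dim=-1) + eps)
    )
    return -sdr.mean()
```

Training against a loss of this form keeps the objective aligned with the metric used to evaluate separation quality, rather than optimizing a spectral proxy.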

Here are some samples of the model in action!

  1. Reference: Two overlaid voices from the TIMIT dataset.
  2. Female DFT: The female voice separated by a classic fixed-transform model.
  3. Female ADFE: The female voice separated by our adaptive front-end model.
  4. Female ADFEO: The female voice separated by our adaptive front-end model with orthogonal (tied) front and back ends.
[Audio sample table: 10 example mixtures, each with Reference, Female DFT, Female ADFE, and Female ADFEO clips.]