Jonah Casebeer

Adobe Research
San Francisco

About Me

I am a research scientist in the Music AI group at Adobe Research, focusing on machine learning for signal processing. My research interests center on optimization and representation learning for audio generation, editing, and enhancement. I received both my Ph.D. in Computer Science and B.S. in Statistics and Computer Science from the University of Illinois Urbana-Champaign, where I was advised by Prof. Paris Smaragdis.

Prospective Interns

I am enthusiastic about collaborating with students through research internships at Adobe Research in San Francisco. The internship program spans 3-4 months during the summer. If you are passionate about applying machine learning, deep learning, or signal processing to audio, I encourage you to get in touch. Please send me your CV along with a brief explanation of your research interests. My Adobe page is here.

Publications

2025

Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders

Bralios, Dimitrios and Casebeer, Jonah

Preprint

Abstract: Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework's effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.

[PDF] [arXiv] [Code]

Learning to Upsample and Upmix Audio in the Latent Domain

Bralios, Dimitrios and Smaragdis, Paris and Casebeer, Jonah

Preprint

Abstract: Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically require complex combinations of multi-scale losses and discriminators. Through experiments in bandwidth extension and mono-to-stereo upmixing, we demonstrate computational efficiency gains of up to 100× while maintaining quality comparable to post-processing on raw audio. This work establishes a more efficient paradigm for audio processing pipelines that already incorporate autoencoders, enabling significantly faster and more resource-efficient workflows across various audio tasks.

[PDF] [arXiv] [Demo]

DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Bai, Yatong and Casebeer, Jonah and Sojoudi, Somayeh and Bryan, Nicholas J

Preprint

Abstract: We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Example generations can be found at https://ml-dragon.github.io/web.

[PDF] [arXiv] [Demo]

REGEN: Learning Compact Video Embedding with (Re-) Generative Decoder

Zhang, Yitian and Mai, Long and Mahapatra, Aniruddha and Bourgin, David and Hong, Yicong and Casebeer, Jonah and Liu, Feng and Fu, Yun

Preprint

Abstract: We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.

[PDF] [arXiv] [Demo]

2024

Presto! Distilling Steps and Layers for Accelerating Music Generation

Novack, Zachary and Zhu, Ge and Casebeer, Jonah and McAuley, Julian and Berg-Kirkpatrick, Taylor and Bryan, Nicholas J

Preprint

Abstract: Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at this https URL.

[PDF] [arXiv] [Demo]

Scaling Up Adaptive Filter Optimizers

Casebeer, Jonah and Bryan, Nicholas J and Smaragdis, Paris

Preprint

[PDF] [arXiv] [Code] [Demo]

Meta-AF Echo Cancellation for Improved Keyword Spotting

Casebeer, Jonah and Wu, Junkai and Smaragdis, Paris

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract: Adaptive filters (AFs) are vital for enhancing the performance of downstream tasks, such as speech recognition, sound event detection, and keyword spotting. However, traditional AF design prioritizes isolated signal-level objectives, often overlooking downstream task performance. This can lead to suboptimal performance. Recent research has leveraged meta-learning to automatically learn AF update rules from data, alleviating the need for manual tuning when using simple signal-level objectives. This paper improves the Meta-AF framework by expanding it to support end-to-end training for arbitrary downstream tasks. We focus on classification tasks, where we introduce a novel training methodology that harnesses self-supervision and classifier feedback. We evaluate our approach on the combined task of acoustic echo cancellation and keyword spotting. Our findings demonstrate consistent performance improvements with both pre-trained and joint-trained keyword spotting models across synthetic and real playback. Notably, these improvements come without requiring additional tuning, increased inference-time complexity, or reliance on oracle signal-level training data.

[PDF] [arXiv] [Code]

2023

Meta-Learning for Adaptive Filtering - Ph.D. Thesis

Casebeer, Jonah

Ph.D. Thesis, University of Illinois at Urbana-Champaign

[Website]

Meta-AF: Meta-Learning for Adaptive Filters

Casebeer, Jonah and Bryan, Nicholas J and Smaragdis, Paris

IEEE/ACM Transactions on Audio, Speech, and Language Processing

[PDF] [arXiv] [Code]

2022

Meta-Learning for Adaptive Filters with Higher-Order Frequency Dependencies

Wu, Junkai and Casebeer, Jonah and Bryan, Nicholas J and Smaragdis, Paris

IEEE International Workshop on Acoustic Signal Enhancement (IWAENC)

[PDF] [arXiv] [Code]

NICE-Beam: Neural Integrated Covariance Estimators for Time-Varying Beamformers

Casebeer, Jonah and Donley, Jacob and Wong, Daniel and Xu, Buye and Kumar, Anurag

Preprint

[PDF] [arXiv]

2021

Auto-DSP: Learning to Optimize Acoustic Echo Cancellers

Casebeer, Jonah and Bryan, Nicholas J and Smaragdis, Paris

IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

[PDF] [arXiv] [Code]

Sound Event Detection with Adaptive Frequency Selection

Wang, Zhepei and Casebeer, Jonah and Clemmitt, Adam and Tzinis, Efthymios and Smaragdis, Paris

IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

[PDF] [arXiv] [Code]

Separate but Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data

Tzinis, Efthymios and Casebeer, Jonah and Wang, Zhepei and Smaragdis, Paris

IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

[PDF] [arXiv] [Code]

Enhancing Into the Codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders

Casebeer, Jonah and Vale, Vinjai and Isik, Umut and Valin, Jean-Marc and Giri, Ritwik and Krishnaswamy, Arvindh

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[PDF] [arXiv]

2020

Communication-Cost Aware Microphone Selection For Neural Speech Enhancement with Ad-hoc Microphone Arrays

Casebeer, Jonah and Kaikaus, Jamshed and Smaragdis, Paris

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[PDF] [arXiv] [Code]

Efficient Trainable Front-Ends for Neural Speech Enhancement

Casebeer, Jonah and Isik, Umut and Venkataramani, Shrikant and Krishnaswamy, Arvindh

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[PDF] [arXiv]

2019

Deep Tensor Factorization for Spatially-Aware Scene Decomposition

Casebeer, Jonah and Colomb, Michael and Smaragdis, Paris

IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

[PDF] [arXiv]

Dimensional Analysis of Laughter in Female Conversational Speech

Pietrowicz, Mary and Agurto, Carla and Casebeer, Jonah and Hasegawa-Johnson, Mark and Karahalios, Karrie and Cecchi, Guillermo

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[Website]

Multipath-Enabled Private Audio with Noise

Chaman, Anadi and Liu, Yu-Jeh and Casebeer, Jonah and Dokmanić, Ivan

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[PDF] [arXiv] [Code] [Demo]

Multi-View Networks For Multi-Channel Audio Classification

Casebeer, Jonah and Wang, Zhepei and Smaragdis, Paris

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[PDF] [arXiv] [Demo]

2018

Cocktails, but no party: multipath-enabled private audio

Liu, Yu-Jeh and Casebeer, Jonah and Dokmanić, Ivan

IEEE International Workshop on Acoustic Signal Enhancement (IWAENC)

[PDF] [arXiv] [Code] [Demo]

Verbal Protest Recognition in Children with Autism

Casebeer, J and Sarker, H and Dhuliawala, M and Fay, N and Pietrowicz, M and Das, A

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

[Poster]

Multi-View Networks for Denoising of Arbitrary Numbers of Channels

Casebeer, Jonah and Luc, Brian and Smaragdis, Paris

IEEE International Workshop on Acoustic Signal Enhancement (IWAENC)

[PDF] [arXiv]

Last updated: August 2025