▎TEACHING & LEARNING

Teaching AI to “Listen”: A Breakthrough in Audio-Language Models


Human speech is incredibly nuanced—imbued with emotions, tones, and subtleties that plain text simply cannot capture. Since standard Large Language Models (LLMs) are designed to process only text, a crucial challenge has emerged: How do we teach these models to truly “understand” the rich information embedded in audio?

Traditionally, teaching an LLM to “hear” requires massive amounts of labeled data. For instance, to teach a model that a shouted “Answer me!” conveys anger, humans must manually annotate the audio. While collecting labeled data for every possible scenario is theoretically possible, it is practically infeasible. Worse still, researchers face a dilemma known as catastrophic forgetting: as models learn to process audio, they often lose ground in their core strengths of text processing and reasoning (see Research [1]).

The Solution: The DeSTA Series

Enter the DeSTA series [2], co-developed by student researcher Kehan Lu from Prof. Hung-yi Lee’s speech processing and machine learning lab and the NVIDIA research team. Designed to overcome these limitations, DeSTA introduces a highly effective and scalable training methodology.

The team achieved a major breakthrough in generalization: a model trained on a single audio task was able to adapt successfully to other, previously “unseen” audio tasks. The key innovation is a technique that prevents catastrophic forgetting by using the model’s own self-generated data to expand its capabilities without erasing its original text intelligence.
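To make the idea concrete, here is a minimal sketch (not the authors’ code) of how one self-generated training pair might be built, written in Python. The names backbone_llm, format_metadata_as_text, and build_training_pair are hypothetical placeholders; the actual pipeline is described in Research [2]. The essential point is that the frozen backbone LLM writes its own training target from a text description of the audio, so supervision never drifts away from the model’s native text distribution.

    def format_metadata_as_text(metadata: dict) -> str:
        """Render audio metadata (transcript, emotion, volume, ...) as one line of text."""
        return " | ".join(f"{key}: {value}" for key, value in metadata.items())

    def build_training_pair(audio_clip, metadata, backbone_llm):
        """Build one (audio, target-response) example with no human annotation.

        Hypothetical sketch: the frozen backbone LLM generates the target itself,
        so every target stays inside the model's own text distribution, which is
        what guards against catastrophic forgetting.
        """
        # e.g. "transcript: Answer me! | emotion: angry | volume: shouted"
        description = format_metadata_as_text(metadata)
        prompt = f"Audio description: {description}\nDescribe this audio clip."
        target = backbone_llm.generate(prompt)  # self-generated label
        # The audio-language model is then trained to produce `target` when it
        # receives the raw audio in place of the text description.
        return {"audio": audio_clip, "target": target}

Because the audio model is pulled only toward responses its own backbone would have produced anyway, its text-processing and reasoning abilities survive the audio training.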

The newly released DeSTA2.5-Audio was trained on approximately 7,000 hours of speech data. While this scale is beyond the reach of most university laboratories, the project was made possible through the generous computational support of the NVIDIA Academic Grant Program.

The results are striking. Despite being trained on only a fraction of the data used by industry-scale models, which often rely on hundreds of thousands of hours of speech, DeSTA2.5-Audio outperforms these larger systems across most audio-related benchmarks.

This achievement delivers a powerful message: innovative training strategies can matter more than sheer scale or raw computational resources.

A demonstration of the capabilities of audio-language models.

Performance comparison showing that DeSTA2.5-Audio outperformed existing audio models at the time of its release across multiple benchmarks.

Click or scan the QR code to read Research [1], “Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models”.

Click or scan the QR code to read Research [2], “DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment”.
