Stability AI: Stable Audio 3.0 with open-weight models and generation up to 6 minutes

Editorial illustration: Stability AI Stable Audio 3 open-weight model family with 6-minute generation and inpainting support

Stability AI released Stable Audio 3.0 on 20 May 2026: a family of four audio models (Small SFX, Small, Medium, Large), three of which are open-weight and available on Hugging Face. The key advance is generation of audio up to 6 minutes long (the previous version topped out at 47 seconds), along with support for audio inpainting, causal continuation, and LoRA fine-tuning. Stability AI says all models were trained exclusively on licensed data.

This article was generated using artificial intelligence from primary sources.

Stability AI released Stable Audio 3.0 on 20 May 2026: a family of four generative audio models (Small SFX, Small, Medium, Large), three of which are open-weight and immediately available on Hugging Face. The most important change from previous versions is length: the models can now generate audio up to 6 minutes long, whereas Stable Audio 2 had a maximum of 47 seconds.

What does 6-minute generation enable?

The leap from 47 seconds to 6 minutes opens up applications for which the earlier model's output was too short: soundtracks for short films, podcast intro/outro production, in-game music without looping, educational content, and ambient audio compositions for VR/AR applications. The technical foundation is a new diffusion transformer with time-conditioned attention that maintains structural coherence over long time spans. Losing that coherence was previously the main reason generated audio would “drift”.

What is audio inpainting?

Stable Audio 3 supports three modes of audio inpainting: single-segment (fill in one section of an existing recording), multi-segment (fill in several sections simultaneously), and causal continuation (extend an existing recording in a natural sequence). This moves the model closer to the workflow of tools such as Adobe Premiere Pro and iZotope RX, which are used to assemble real audio projects, and away from being only a “text-to-audio” demonstrator.
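The three modes differ only in which parts of the timeline the model is asked to regenerate, which can be pictured as a keep/regenerate mask over time. The sketch below is illustrative only: `build_mask`, `continuation_mask`, and the 10-frames-per-second resolution are hypothetical names and values chosen for this example, not part of any Stable Audio API.

```python
from typing import List, Tuple

# (start_seconds, end_seconds) of a region the model should regenerate.
Segment = Tuple[float, float]


def build_mask(total_seconds: float,
               regenerate: List[Segment],
               frames_per_second: int = 10) -> List[bool]:
    """Return one flag per frame: True = regenerate, False = keep original.

    The frame rate is an assumed value used only to discretise the timeline.
    """
    n_frames = round(total_seconds * frames_per_second)
    mask = [False] * n_frames
    for start, end in regenerate:
        if not 0 <= start < end <= total_seconds:
            raise ValueError(f"segment {(start, end)} is outside the recording")
        first = round(start * frames_per_second)
        last = round(end * frames_per_second)
        for i in range(first, last):
            mask[i] = True
    return mask


def continuation_mask(existing_seconds: float,
                      extra_seconds: float,
                      frames_per_second: int = 10) -> List[bool]:
    """Causal continuation: keep everything recorded, regenerate only the tail."""
    total = existing_seconds + extra_seconds
    return build_mask(total, [(existing_seconds, total)], frames_per_second)


# Single-segment: replace seconds 2 to 4 of a 10-second clip.
single = build_mask(10.0, [(2.0, 4.0)])
# Multi-segment: replace two separate regions in one pass.
multi = build_mask(10.0, [(1.0, 2.0), (5.0, 7.0)])
# Causal continuation: extend a 10-second clip by 5 seconds.
cont = continuation_mask(10.0, 5.0)
```

Seen this way, single-segment and multi-segment inpainting are the same operation with one or several regions, and causal continuation is the special case where the only regenerated region lies after the end of the existing audio.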

How was the model trained and what is the licence?

Stability AI emphasises that all models were trained exclusively on licensed data, which is aimed at the key legal obstacle that has troubled open audio models. Over the past two years the company faced multiple lawsuits from Getty Images and music publishers over the use of protected data in training. Stable Audio 3 is an attempt to address those concerns.

The licence permits free commercial use for organisations with revenue up to one million US dollars per year. An Enterprise licence is required above that threshold. The model supports LoRA fine-tuning, meaning studios can adapt models to their own sound catalogues without retraining from scratch.
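To see why LoRA avoids retraining from scratch, it helps to look at the generic formulation: the original weight matrix W stays frozen and only a low-rank update is trained, giving W' = W + (alpha / r) · B · A. The sketch below shows that generic technique in plain Python. The article does not say which layers, rank, or scaling Stable Audio 3 uses, so the matrix sizes, `rank`, and `alpha` here are purely illustrative assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Trainable parameters: full fine-tuning vs. a LoRA adapter for one layer."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_in + d_out)   # only B (d_out x r) and A (r x d_in)
    return full, lora


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, kept dependency-free for the example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_lora(W: Matrix, A: Matrix, B: Matrix, alpha: float, rank: int) -> Matrix:
    """Merge the adapter into the frozen weights: W' = W + (alpha / rank) * B @ A."""
    delta = matmul(B, A)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Illustrative sizes only: one 1024 x 1024 layer with a rank-8 adapter.
full, lora = lora_param_counts(1024, 1024, 8)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")

# Tiny worked example with a rank-1 adapter on a 2 x 2 layer.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d_out x r
A = [[3.0, 4.0]]     # r x d_in
merged = apply_lora(W, A, B, alpha=2.0, rank=1)
```

Under these assumed sizes the adapter trains a small fraction of the layer's parameters, which is what makes adapting a model to a studio's own sound catalogue feasible without a full training run.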

What does this mean for the open AI audio ecosystem?

Competitors such as Meta AudioCraft, Google MusicLM, and OpenAI Jukebox are mostly closed, restricted in commercial use, or legally problematic. With three of four models on Hugging Face and licensed training data, Stability AI offers a production-oriented open-weight path for audio generation that was not previously available.

Frequently Asked Questions

Which models are open-weight?
Three of the four models in the family (Small SFX, Small, and Medium) are available with open weights on Hugging Face. The Large model is available as a hosted API and through an Enterprise licence, while the smaller models are suited to local use.

What is audio inpainting?
Audio inpainting is the model's ability to fill in or replace a section of an existing audio recording rather than generating a new one from scratch. Stable Audio 3 supports single-segment inpainting (one section), multi-segment inpainting (multiple sections), and causal continuation (extending an existing recording).

What is the licence?
The Stable Audio 3 licence permits free commercial use for organisations with revenue up to one million US dollars per year. An Enterprise licence is required above that threshold. Stability AI states that all models are trained exclusively on licensed data, which addresses the key legal obstacle that has troubled open audio models.