Add ACE-Step pipeline for text-to-music generation#13095

Open
ChuxiJ wants to merge 3 commits into huggingface:main from ChuxiJ:add-ace-step-pipeline

Conversation

@ChuxiJ ChuxiJ commented Feb 7, 2026

What does this PR do?

This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.

New Components

  • AceStepDiTModel (src/diffusers/models/transformers/ace_step_transformer.py): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding window attention, and flow matching for denoising audio latents. Includes custom components: AceStepRMSNorm, AceStepRotaryEmbedding, AceStepMLP, AceStepTimestepEmbedding, AceStepAttention, AceStepEncoderLayer, and AceStepDiTLayer.

  • AceStepConditionEncoder (src/diffusers/pipelines/ace_step/modeling_ace_step.py): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes AceStepLyricEncoder and AceStepTimbreEncoder sub-modules.

  • AceStepPipeline (src/diffusers/pipelines/ace_step/pipeline_ace_step.py): The main pipeline supporting 6 task types:

    • text2music — generate music from text and lyrics
    • cover — generate from audio semantic codes or with timbre transfer via reference audio
    • repaint — regenerate a time region within existing audio
    • extract — extract a specific track (vocals, drums, etc.) from audio
    • lego — generate a specific track given audio context
    • complete — complete audio with additional tracks
  • Conversion script (scripts/convert_ace_step_to_diffusers.py): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.
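To give a flavor of one of the building blocks listed above: the DiT's attention layers use rotary position embeddings (RoPE), which rotate pairs of channels by a position-dependent angle so that relative positions show up directly in attention dot products. The NumPy sketch below illustrates the general mechanism only; it is not the `AceStepRotaryEmbedding` implementation, whose base frequency and channel layout may differ.

```python
import numpy as np

def rotary_embedding(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings (RoPE) to a (seq_len, dim) array.

    Each (x1, x2) channel pair is rotated by angle t * inv_freq[i], so the
    rotation is a pure phase shift: norms are preserved, and relative
    positions are encoded in the dot products attention computes.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    half = dim // 2
    # Per-pair inverse frequencies: 1 / base^(i / half)
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    # angles[t, i] = t * inv_freq[i]
    angles = np.outer(np.arange(seq_len), inv_freq)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each pair; position 0 (angle 0) is left unchanged
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```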

Key Features

  • Multi-task support: 6 task types with automatic instruction routing via _get_task_instruction
  • Music metadata conditioning: Optional bpm, keyscale, timesignature parameters formatted into the SFT prompt template
  • Audio-to-audio tasks: Source audio (src_audio) and reference audio (reference_audio) inputs with VAE encoding
  • Tiled VAE encode/decode: Memory-efficient chunked encoding (_tiled_encode) and decoding (_tiled_decode) for long audio
  • Classifier-free guidance (CFG): Dual forward pass with configurable guidance_scale, cfg_interval_start, and cfg_interval_end (primarily for base/SFT models; turbo models have guidance distilled into weights)
  • Audio cover strength blending: Smooth interpolation between cover-conditioned and text-only-conditioned outputs via audio_cover_strength
  • Audio code parsing: _parse_audio_code_string extracts semantic codes from <|audio_code_N|> tokens for cover tasks
  • Chunk masking: _build_chunk_mask creates time-region masks for repaint/lego tasks
  • Anti-clipping normalization: Post-decode normalization to prevent audio clipping
  • Multi-language lyrics: 50+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, etc.
  • Variable-length generation: Configurable duration from 10 seconds to 10+ minutes
  • Custom timestep schedules: Pre-defined shifted schedules for shift=1.0/2.0/3.0, or user-provided timesteps
  • Turbo model variant: Optimized for 8 inference steps with shift=3.0
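The tiled encode/decode feature above works by splitting a long latent along time into overlapping chunks, decoding each chunk, and crossfading the seams so chunk boundaries are inaudible. The sketch below is a simplified stand-in for `_tiled_decode`, not its actual implementation: the `decode` callback, `chunk`, and `overlap` parameters are illustrative, and the real method's windowing may differ.

```python
import numpy as np

def tiled_decode(latents: np.ndarray, decode, chunk: int = 256, overlap: int = 32) -> np.ndarray:
    """Decode a long (dim, T) latent in overlapping chunks of `chunk` frames.

    `decode` maps a latent chunk to audio of shape (channels, k * frames)
    for some fixed upsampling factor k. Adjacent chunks overlap by
    `overlap` latent frames; a linear crossfade blends each seam.
    """
    dim, total = latents.shape
    step = chunk - overlap
    pieces = []
    for start in range(0, total, step):
        pieces.append(decode(latents[:, start:start + chunk]))
        if start + chunk >= total:
            break
    if len(pieces) == 1:
        return pieces[0]
    # Infer the latent-frame -> audio-sample upsampling factor from chunk 1
    k = pieces[0].shape[1] // min(chunk, total)
    ov = overlap * k
    fade_in = np.linspace(0.0, 1.0, ov)
    out = pieces[0].copy()
    for piece in pieces[1:]:
        # Crossfade the overlapping samples, then append the remainder
        out[:, -ov:] = out[:, -ov:] * (1.0 - fade_in) + piece[:, :ov] * fade_in
        out = np.concatenate([out, piece[:, ov:]], axis=1)
    return out
```

Only one chunk at a time needs to pass through the decoder, which is what makes the approach memory-efficient for 10+ minute generations.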

Architecture

ACE-Step 1.5 comprises three main components:

  1. Oobleck autoencoder (VAE): Compresses 48kHz stereo waveforms into 25Hz latent representations
  2. Qwen3-Embedding-0.6B text encoder: Encodes text prompts and lyrics for conditioning
  3. Diffusion Transformer (DiT): Denoises audio latents using flow matching with an Euler ODE solver
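As a sketch of how component 3 samples: flow matching integrates an ODE dx/dt = v(x, t) from noise (t = 1) to clean latents (t = 0) with Euler steps. The snippet below assumes the shift warp `shift * t / (1 + (shift - 1) * t)` commonly used for flow-matching schedules; the pipeline's pre-defined shift=1.0/2.0/3.0 schedules are assumed, not confirmed, to take this form.

```python
import numpy as np

def shifted_timesteps(num_steps: int, shift: float = 3.0) -> np.ndarray:
    """Timestep schedule from 1 -> 0, warped by a shift factor.

    Larger shifts spend more of the budget near t = 1 (high noise),
    which suits few-step (turbo) sampling.
    """
    t = np.linspace(1.0, 0.0, num_steps + 1)
    return shift * t / (1.0 + (shift - 1.0) * t)

def euler_sample(velocity_fn, x: np.ndarray, timesteps: np.ndarray) -> np.ndarray:
    """Integrate dx/dt = v(x, t) with explicit Euler steps."""
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x
```

With straight-line probability paths the target velocity is `noise - x0`, so a perfect velocity model recovers `x0` exactly in any number of Euler steps; in practice the DiT predicts this velocity from the noisy latent and conditioning.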

Tests

  • Pipeline tests (tests/pipelines/ace_step/test_ace_step.py):
    • AceStepDiTModelTests — forward shape, return dict, gradient checkpointing
    • AceStepConditionEncoderTests — forward shape, save/load config
    • AceStepPipelineFastTests (extends PipelineTesterMixin) — 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more
  • Model tests (tests/models/transformers/test_models_transformer_ace_step.py):
    • TestAceStepDiTModel (extends ModelTesterMixin) — forward pass, dtype inference, save/load, determinism
    • TestAceStepDiTModelMemory (extends MemoryTesterMixin) — layerwise casting, group offloading
    • TestAceStepDiTModelTraining (extends TrainingTesterMixin) — training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).

Documentation

  • docs/source/en/api/pipelines/ace_step.md — Pipeline API documentation with usage examples
  • docs/source/en/api/models/ace_step_transformer.md — Transformer model documentation

Usage

```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained("ACE-Step/ACE-Step-v1-5-turbo", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
).audios

# `audios` is (batch, channels, samples); cast bfloat16 to float before
# converting to NumPy, and transpose so soundfile gets (frames, channels)
sf.write("output.wav", audio[0].float().cpu().numpy().T, 48000)
```

Before submitting

Who can review?

References

  • Original codebase: [ACE-Step/ACE-Step](https://github.com/ACE-Step/ACE-Step)
  • Paper: [ACE-Step: A Step Towards Music Generation Foundation Model](https://github.com/ACE-Step/ACE-Step)
@ChuxiJ ChuxiJ marked this pull request as draft February 7, 2026 11:38
chuxij added 2 commits February 7, 2026 12:51
- Add gradient checkpointing test for AceStepDiTModel
- Add save/load config test for AceStepConditionEncoder
- Enhance pipeline tests with PipelineTesterMixin
- Update documentation to reflect ACE-Step 1.5
- Add comprehensive transformer model tests
- Improve test coverage and code quality
- Add support for multiple task types: text2music, repaint, cover, extract, lego, complete
- Add audio normalization and preprocessing utilities
- Add tiled encode/decode for handling long audio sequences
- Add reference audio support for timbre transfer in cover task
- Add repaint functionality for regenerating audio sections
- Add metadata handling (BPM, keyscale, timesignature)
- Add audio code parsing and chunk mask building utilities
- Improve documentation with multi-task usage examples
@ChuxiJ ChuxiJ marked this pull request as ready for review February 7, 2026 14:29
@dg845 dg845 requested review from dg845 and yiyixuxu February 8, 2026 03:21
