Add ACE-Step pipeline for text-to-music generation#13095

Open
ChuxiJ wants to merge 3 commits into huggingface:main from ChuxiJ:add-ace-step-pipeline

Conversation

@ChuxiJ ChuxiJ commented Feb 7, 2026

What does this PR do?

This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.

New Components

  • AceStepDiTModel (src/diffusers/models/transformers/ace_step_transformer.py): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding window attention, and flow matching for denoising audio latents. Includes custom components: AceStepRMSNorm, AceStepRotaryEmbedding, AceStepMLP, AceStepTimestepEmbedding, AceStepAttention, AceStepEncoderLayer, and AceStepDiTLayer.

  • AceStepConditionEncoder (src/diffusers/pipelines/ace_step/modeling_ace_step.py): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes AceStepLyricEncoder and AceStepTimbreEncoder sub-modules.

  • AceStepPipeline (src/diffusers/pipelines/ace_step/pipeline_ace_step.py): The main pipeline supporting 6 task types:

    • text2music — generate music from text and lyrics
    • cover — generate from audio semantic codes or with timbre transfer via reference audio
    • repaint — regenerate a time region within existing audio
    • extract — extract a specific track (vocals, drums, etc.) from audio
    • lego — generate a specific track given audio context
    • complete — complete audio with additional tracks
  • Conversion script (scripts/convert_ace_step_to_diffusers.py): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.
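To give a flavor of one of the building blocks listed above: the DiT's attention layers use rotary position embeddings (RoPE), which rotate pairs of channels by a position-dependent angle so that relative positions show up directly in attention dot products. The NumPy sketch below illustrates the general mechanism only; it is not the `AceStepRotaryEmbedding` implementation, whose base frequency and channel layout may differ.

```python
import numpy as np

def rotary_embedding(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings (RoPE) to a (seq_len, dim) array.

    Each (x1, x2) channel pair is rotated by angle t * inv_freq[i], so the
    rotation is a pure phase shift: norms are preserved, and relative
    positions are encoded in the dot products attention computes.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    half = dim // 2
    # Per-pair inverse frequencies: 1 / base^(i / half)
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    # angles[t, i] = t * inv_freq[i]
    angles = np.outer(np.arange(seq_len), inv_freq)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each pair; position 0 (angle 0) is left unchanged
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```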

Key Features

  • Multi-task support: 6 task types with automatic instruction routing via _get_task_instruction
  • Music metadata conditioning: Optional bpm, keyscale, timesignature parameters formatted into the SFT prompt template
  • Audio-to-audio tasks: Source audio (src_audio) and reference audio (reference_audio) inputs with VAE encoding
  • Tiled VAE encode/decode: Memory-efficient chunked encoding (_tiled_encode) and decoding (_tiled_decode) for long audio
  • Classifier-free guidance (CFG): Dual forward pass with configurable guidance_scale, cfg_interval_start, and cfg_interval_end (primarily for base/SFT models; turbo models have guidance distilled into weights)
  • Audio cover strength blending: Smooth interpolation between cover-conditioned and text-only-conditioned outputs via audio_cover_strength
  • Audio code parsing: _parse_audio_code_string extracts semantic codes from <|audio_code_N|> tokens for cover tasks
  • Chunk masking: _build_chunk_mask creates time-region masks for repaint/lego tasks
  • Anti-clipping normalization: Post-decode normalization to prevent audio clipping
  • Multi-language lyrics: 50+ languages including English, Chinese, Japanese, Korean, French, German, Spanish, etc.
  • Variable-length generation: Configurable duration from 10 seconds to 10+ minutes
  • Custom timestep schedules: Pre-defined shifted schedules for shift=1.0/2.0/3.0, or user-provided timesteps
  • Turbo model variant: Optimized for 8 inference steps with shift=3.0
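The tiled encode/decode feature above works by splitting a long latent along time into overlapping chunks, decoding each chunk, and crossfading the seams so chunk boundaries are inaudible. The sketch below is a simplified stand-in for `_tiled_decode`, not its actual implementation: the `decode` callback, `chunk`, and `overlap` parameters are illustrative, and the real method's windowing may differ.

```python
import numpy as np

def tiled_decode(latents: np.ndarray, decode, chunk: int = 256, overlap: int = 32) -> np.ndarray:
    """Decode a long (dim, T) latent in overlapping chunks of `chunk` frames.

    `decode` maps a latent chunk to audio of shape (channels, k * frames)
    for some fixed upsampling factor k. Adjacent chunks overlap by
    `overlap` latent frames; a linear crossfade blends each seam.
    """
    dim, total = latents.shape
    step = chunk - overlap
    pieces = []
    for start in range(0, total, step):
        pieces.append(decode(latents[:, start:start + chunk]))
        if start + chunk >= total:
            break
    if len(pieces) == 1:
        return pieces[0]
    # Infer the latent-frame -> audio-sample upsampling factor from chunk 1
    k = pieces[0].shape[1] // min(chunk, total)
    ov = overlap * k
    fade_in = np.linspace(0.0, 1.0, ov)
    out = pieces[0].copy()
    for piece in pieces[1:]:
        # Crossfade the overlapping samples, then append the remainder
        out[:, -ov:] = out[:, -ov:] * (1.0 - fade_in) + piece[:, :ov] * fade_in
        out = np.concatenate([out, piece[:, ov:]], axis=1)
    return out
```

Only one chunk at a time needs to pass through the decoder, which is what makes the approach memory-efficient for 10+ minute generations.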

Architecture

ACE-Step 1.5 comprises three main components:

  1. Oobleck autoencoder (VAE): Compresses 48kHz stereo waveforms into 25Hz latent representations
  2. Qwen3-Embedding-0.6B text encoder: Encodes text prompts and lyrics for conditioning
  3. Diffusion Transformer (DiT): Denoises audio latents using flow matching with an Euler ODE solver
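As a sketch of how component 3 samples: flow matching integrates an ODE dx/dt = v(x, t) from noise (t = 1) to clean latents (t = 0) with Euler steps. The snippet below assumes the shift warp `shift * t / (1 + (shift - 1) * t)` commonly used for flow-matching schedules; the pipeline's pre-defined shift=1.0/2.0/3.0 schedules are assumed, not confirmed, to take this form.

```python
import numpy as np

def shifted_timesteps(num_steps: int, shift: float = 3.0) -> np.ndarray:
    """Timestep schedule from 1 -> 0, warped by a shift factor.

    Larger shifts spend more of the budget near t = 1 (high noise),
    which suits few-step (turbo) sampling.
    """
    t = np.linspace(1.0, 0.0, num_steps + 1)
    return shift * t / (1.0 + (shift - 1.0) * t)

def euler_sample(velocity_fn, x: np.ndarray, timesteps: np.ndarray) -> np.ndarray:
    """Integrate dx/dt = v(x, t) with explicit Euler steps."""
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x
```

With straight-line probability paths the target velocity is `noise - x0`, so a perfect velocity model recovers `x0` exactly in any number of Euler steps; in practice the DiT predicts this velocity from the noisy latent and conditioning.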

Tests

  • Pipeline tests (tests/pipelines/ace_step/test_ace_step.py):
    • AceStepDiTModelTests — forward shape, return dict, gradient checkpointing
    • AceStepConditionEncoderTests — forward shape, save/load config
    • AceStepPipelineFastTests (extends PipelineTesterMixin) — 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more
  • Model tests (tests/models/transformers/test_models_transformer_ace_step.py):
    • TestAceStepDiTModel (extends ModelTesterMixin) — forward pass, dtype inference, save/load, determinism
    • TestAceStepDiTModelMemory (extends MemoryTesterMixin) — layerwise casting, group offloading
    • TestAceStepDiTModelTraining (extends TrainingTesterMixin) — training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).

Documentation

  • docs/source/en/api/pipelines/ace_step.md — Pipeline API documentation with usage examples
  • docs/source/en/api/models/ace_step_transformer.md — Transformer model documentation

Usage

```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained("ACE-Step/ACE-Step-v1-5-turbo", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Text-to-music generation
audio = pipe(
    prompt="A beautiful piano piece with soft melodies",
    lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    bpm=120,
    keyscale="C major",
).audios

# `audios` is (batch, channels, samples); cast bfloat16 to float before
# converting to NumPy, and transpose so soundfile gets (frames, channels)
sf.write("output.wav", audio[0].float().cpu().numpy().T, 48000)
```

Before submitting

Who can review?

References

  • Original codebase: [ACE-Step/ACE-Step](https://github.com/ACE-Step/ACE-Step)
  • Paper: [ACE-Step: A Step Towards Music Generation Foundation Model](https://github.com/ACE-Step/ACE-Step)
@ChuxiJ ChuxiJ marked this pull request as draft February 7, 2026 11:38
chuxij added 2 commits February 7, 2026 12:51
- Add gradient checkpointing test for AceStepDiTModel
- Add save/load config test for AceStepConditionEncoder
- Enhance pipeline tests with PipelineTesterMixin
- Update documentation to reflect ACE-Step 1.5
- Add comprehensive transformer model tests
- Improve test coverage and code quality
- Add support for multiple task types: text2music, repaint, cover, extract, lego, complete
- Add audio normalization and preprocessing utilities
- Add tiled encode/decode for handling long audio sequences
- Add reference audio support for timbre transfer in cover task
- Add repaint functionality for regenerating audio sections
- Add metadata handling (BPM, keyscale, timesignature)
- Add audio code parsing and chunk mask building utilities
- Improve documentation with multi-task usage examples
@ChuxiJ ChuxiJ marked this pull request as ready for review February 7, 2026 14:29
@dg845 dg845 requested review from dg845 and yiyixuxu February 8, 2026 03:21
