Add ACE-Step pipeline for text-to-music generation #13095
Open

ChuxiJ wants to merge 3 commits into huggingface:main
Conversation
## What does this PR do?

This PR adds support for the ACE-Step pipeline, a text-to-music generation model that produces high-quality, variable-length stereo music at 48 kHz from text prompts and optional lyrics.

The implementation includes:

- **AceStepDiTModel**: A Diffusion Transformer (DiT) model that operates in the latent space using flow matching
- **AceStepPipeline**: The main pipeline for text-to-music generation with support for lyrics conditioning
- **AceStepConditionEncoder**: Condition encoder that combines text, lyric, and timbre embeddings
- **Conversion script**: Script to convert ACE-Step checkpoint weights to Diffusers format
- **Comprehensive tests**: Full test coverage for the pipeline and models
- **Documentation**: API documentation for the pipeline and transformer model

## Key Features

- Text-to-music generation with optional lyrics support
- Multi-language lyrics support (English, Chinese, Japanese, Korean, and more)
- Flow matching with custom timestep schedules
- Turbo model variant optimized for 8 inference steps
- Variable-length audio generation (configurable duration)

## Technical Details

ACE-Step comprises three main components:

1. **Oobleck autoencoder (VAE)**: Compresses waveforms into 25 Hz latent representations
2. **Qwen3-based text encoder**: Encodes text prompts and lyrics for conditioning
3. **Diffusion Transformer (DiT)**: Operates in the latent space using flow matching

The pipeline supports multiple shift parameters (1.0, 2.0, 3.0) for different timestep schedules, with the turbo model designed for 8 inference steps using `shift=3.0`.
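As a rough illustration of what a shift parameter does to a flow-matching schedule, here is a minimal sketch using the common SD3-style time warp `sigma' = s * sigma / (1 + (s - 1) * sigma). The exact schedule ACE-Step uses may differ; this is only to show why a larger shift keeps more of the step budget at high noise levels.

```python
import numpy as np

def shifted_sigmas(num_steps: int, shift: float = 3.0) -> np.ndarray:
    # Linear schedule from 1.0 down to 1/num_steps, warped by the
    # SD3-style shift: sigma' = s * sigma / (1 + (s - 1) * sigma).
    # This is an illustrative sketch, not ACE-Step's actual schedule.
    sigmas = np.linspace(1.0, 1.0 / num_steps, num_steps)
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

# An 8-step, shift=3.0 schedule in the spirit of the turbo setting
schedule = shifted_sigmas(8, shift=3.0)
print(schedule.round(3))
```

With `shift=1.0` this reduces to the plain linear schedule; larger shifts concentrate steps near `sigma = 1`.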
## Testing

All tests pass:

- Model forward pass tests
- Pipeline basic functionality tests
- Batch processing tests
- Latent output tests
- Return dict tests

Run tests with:

```bash
pytest tests/pipelines/ace_step/test_ace_step.py -v
```

## Code Quality

- Code formatted with `make style`
- Quality checks passed with `make quality`
- All tests passing

## References

- Original codebase: [ACE-Step/ACE-Step](https://github.com/ACE-Step/ACE-Step)
- Paper: [ACE-Step: A Step Towards Music Generation Foundation Model](https://github.com/ACE-Step/ACE-Step)
Added 2 commits on February 7, 2026 at 12:51
- Add gradient checkpointing test for AceStepDiTModel
- Add save/load config test for AceStepConditionEncoder
- Enhance pipeline tests with PipelineTesterMixin
- Update documentation to reflect ACE-Step 1.5
- Add comprehensive transformer model tests
- Improve test coverage and code quality
- Add support for multiple task types: text2music, repaint, cover, extract, lego, complete
- Add audio normalization and preprocessing utilities
- Add tiled encode/decode for handling long audio sequences
- Add reference audio support for timbre transfer in cover task
- Add repaint functionality for regenerating audio sections
- Add metadata handling (BPM, keyscale, timesignature)
- Add audio code parsing and chunk mask building utilities
- Improve documentation with multi-task usage examples
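To make the repaint masking idea concrete, here is a hypothetical sketch of a time-region mask over latent frames. The function name and shape here are illustrative, not the PR's actual `_build_chunk_mask` implementation; the 25 Hz latent rate matches the VAE compression described in this PR.

```python
LATENT_RATE_HZ = 25  # the VAE compresses 48 kHz audio to 25 Hz latents

def build_repaint_mask(total_seconds: float, start_s: float, end_s: float) -> list[float]:
    """Hypothetical sketch: 1.0 marks latent frames to regenerate,
    0.0 marks frames kept from the source audio."""
    num_frames = int(total_seconds * LATENT_RATE_HZ)
    start = max(0, int(start_s * LATENT_RATE_HZ))
    end = min(num_frames, int(end_s * LATENT_RATE_HZ))
    return [1.0 if start <= i < end else 0.0 for i in range(num_frames)]

# Repaint seconds 2-5 of a 10-second clip: 250 latent frames, 75 masked
mask = build_repaint_mask(10.0, 2.0, 5.0)
```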
## What does this PR do?
This PR adds the ACE-Step 1.5 pipeline to Diffusers — a text-to-music generation model that produces high-quality stereo music with lyrics at 48kHz from text prompts.
## New Components

- **AceStepDiTModel** (`src/diffusers/models/transformers/ace_step_transformer.py`): A Diffusion Transformer (DiT) model with RoPE, GQA, sliding window attention, and flow matching for denoising audio latents. Includes custom components: `AceStepRMSNorm`, `AceStepRotaryEmbedding`, `AceStepMLP`, `AceStepTimestepEmbedding`, `AceStepAttention`, `AceStepEncoderLayer`, and `AceStepDiTLayer`.
- **AceStepConditionEncoder** (`src/diffusers/pipelines/ace_step/modeling_ace_step.py`): Condition encoder that fuses text, lyric, and timbre embeddings into a unified cross-attention conditioning signal. Includes `AceStepLyricEncoder` and `AceStepTimbreEncoder` sub-modules.
- **AceStepPipeline** (`src/diffusers/pipelines/ace_step/pipeline_ace_step.py`): The main pipeline supporting 6 task types:
  - `text2music`: generate music from text and lyrics
  - `cover`: generate from audio semantic codes or with timbre transfer via reference audio
  - `repaint`: regenerate a time region within existing audio
  - `extract`: extract a specific track (vocals, drums, etc.) from audio
  - `lego`: generate a specific track given audio context
  - `complete`: complete audio with additional tracks
- **Conversion script** (`scripts/convert_ace_step_to_diffusers.py`): Converts original ACE-Step 1.5 checkpoint weights to Diffusers format.

## Key Features
- Task instructions per task type via `_get_task_instruction`
- Metadata conditioning: `bpm`, `keyscale`, `timesignature` parameters formatted into the SFT prompt template
- Source audio (`src_audio`) and reference audio (`reference_audio`) inputs with VAE encoding
- Tiled encoding (`_tiled_encode`) and decoding (`_tiled_decode`) for long audio
- Classifier-free guidance via `guidance_scale`, `cfg_interval_start`, and `cfg_interval_end` (primarily for base/SFT models; turbo models have guidance distilled into weights)
- `audio_cover_strength` control for cover tasks
- `_parse_audio_code_string` extracts semantic codes from `<|audio_code_N|>` tokens for cover tasks
- `_build_chunk_mask` creates time-region masks for repaint/lego tasks
- Custom `timesteps` schedules; the turbo model targets 8 inference steps with `shift=3.0`

## Architecture
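The tiled decoding idea (process long latent sequences in overlapping windows and blend the overlaps) can be sketched as follows. The function name, tile sizes, and overlap-averaging blend are placeholders standing in for the pipeline's `_tiled_decode`, which would crossfade decoded waveform chunks rather than average raw values.

```python
def tiled_decode(latents, decode_fn, tile_len=512, overlap=64):
    """Sketch of overlap-averaged tiled decoding. `decode_fn` stands in
    for the VAE decoder; this is illustrative, not the PR's code."""
    n = len(latents)
    if n <= tile_len:
        return decode_fn(latents)
    step = tile_len - overlap
    out = [0.0] * n
    weight = [0.0] * n
    start = 0
    while start < n:
        end = min(start + tile_len, n)
        for i, v in enumerate(decode_fn(latents[start:end])):
            out[start + i] += v       # accumulate overlapping tiles
            weight[start + i] += 1.0  # count contributions per frame
        if end == n:
            break
        start += step
    return [o / w for o, w in zip(out, weight)]
```

With an identity `decode_fn`, the output equals the input because overlapping tiles agree exactly; a real decoder benefits from the overlap, which hides boundary artifacts between tiles.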
ACE-Step 1.5 comprises three main components:

1. **Oobleck autoencoder (VAE)**: Compresses waveforms into 25 Hz latent representations
2. **Qwen3-based text encoder**: Encodes text prompts and lyrics for conditioning
3. **Diffusion Transformer (DiT)**: Operates in the latent space using flow matching
## Tests

Pipeline tests (`tests/pipelines/ace_step/test_ace_step.py`):

- `AceStepDiTModelTests`: forward shape, return dict, gradient checkpointing
- `AceStepConditionEncoderTests`: forward shape, save/load config
- `AceStepPipelineFastTests` (extends `PipelineTesterMixin`): 39 tests covering basic generation, batch processing, latent output, save/load, float16 inference, CPU/model offloading, encode_prompt, prepare_latents, timestep_schedule, format_prompt, and more

Transformer tests (`tests/models/transformers/test_models_transformer_ace_step.py`):

- `TestAceStepDiTModel` (extends `ModelTesterMixin`): forward pass, dtype inference, save/load, determinism
- `TestAceStepDiTModelMemory` (extends `MemoryTesterMixin`): layerwise casting, group offloading
- `TestAceStepDiTModelTraining` (extends `TrainingTesterMixin`): training, EMA, gradient checkpointing, mixed precision

All 70 tests pass (39 pipeline + 31 model).
## Documentation

- `docs/source/en/api/pipelines/ace_step.md`: Pipeline API documentation with usage examples
- `docs/source/en/api/models/ace_step_transformer.md`: Transformer model documentation

## Usage
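A hypothetical usage sketch for the `text2music` task. The checkpoint id, the `lyrics` argument name, and the `.audios` output attribute are assumptions based on this PR's description and analogous Diffusers audio pipelines; check the merged documentation for the exact API.

```python
import torch
from diffusers import AceStepPipeline

# Placeholder checkpoint id -- consult the model card once the PR is merged.
pipe = AceStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-1.5", torch_dtype=torch.bfloat16
).to("cuda")

# Turbo variant: 8 inference steps with shift=3.0 per the PR description.
output = pipe(
    prompt="upbeat electronic dance track, energetic synths",
    lyrics="[verse]\nWe rise with the morning light",
    num_inference_steps=8,
)
audio = output.audios[0]  # stereo waveform at 48 kHz
```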
## Before submitting

See the documentation guidelines, and the tips on formatting docstrings.
## Who can review?

## References