Model Details
🚀 ACE-Step v1.5 is a highly efficient open-source music foundation model designed to bring commercial-grade music generation to consumer hardware.
Key Features
- 💰 Commercial-Ready: Unlike many models trained on ambiguous datasets, ACE-Step v1.5 is designed for creators. The generated music may be used for commercial purposes.
- 📚 Safe & Robust Training Data: The model is trained on a massive, legally compliant dataset consisting of:
  - Licensed Data: Professionally licensed music tracks.
  - Royalty-Free / No-Copyright Data: A vast collection of public domain and royalty-free music.
  - Synthetic Data: High-quality audio generated via advanced MIDI-to-Audio conversion.
- ⚡ Extreme Speed: Generates a full song in under 2 seconds on an A100 and under 10 seconds on an RTX 3090.
- 🖥️ Consumer Hardware Friendly: Runs locally with less than 4GB of VRAM.
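To compare the sub-4 GB VRAM figure above against a local machine, a minimal sketch could look like the following. The helper names are hypothetical and not part of ACE-Step, and the optional probe assumes PyTorch with CUDA is installed. Total device memory is only a rough proxy, since other processes may already be using part of it.

```python
from typing import Optional

GIB = 1024 ** 3  # bytes per gibibyte


def meets_vram_budget(total_vram_bytes: int, required_gib: float = 4.0) -> bool:
    """Return True if the device has at least `required_gib` of total VRAM.

    The 4.0 default mirrors the "less than 4GB of VRAM" figure in the list above.
    """
    return total_vram_bytes >= required_gib * GIB


def detect_total_vram_bytes() -> Optional[int]:
    """Best-effort probe of GPU 0 via PyTorch; returns None if unavailable."""
    try:
        import torch  # assumption: PyTorch is the runtime in use
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


if __name__ == "__main__":
    total = detect_total_vram_bytes()
    if total is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        verdict = "meets" if meets_vram_budget(total) else "is below"
        print(f"GPU 0 has {total / GIB:.1f} GiB of VRAM, which {verdict} the 4 GiB budget.")
```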
Technical Capabilities
🌉 At its core lies a novel hybrid architecture in which the Language Model (LM) acts as an omni-capable planner. It transforms simple user queries into comprehensive song blueprints, scaling from short loops to 10-minute compositions, and synthesizes metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). ⚡ Uniquely, the alignment is achieved through intrinsic reinforcement learning that relies solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. 🎚️
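The planner-then-renderer data flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `SongBlueprint`, `plan_song`, and `dit_conditioning` are hypothetical names rather than the ACE-Step API, and both stages are stubs standing in for the real LM and DiT. The only value taken from the text is the 10-minute upper bound.

```python
from dataclasses import dataclass, field

MAX_DURATION_S = 600.0  # "10-minute compositions" from the paragraph above


@dataclass
class SongBlueprint:
    """Hypothetical container for what the planner LM emits."""
    caption: str
    lyrics: str
    metadata: dict = field(default_factory=dict)
    duration_s: float = 30.0

    def __post_init__(self) -> None:
        # Reject durations outside the range the text describes.
        if not 0 < self.duration_s <= MAX_DURATION_S:
            raise ValueError(f"duration_s must be in (0, {MAX_DURATION_S}]")


def plan_song(query: str, duration_s: float = 30.0) -> SongBlueprint:
    """Stub for the planner LM; a real LM would generate these fields."""
    return SongBlueprint(
        caption=f"A track matching the request: {query}",
        lyrics="[instrumental]",
        metadata={"source_query": query},
        duration_s=duration_s,
    )


def dit_conditioning(blueprint: SongBlueprint) -> dict:
    """Stub that packages the blueprint as conditioning inputs for the DiT."""
    return {
        "caption": blueprint.caption,
        "lyrics": blueprint.lyrics,
        "metadata": dict(blueprint.metadata),
        "duration_s": blueprint.duration_s,
    }


if __name__ == "__main__":
    blueprint = plan_song("mellow lo-fi beat for studying", duration_s=45.0)
    print(dit_conditioning(blueprint))
```

Keeping the blueprint as an explicit intermediate object reflects the split the text describes: the LM decides what the song should be, and the DiT only consumes that plan.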
🔮 Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities—such as cover generation, repainting, and vocal-to-BGM conversion—while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. 🎸
- Developed by: ACE-STEP
- Model type: Text2Music
- Language(s): 50+ languages
- License: MIT
Evaluation

🏗️ Architecture

🦁 Model Zoo

DiT Models
| DiT Model | Pre-Training | SFT | RL | CFG | Step | Refer audio | Text2Music | Cover | Repaint | Extract | Lego | Complete | Quality | Diversity | Fine-Tunability | Hugging Face |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acestep-v15-base |
LM Models
| LM Model | Pretrain from | Pre-Training | SFT | RL | CoT metas | Query rewrite | Audio Understanding | Composition Capability | Copy Melody | Hugging Face |
|---|---|---|---|---|---|---|---|---|---|---|
| acestep-5Hz-lm-0.6B | Qwen3-0.6B | ✅ | ✅ | ✅ | ✅ | ✅ |
🙏 Acknowledgements
This project is co-led by ACE Studio and StepFun.
📖 Citation
If you find this project useful for your research, please consider citing:
@misc{gong2026acestep,
  title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
  year={2026},
  note={GitHub repository}
}