
ACE-STEP: a new frontier in AI-based music generation
A fast and flexible open-source model that combines diffusion, deep compression and advanced control to create coherent and customizable music
Isabella V, 8 May 2025

 

ACE-STEP is an open-source model for music generation that combines speed, structural coherence and advanced controllability. Built on an innovative architecture, ACE-STEP integrates diffusion techniques, audio compression and lightweight transformers to produce high-quality music in a short time. It can generate songs up to 4 minutes long in just 20 seconds on an A100 GPU, and it offers features such as voice cloning, lyric editing and remixing. Designed for artists, producers and developers, ACE-STEP represents a significant step toward flexible foundation models for AI music.

Key points:

  • Unprecedented efficiency: generates 4 minutes of music in 20 seconds, 15 times faster than LLM-based models.
  • Advanced control: allows voice cloning, lyric editing and remixing with precision.
  • Innovative architecture: combines diffusion, DCAE and linear transformers for optimal performance.
  • Multilingual support: compatible with 19 languages and a variety of musical styles.

In the landscape of AI applied to music, ACE-STEP emerges as a reference model for music generation. Developed with the aim of overcoming the limitations of existing models, ACE-STEP integrates several technologies to offer a complete and flexible solution.

At the heart of ACE-STEP is an architecture that combines diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer. This combination preserves fine acoustic details while ensuring long-range structural coherence. During training, ACE-STEP uses MERT and m-HuBERT to align semantic representations (REPA), enabling rapid convergence and improving the alignment between text and music.
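To make that data flow concrete, the sketch below shows, in minimal Python/PyTorch pseudocode, how such a pipeline fits together: diffusion runs in the compressed latent space produced by a DCAE-style autoencoder, a lightweight transformer predicts the noise conditioned on the text embedding, and the decoder reconstructs the audio. Every module name, dimension and the sampling loop here are illustrative assumptions, not the actual ACE-STEP implementation.

    # Conceptual sketch of an ACE-STEP-style pipeline: diffusion in the latent
    # space of a deep-compression autoencoder (DCAE), denoised by a lightweight
    # transformer. All modules and sizes below are illustrative stubs.
    import torch
    import torch.nn as nn

    LATENT_DIM, LATENT_LEN, TEXT_DIM = 64, 256, 512   # assumed, not real sizes

    class DCAEDecoder(nn.Module):
        """Stand-in for the deep-compression autoencoder decoder: maps a short
        latent sequence back to a waveform-length signal."""
        def __init__(self, upsample=1024):
            super().__init__()
            self.proj = nn.Linear(LATENT_DIM, upsample)

        def forward(self, z):                        # (B, T, C) -> (B, T*upsample)
            return self.proj(z).flatten(1)

    class LinearTransformerDenoiser(nn.Module):
        """Stand-in for the lightweight transformer that predicts noise,
        conditioned on the text/lyrics embedding."""
        def __init__(self):
            super().__init__()
            self.cond = nn.Linear(TEXT_DIM, LATENT_DIM)
            self.body = nn.Sequential(nn.Linear(LATENT_DIM, LATENT_DIM),
                                      nn.GELU(),
                                      nn.Linear(LATENT_DIM, LATENT_DIM))

        def forward(self, z_t, text_emb):
            return self.body(z_t + self.cond(text_emb).unsqueeze(1))

    def generate(text_emb, steps=20):
        """Start from Gaussian noise in the compressed latent space and denoise
        iteratively; few steps are needed because the latents are short."""
        denoiser, decoder = LinearTransformerDenoiser(), DCAEDecoder()
        z = torch.randn(1, LATENT_LEN, LATENT_DIM)
        for _ in range(steps):
            z = z - denoiser(z, text_emb) / steps    # crude Euler-style update
        return decoder(z)                            # decoded "audio" signal

    audio = generate(torch.randn(1, TEXT_DIM))
    print(audio.shape)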

ACE-STEP's performance is remarkable: it can synthesize up to 4 minutes of music in just 20 seconds on an A100 GPU, making it 15 times faster than LLM-based models, and it offers higher musical consistency on melody, harmony and rhythm metrics. Advanced features include voice cloning, lyric editing, remixing and track generation, such as lyric-to-vocal conversion (Lyric2Vocal) and singing-to-accompaniment generation (Singing2Accompaniment).
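Taking those figures at face value, a quick back-of-the-envelope calculation shows what they imply in terms of real-time factor and how long an LLM-based baseline would need for the same track:

    # Back-of-the-envelope check of the reported throughput figures.
    audio_seconds = 4 * 60            # length of the generated track: 4 minutes
    ace_step_seconds = 20             # reported generation time on an A100 GPU

    real_time_factor = audio_seconds / ace_step_seconds
    print(f"ACE-STEP real-time factor: {real_time_factor:.0f}x")          # 12x

    # If ACE-STEP is ~15x faster than an LLM-based baseline, that baseline
    # would need roughly this long for the same 4-minute track:
    baseline_seconds = ace_step_seconds * 15
    print(f"Estimated LLM-baseline time: {baseline_seconds} s (~{baseline_seconds / 60:.0f} min)")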

ACE-STEP supports a wide range of musical styles and descriptions, including short tags, descriptive text and use-case scenarios. It is designed to be a foundation model for AI music, offering a fast, general-purpose, efficient and flexible architecture that facilitates training on downstream sub-tasks. This approach paves the way for powerful tools that integrate seamlessly into the creative workflows of musical artists, producers and content creators.
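In practice, a prompt typically pairs short style tags or a free-text description with optional lyrics. The snippet below is a hypothetical usage sketch only: the ACEStepPipeline class, its arguments and the checkpoint layout are assumptions made for illustration, so the project's repository should be consulted for the actual interface.

    # Hypothetical usage sketch: style tags + lyrics in, an audio file out.
    # Class name, arguments and checkpoint path are assumptions, not the
    # project's documented API.
    from acestep_demo import ACEStepPipeline   # hypothetical module and class

    pipeline = ACEStepPipeline(checkpoint_dir="./checkpoints")   # assumed layout

    tags = "synth-pop, 120 bpm, female vocal, dreamy, lo-fi"     # short style tags
    lyrics = (
        "[verse]\n"
        "Neon rain on empty streets\n"
        "[chorus]\n"
        "We keep dancing through the static\n"
    )

    pipeline(
        prompt=tags,           # descriptive text or comma-separated tags
        lyrics=lyrics,         # optional; omit for an instrumental track
        audio_duration=240,    # seconds (up to ~4 minutes)
        output_path="track.wav",
    )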

ACE-STEP is available as an open-source project, with code and models accessible on GitHub and Hugging Face. It was recently integrated into ComfyUI, expanding the possibilities of use for developers and creators. With support for 19 languages and compatibility with a variety of musical styles, ACE-STEP positions itself as a versatile resource for the music and technology community.
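For anyone who wants to fetch the released weights programmatically, the standard huggingface_hub client can be used; note that the repository id below is an assumption based on the project name and should be checked against the official Hugging Face page.

    # Download the published checkpoints with the huggingface_hub client.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="ACE-Step/ACE-Step-v1-3.5B",   # assumed repo id; verify on Hugging Face
        local_dir="./checkpoints",
    )
    print("Model files downloaded to:", local_dir)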

ACE-STEP represents a significant advance in the field of automatic music generation, offering a combination of speed, quality and controllability that sets it apart in the current landscape.