Nvidia developed an ultra-fast text-to-image (T2I) model that generates 1024x1024 images in 0.1 seconds: NVLabs SANA-Sprint
A new diffusion model that optimizes speed and efficiency in generating images from text, dramatically reducing inference steps and improving visual quality
Isabella V, 16 March 2025

 

SANA-Sprint is a new diffusion model for ultra-fast text-to-image generation that reduces inference steps from 20 to just 1-4 while maintaining high image quality.

Key Points:

  • Efficiency: Reduced inference steps from 20 to 1-4.
  • Quality: FID score of 7.59 and GenEval of 0.74 in a single pass.
  • Speed: Latency of 0.1 seconds for 1024x1024 images on H100 GPU.
  • Interactivity: ControlNet integration for immediate visual feedback.


In recent years, image generation from text has seen significant progress, with a focus on efficiency and quality. In this context, SANA-Sprint emerges as a promising solution, offering a novel approach that combines speed and accuracy in image synthesis.

SANA-Sprint builds on a pre-trained model, implementing a hybrid distillation strategy to optimize the inference process. Traditionally, diffusion models require around 20 steps to generate a high-quality image. SANA-Sprint dramatically reduces this number to just 1-4 steps, thanks to three main innovations.

The first innovation is a training-free method that adapts a pre-trained flow-matching model for continuous-time consistency distillation (sCM). This approach eliminates the need for expensive training from scratch, improving overall efficiency. The hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): while sCM keeps the student aligned with the teacher model, LADD improves the fidelity of images generated in a single step.
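To make the hybrid objective more concrete, here is a minimal PyTorch sketch of how a consistency-style distillation term and a latent adversarial term could be combined into one training loss. The student, teacher, and discriminator modules, the interpolation schedule, and the loss weighting are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def hybrid_distillation_loss(student, teacher, discriminator,
                             latents, t, text_emb, adv_weight=0.5):
    # Combine an sCM-style consistency term with a LADD-style adversarial term.
    # `t` is a per-sample time in [0, 1]; reshape it for broadcasting over latents.
    t = t.view(-1, 1, 1, 1)
    noise = torch.randn_like(latents)
    noisy = (1 - t) * latents + t * noise      # flow-matching style interpolation

    # Student predicts the clean latent in one shot.
    pred = student(noisy, t, text_emb)

    # Distillation term: match the frozen teacher's prediction.
    with torch.no_grad():
        target = teacher(noisy, t, text_emb)
    loss_scm = F.mse_loss(pred, target)

    # Adversarial term in latent space: push the discriminator to score
    # the student's one-step output as realistic (WGAN-style generator loss).
    loss_adv = -discriminator(pred, text_emb).mean()

    return loss_scm + adv_weight * loss_adv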

The second innovation is a unified, step-adaptive model: a single set of weights can generate high-quality images in 1-4 steps, with no separate training for each step count. This further increases the efficiency of the process.
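As a rough illustration of step-adaptive sampling, the sketch below runs the same distilled model for a user-chosen number of steps. The model interface, time schedule, and re-noising rule are hypothetical placeholders, not the official sampler.

import torch

@torch.no_grad()
def sample(model, text_emb, num_steps=2, latent_shape=(1, 4, 32, 32), device="cuda"):
    x = torch.randn(latent_shape, device=device)            # start from pure noise
    # Evenly spaced times from t=1 (noise) down to t=0 (clean latent).
    times = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = times[i].expand(latent_shape[0])
        pred_clean = model(x, t, text_emb)                   # predict the clean latent
        if i < num_steps - 1:
            # Re-noise the prediction to the next time point and continue.
            t_next = times[i + 1]
            x = (1 - t_next) * pred_clean + t_next * torch.randn_like(x)
        else:
            x = pred_clean
    return x

The same call works unchanged with num_steps set to 1, 2, or 4, which is what removes the need for per-step fine-tuning.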

The third innovation is the integration of ControlNet, which enables interactive generation of images in real time, providing immediate visual feedback and improving user interaction.
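The sketch below illustrates the general ControlNet pattern of encoding a spatial hint (such as an edge map or a user sketch) and injecting its features into the generator as residuals; the module names and the extra_residual hook are hypothetical and do not reflect SANA-Sprint's actual interfaces.

import torch.nn as nn

class ControlledGenerator(nn.Module):
    def __init__(self, backbone: nn.Module, control_encoder: nn.Module):
        super().__init__()
        self.backbone = backbone                  # the distilled T2I model
        self.control_encoder = control_encoder    # lightweight branch that reads the hint

    def forward(self, noisy_latent, t, text_emb, control_image):
        # Encode the user's edge map / sketch into spatial features.
        control_feats = self.control_encoder(control_image)
        # Inject them as an additive residual alongside the usual inputs
        # (assumes the backbone exposes such a hook).
        return self.backbone(noisy_latent, t, text_emb, extra_residual=control_feats)

Because generation takes only a fraction of a second, an interactive drawing application can re-run this forward pass on every stroke and show the updated image almost immediately.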

SANA-Sprint's performance is remarkable: in a single step it achieves an FID of 7.59 and a GenEval score of 0.74, beating FLUX-schnell (7.94 FID / 0.71 GenEval) while being about ten times faster (0.1 seconds vs. 1.1 seconds on an H100 GPU). At 1024x1024 resolution, SANA-Sprint reaches a latency of 0.1 seconds for text-to-image (T2I) generation and 0.25 seconds with ControlNet on an H100 GPU; on an RTX 4090, T2I latency is 0.31 seconds, demonstrating exceptional efficiency and significant potential for consumer AI applications.
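For readers who want to reproduce latency figures like these on their own hardware, a minimal measurement loop might look like the sketch below. Here generate is a stand-in for whatever function wraps the model's end-to-end pipeline; warm-up iterations and torch.cuda.synchronize() ensure the timer reflects completed GPU work rather than queued kernels.

import time
import torch

def measure_latency(generate, prompt, warmup=3, runs=10):
    for _ in range(warmup):                  # warm up kernels and caches
        generate(prompt)
    torch.cuda.synchronize()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        torch.cuda.synchronize()             # wait for the GPU to finish before stopping the clock
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)       # average seconds per image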

The developers of SANA-Sprint have announced their intention to open-source the code and pre-trained models, enabling further development and applications across many fields, from marketing materials to game development. For companies specializing in AI-based content creation in particular, SANA-Sprint offers the opportunity to accelerate and simplify the production of high-quality images.

SANA-Sprint represents a significant step forward in text-to-image generation, combining efficiency, quality, and interactivity in a single innovative solution.