
From StepFun a new open source model for text-to-video
Step-Video-T2V weights released, a high quality model
Isabella V · 17 February 2025


Step-Video-T2V is an advanced text-to-video model with 30 billion parameters, capable of generating high-quality videos of up to 204 frames. Its efficiency comes from a VAE with high spatial and temporal compression, optimized for generating smooth, realistic video. Direct Preference Optimization (DPO) further improves visual quality, and evaluation on a proprietary benchmark confirms SoTA-level performance.

Key Points:

  • Advanced Compression: Video-VAE with 16x16 spatial and 8x temporal ratio to optimize resources and quality.
  • Sophisticated Architecture: DiT-based model with 3D attention and optimizations for stability and consistency.
  • Human-Driven Optimization: DPO employed to improve fidelity and naturalness of animations.
  • High Hardware Requirements: NVIDIA GPU with at least 80GB of memory required for optimal performance.

Step-Video-T2V is a state-of-the-art text-to-video model designed to generate high-quality videos from text descriptions. With 30 billion parameters, the system can produce animated sequences of up to 204 frames, supported by an advanced compression infrastructure. The key to Step-Video-T2V's efficiency is the integration of a Video-VAE with a 16x16 spatial and 8x temporal compression ratio, which substantially accelerates both training and inference. This design sharply reduces the computational load without compromising the quality of video reconstruction, a key element for performance optimization.
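To make the compression ratios concrete, a minimal sketch of how the reported 16x16 spatial and 8x temporal factors shrink a video into the latent space (the clip dimensions below are illustrative, not taken from the source):

```python
# Hypothetical sketch: effect of Step-Video-T2V's reported compression
# ratios (16x16 spatial, 8x temporal) on the latent tensor's shape.
def latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Return (latent_frames, latent_h, latent_w) after VAE compression."""
    return (frames // t_ratio, height // s_ratio, width // s_ratio)

# A 204-frame clip at an illustrative 544x992 resolution collapses to a
# far smaller latent grid, which is what the DiT actually denoises:
print(latent_shape(204, 544, 992))  # -> (25, 34, 62)
```

Each latent cell thus stands in for an 8x16x16 block of pixels, which is why the DiT can afford full attention over the whole sequence.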

The model architecture is based on a Diffusion Transformer (DiT) with 48 layers, each with 48 attention heads of dimension 128. The generation process uses a Flow Matching denoising scheme, which transforms noisy inputs into latent frames through a full 3D attention mechanism. Prompts are processed by two pre-trained text encoders that support both English and Chinese, ensuring greater accessibility and adaptability to different linguistic needs.
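The Flow Matching idea mentioned above can be sketched in a few lines. This is an illustrative toy, not StepFun's training code: the denoiser learns a velocity field that transports a noise sample toward a data sample along a straight interpolation path.

```python
import numpy as np

# Toy flow-matching sketch (illustrative; not Step-Video-T2V's code).
# Along the linear path x_t = (1 - t) * noise + t * data, the velocity
# the model must regress is constant: d(x_t)/dt = data - noise.
def interpolate(noise, data, t):
    return (1.0 - t) * noise + t * data

def target_velocity(noise, data):
    return data - noise

rng = np.random.default_rng(0)
data = rng.standard_normal((4, 8))    # stand-in for latent frames
noise = rng.standard_normal((4, 8))
x_mid = interpolate(noise, data, 0.5)
v = target_velocity(noise, data)
# Integrating the true velocity from t=0.5 to t=1 recovers the data exactly.
assert np.allclose(x_mid + 0.5 * v, data)
```

At inference time the learned velocity field is integrated from pure noise (t=0) to a clean latent (t=1), which the VAE decoder then turns into frames.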

A distinctive element of Step-Video-T2V is the adoption of Direct Preference Optimization (DPO), which introduces a refinement stage based on human feedback. This process improves the visual quality of the generated videos, reducing artifacts and producing a more natural, realistic rendering. The DPO pipeline is an essential step in optimizing the model, since it aligns its output with users' perceptual expectations, increasing the level of detail and consistency in the final video sequences.
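The standard DPO objective behind this refinement stage can be sketched as follows. This is a hedged illustration of the general DPO loss, not StepFun's implementation: given a human preference pair, the loss pushes the policy to raise the likelihood of the preferred sample relative to a frozen reference model.

```python
import math

# Generic DPO loss sketch (illustrative): log-probabilities are for the
# preferred ("chosen") and dispreferred ("rejected") samples under the
# current policy and a frozen reference model; beta scales the margin.
def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen sample.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy favors the chosen sample more than the reference does,
# the margin is positive and the loss drops below log(2):
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and reference agree exactly, the margin is zero and the loss sits at log(2); training drives it downward by widening the preference gap.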

To ensure maximum performance, Step-Video-T2V has been tested on a system equipped with four NVIDIA GPUs, with CUDA support required for the self-attention in the text encoders to run correctly. QK-Norm in self-attention and 3D RoPE for handling temporal sequences ensure the model's stability and adaptability even in complex generation scenarios. Furthermore, to make optimal use of GPU resources, the model decouples the text encoder, the VAE decoder, and the DiT transformer, enabling more efficient and balanced inference.
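The decoupling strategy described above can be sketched as three independent stages wired together; the function names below are illustrative stand-ins, not the project's actual API, but they show why separating the stages lets each model be loaded and offloaded on its own.

```python
# Hypothetical sketch of a decoupled text-to-video pipeline: the text
# encoders, the DiT denoiser, and the Video-VAE decoder run as separate
# stages, so each can occupy the GPU only while it is needed.
def run_pipeline(prompt, encode_text, denoise, decode_latents, steps=50):
    text_emb = encode_text(prompt)        # stage 1: text encoders
    latents = denoise(text_emb, steps)    # stage 2: DiT denoising loop
    return decode_latents(latents)        # stage 3: Video-VAE decoder

# Toy stand-ins make the staged data flow visible:
frames = run_pipeline(
    "a turtle swimming",
    encode_text=lambda p: f"emb({p})",
    denoise=lambda e, n: f"latents({e},{n})",
    decode_latents=lambda l: f"video({l})",
)
print(frames)
```

Because each stage only needs its own weights resident, peak memory is bounded by the largest single stage rather than the sum of all three.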

Step-Video-T2V's performance was evaluated on Step-Video-T2V-Eval, a benchmark developed specifically to measure the quality of generated videos. The results confirm that the model reaches SoTA level in text-based video generation, outperforming both open-source engines and commercially available solutions. However, the tests show that inference hyperparameters affect the balance between visual fidelity and animation dynamics, so careful tuning is needed to reach the best trade-off between quality and smoothness in the generated video.

Step-Video-T2V represents a significant step forward in automatic video generation, combining compression technology, human-preference-based optimization, and a high-performance hardware infrastructure.