
Meta Introduces Apollo: A New Standard for Advanced Video Analytics
An open source multimodal model that addresses the complexities of video understanding with efficiency and accuracy
Isabella V, 18 December 2024

 

Meta introduced Apollo, an open source multimodal model that redefines video understanding in AI, setting new standards for accuracy and efficiency in processing complex video sequences.

Key Points:

  • Breakthrough Model: Apollo is a large multimodal model (LMM) designed to understand video with unprecedented accuracy and scalability.
  • Technical Challenge: The complexity of video requires advanced handling of spatial, temporal, and contextual data.
  • Innovations: Strategies such as optimized sampling and the use of specific visual encoders dramatically improve performance.
  • Outstanding Performance: Apollo-3B and Apollo-7B outperform larger models on advanced benchmarks such as LongVideoBench and MLVU.

Understanding video is one of the most complex challenges in AI, requiring technologies that can handle massive amounts of spatial and temporal data. Unlike static images or text, videos combine temporal dynamics and contextual detail across a sequence of frames, which sharply increases the computational load required to analyze them. Meta addressed this challenge by developing Apollo, an open-source large multimodal model (LMM) that redefines the rules of the game. Apollo is an advanced, scalable solution designed to extract meaning from video with unprecedented efficiency, contributing not only to research but also to a wide range of practical applications.

A key aspect of Apollo’s progress is the adoption of innovative training approaches. These include frame sampling based on frames per second (fps), which has proven more effective than uniform sampling, and the optimization of visual encoder architectures, which are key to representing video information accurately. These elements, combined with consistent scaling behavior, allow design decisions to transfer effectively from smaller models and datasets to larger, more complex ones. Meta also carefully studied data composition, training schedules, and other variables to ensure that Apollo can perceive and analyze long-form video, a feat rarely accomplished by previous models.
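To make the sampling distinction concrete, here is a minimal Python sketch of the two strategies. It is an illustration only: the function names and parameters are hypothetical and do not reflect Apollo’s actual implementation.

```python
# Minimal sketch: uniform frame sampling vs. fps-based frame sampling.
# Names and parameters are illustrative, not Apollo's API.

def uniform_sample(num_frames: int, budget: int) -> list[int]:
    """Pick `budget` frame indices spread evenly across the whole clip,
    regardless of how long the clip actually runs."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]

def fps_sample(num_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at a fixed temporal rate (e.g. 2 frames per second),
    so sampling density stays constant as the clip gets longer."""
    stride = max(1, round(native_fps / target_fps))
    return list(range(0, num_frames, stride))

# Example: a 60-second clip recorded at 30 fps (1,800 frames).
# A uniform budget of 32 frames means roughly one frame every 1.9 seconds;
# fps-based sampling at 2 fps keeps one frame every 0.5 seconds,
# independent of clip length.
print(len(uniform_sample(1800, 32)))     # 32
print(len(fps_sample(1800, 30.0, 2.0)))  # 120
```

The contrast illustrates the point made above: a fixed uniform budget spreads frames ever more thinly as clips grow longer, while fps-based sampling keeps temporal density constant.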

The results are impressive: the relatively small Apollo-3B outperforms models with 7 billion parameters on benchmark tests, scoring 55.1 on LongVideoBench. Apollo-7B stands out even further, setting new standards with scores of 70.9 on MLVU and 63.3 on Video-MME, demonstrating how an efficient architecture and thorough training can compensate for smaller size. Apollo’s ability to handle up to an hour of video with precision and efficiency positions it as a gold standard in the multimodal AI landscape.

This remarkable progress reflects not only a commitment to innovation, but also the importance of open research to understand and optimize the mechanisms that drive video perception in LMM models.

Apollo represents a milestone for anyone who wants to explore the potential of AI in video interaction, pushing the boundaries of what is possible.
