Intel Officially Launches Gaudi 3: The New AI Accelerator | Turtles AI

Intel Officially Launches Gaudi 3: The New AI Accelerator
Intel Introduces Gaudi 3: The AI Accelerator for Advanced Performance and Optimized Costs
Isabella V, 25 September 2024


Intel today announced the availability of its new Gaudi 3 line of AI accelerators, with shipments starting next month. Designed to improve performance and cost-effectiveness in AI, the new products aim to compete directly with solutions already on the market by offering advanced processing and memory capabilities.

Key points:

  •  Up to 1835 TFLOPS of peak FP8 processing and 128 GB of HBM2e memory.
  •  Configurations with support for PCIe Gen5 x16 and 200 GbE RDMA NICs.
  •  A 9% inference uplift on LLaMA 3 8B versus the H100, plus a competitive cost advantage.
  •  Integrated software solution to support major AI frameworks.


Intel has officially released the new Gaudi 3 series of AI accelerators, which will include several configurations, such as the HL-325L accelerator (OAM-compliant), the HLB-325 Universal Baseboard, and the HL-388 PCIe CEM add-in card. Among these, the Intel Gaudi 3 PCIe CEM represents an important technological evolution. Offering up to 1835 TFLOPS of peak FP8 compute, it features 128 GB of HBM2e memory, a 600W TDP, eight matrix multiplication engines (MMEs), 64 Tensor Processing Cores (TPCs), and 22 × 200 GbE RDMA NICs, all packed into a dual-slot form factor.

As for internal memory, the OAM solution will be equipped with 96 MB of SRAM divided into two stacks, with a total bandwidth of 3.67 TB/s for HBM and 19.2 TB/s for on-die (L2) SRAM. Each matrix multiplication engine features a 256x256 MAC array with FP32 accumulators, delivering 64K MACs per cycle for both BF16 and FP8 data.
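The quoted MAC-array figures line up with the headline FP8 number. A back-of-the-envelope check (the clock frequency below is an assumption back-solved from the published peak, not an official Intel spec):

```python
# Sanity check of the 1835 TFLOPS FP8 peak from the specs above.
MME_COUNT = 8          # matrix multiplication engines
MAC_ARRAY = 256 * 256  # 64K MACs per engine per cycle (BF16/FP8)
FLOPS_PER_MAC = 2      # one multiply plus one add
CLOCK_GHZ = 1.75       # assumed; inferred from the stated 1835 TFLOPS peak

peak_tflops = MME_COUNT * MAC_ARRAY * FLOPS_PER_MAC * CLOCK_GHZ * 1e9 / 1e12
print(f"Estimated peak FP8: {peak_tflops:.0f} TFLOPS")  # -> 1835 TFLOPS
```

At 8 engines x 65,536 MACs x 2 FLOPs, the chip performs about a million FLOPs per cycle, so a clock of roughly 1.75 GHz reproduces the quoted peak.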

The Tensor Processing Core (TPC) is a 256-byte-wide programmable SIMD vector processor, supporting 1-, 2-, and 4-byte floating-point and integer data types. In parallel, the HLB-325 Universal Baseboard supports up to four Gaudi 3 accelerators, with 200 GbE and 400 GbE interconnect links via QSFP-DD controllers and a PCIe Gen5 x16 host interface, providing up to 1800 GB/s of bandwidth for vertical (scale-up) and 800 GB/s for horizontal (scale-out) scalability. This platform is optimized for small-scale AI model inference and training tasks.
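The memory bandwidths quoted above imply a roofline "ridge point": roughly how many FLOPs a kernel must perform per byte of memory traffic before it becomes compute-bound rather than bandwidth-bound. An illustrative calculation from the published figures:

```python
# Roofline ridge points for Gaudi 3 (OAM), from the figures quoted above.
PEAK_FLOPS = 1835e12   # FP8 peak, FLOP/s
HBM_BW = 3.67e12       # HBM bandwidth, bytes/s
SRAM_BW = 19.2e12      # on-die L2 SRAM bandwidth, bytes/s

ridge_hbm = PEAK_FLOPS / HBM_BW    # FLOPs needed per byte of HBM traffic
ridge_sram = PEAK_FLOPS / SRAM_BW  # FLOPs needed per byte of SRAM traffic
print(f"HBM ridge point:  {ridge_hbm:.0f} FLOP/byte")   # -> 500
print(f"SRAM ridge point: {ridge_sram:.0f} FLOP/byte")  # -> ~96
```

The roughly 5x gap between the two ridge points is why keeping working sets in the 96 MB on-die SRAM matters for sustained matmul throughput.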

In terms of performance, Gaudi 3 offers concrete improvements in AI workloads. In inference on models such as LLaMA 3 8B, performance is 9 percent higher than competing solutions such as the H100, while on LLaMA 70B throughput improves by 19 percent, yielding twice the processing capacity per dollar spent compared to the H100. Intel’s reference node for Gaudi 3, the HLS-3, will be equipped with Intel Xeon 6900P CPUs and up to eight OAM cards, providing a total bandwidth of up to 67.2 Tb/s in scale-up configurations.

Rounding out the offering, Intel’s software platform for Gaudi 3 supports lower-precision data types such as FP16, BF16, and FP8, including the associated quantization workflows, and is compatible with the major frameworks used in generative AI. Intel is collaborating with several industry partners, including Dell Technologies, HPE, and Supermicro for hardware, and companies such as IBM, Infosys, and LUMEN for the software ecosystem. The release of Gaudi 3 represents an important step in the evolution of AI acceleration solutions, aimed at improving performance without compromising power efficiency or cost.
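As a generic illustration of what FP8 quantization involves (not Intel's Gaudi software implementation), the core idea is to rescale tensors so their dynamic range fits the narrow FP8 format, then clamp to its representable limits:

```python
# Toy scaled-quantization sketch for an FP8 E4M3-style range. This only
# models scaling and clamping, not FP8 mantissa rounding; real stacks
# (including Gaudi's) handle far more, e.g. per-tensor amax tracking.
E4M3_MAX = 448.0  # largest finite value in the common FP8 E4M3 format

def quantize_fp8(values, amax):
    """Scale so that amax maps to the FP8 range limit, then clamp."""
    scale = E4M3_MAX / amax
    return [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values], scale

def dequantize_fp8(q_values, scale):
    return [q / scale for q in q_values]

acts = [0.5, -1.2, 3.0, -0.1]
q, s = quantize_fp8(acts, amax=3.0)
restored = dequantize_fp8(q, s)  # round-trips, ignoring FP8 rounding error
```

The payoff of running matmuls in FP8 on hardware like Gaudi 3 is that each MAC array processes the same 64K operations per cycle on half-width data, doubling effective throughput over BF16 where accuracy permits.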

Intel thus strengthens its presence in a highly competitive market, aiming to offer more affordable and high-performance solutions that meet the growing needs of the AI industry.