AWS AI Infrastructure with NVIDIA Blackwell

Source: aws.amazon.com

Published on July 10, 2025

Imagine a system that can explore multiple approaches to a complex problem, drawing on its understanding of data, from datasets to source code to business documents, and reasoning through the possibilities. This kind of reasoning is happening today in customers' AI production environments. AI systems at this scale are being built across drug discovery, enterprise search, and software development.

To accelerate innovation across generative AI developments like reasoning models and agentic AI systems, we are announcing the general availability of P6e-GB200 UltraServers, accelerated by NVIDIA Grace Blackwell Superchips. P6e-GB200 UltraServers are designed for training and deploying large-scale AI models. Earlier this year, we launched P6-B200 instances, accelerated by NVIDIA Blackwell GPUs, for AI and high-performance computing workloads. Both offerings build on our track record of delivering secure, reliable GPU infrastructure at scale, so customers can keep pushing the boundaries of AI.

P6e-GB200 UltraServers

P6e-GB200 UltraServers are our most powerful GPU offering to date, featuring up to 72 NVIDIA Blackwell GPUs interconnected using fifth-generation NVIDIA NVLink and functioning as a single compute unit. Each UltraServer delivers 360 petaflops of dense FP8 compute and 13.4 TB of total high-bandwidth GPU memory (HBM3e), over 20 times the compute and over 11 times the memory in a single NVLink domain compared to P5en instances. P6e-GB200 UltraServers support up to 28.8 Tbps of aggregate fourth-generation Elastic Fabric Adapter (EFAv4) networking bandwidth.
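As a quick sanity check, the per-GPU figures implied by those UltraServer totals can be derived with simple division. This is illustrative arithmetic over the numbers quoted above, not an official per-GPU specification:

```python
# Back-of-the-envelope per-GPU figures derived from the quoted
# UltraServer totals (illustrative arithmetic only).

GPUS_PER_ULTRASERVER = 72
TOTAL_FP8_PFLOPS = 360      # dense FP8 compute per UltraServer
TOTAL_HBM3E_TB = 13.4       # total high-bandwidth GPU memory

fp8_per_gpu = TOTAL_FP8_PFLOPS / GPUS_PER_ULTRASERVER           # 5.0 PFLOPS
hbm_per_gpu_gb = TOTAL_HBM3E_TB * 1000 / GPUS_PER_ULTRASERVER   # ~186 GB

print(f"{fp8_per_gpu:.1f} PFLOPS dense FP8 per GPU")
print(f"~{hbm_per_gpu_gb:.0f} GB HBM3e per GPU")
```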

P6-B200 Instances

P6-B200 instances are a versatile option for a broad range of AI use cases. Each instance provides 8 NVIDIA Blackwell GPUs interconnected using NVLink with 1.4 TB of high-bandwidth GPU memory, up to 3.2 Tbps of EFAv4 networking, and fifth-generation Intel Xeon Scalable processors. P6-B200 instances offer up to 2.25 times the GPU TFLOPs, 1.27 times the GPU memory size, and 1.6 times the GPU memory bandwidth compared to P5en instances.
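Assuming the aggregate figures divide evenly across the 8 GPUs (an assumption for illustration; the 1.4 TB memory figure is rounded), the per-GPU share works out as follows:

```python
# Rough per-GPU share of the quoted P6-B200 aggregate figures,
# assuming an even split across 8 GPUs (illustrative only; the
# 1.4 TB total is a rounded number, so the memory result is approximate).

GPUS_PER_INSTANCE = 8
EFA_AGGREGATE_TBPS = 3.2    # EFAv4 networking per instance
HBM_TOTAL_TB = 1.4          # total high-bandwidth GPU memory (rounded)

efa_per_gpu_gbps = EFA_AGGREGATE_TBPS * 1000 / GPUS_PER_INSTANCE  # 400 Gbps
hbm_per_gpu_gb = HBM_TOTAL_TB * 1000 / GPUS_PER_INSTANCE          # ~175 GB

print(f"~{efa_per_gpu_gbps:.0f} Gbps network bandwidth per GPU")
print(f"~{hbm_per_gpu_gb:.0f} GB HBM per GPU")
```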

The choice between P6e-GB200 and P6-B200 comes down to your workload requirements and architectural needs. Bringing NVIDIA Blackwell to AWS isn't about a single breakthrough; it's about continuous innovation across the entire infrastructure stack. By building on years of learning and innovation across compute, networking, operations, and managed services, we have made NVIDIA Blackwell's capabilities available with the reliability and performance customers expect.

Customers value our focus on instance security and stability. The AWS Nitro System's hardware, software, and firmware are designed to enforce restrictions so that nobody, including anyone at AWS, can access sensitive AI workloads and data. The Nitro System handles networking, storage, and other I/O functions, making it possible to deploy firmware updates, bug fixes, and optimizations while the system remains operational. This ability to update without downtime, called live update, is crucial for workloads where any interruption impacts production timelines. P6e-GB200 and P6-B200 both feature the sixth generation of the Nitro System. The Nitro architecture has been protecting and optimizing Amazon Elastic Compute Cloud (Amazon EC2) workloads since 2017.

In AI infrastructure, the challenge is delivering performance and reliability at scale. P6e-GB200 UltraServers are deployed in third-generation EC2 UltraClusters, which create a single fabric that can encompass our largest data centers. Third-generation UltraClusters cut power consumption by up to 40% and reduce cabling requirements by more than 80%. To deliver performance at scale, Elastic Fabric Adapter (EFA) with its Scalable Reliable Datagram (SRD) protocol routes traffic across multiple network paths to maintain operation during congestion or failures. EFA's performance has improved across four generations: P6e-GB200 and P6-B200 instances with EFAv4 show up to 18% faster collective communications in distributed training compared to P5en instances that use EFAv3.
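The multipath idea behind SRD can be sketched in a few lines. This is a toy model only: the real protocol runs in the Nitro card, and the path selection and retransmission logic there are far more sophisticated than this spray-and-reassemble illustration.

```python
import random

# Toy illustration of the multipath idea behind EFA's Scalable Reliable
# Datagram (SRD) protocol: packets of one flow are sprayed across many
# network paths and may arrive out of order; the receiver reassembles
# them by sequence number. (Conceptual sketch only, not the real protocol.)

NUM_PATHS = 4
CONGESTED_PATH = 2  # pretend this path is congested or has failed

def spray(packets, num_paths=NUM_PATHS):
    """Assign each packet to a healthy path, avoiding the congested one."""
    healthy = [p for p in range(num_paths) if p != CONGESTED_PATH]
    return [(seq, random.choice(healthy)) for seq in packets]

def reassemble(arrivals):
    """Receiver orders packets by sequence number, not arrival order."""
    return [seq for seq, _path in sorted(arrivals)]

packets = list(range(8))
in_flight = spray(packets)
random.shuffle(in_flight)           # different paths => out-of-order arrival
assert reassemble(in_flight) == packets
```

Because no single flow depends on a single path, a congested or failed link degrades throughput gracefully instead of stalling the whole collective operation.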

While P6-B200 instances use proven air-cooling infrastructure, P6e-GB200 UltraServers use liquid cooling, which enables higher compute density in large NVLink domain architectures and, in turn, higher system performance. P6e-GB200 UltraServers are liquid cooled with mechanical cooling solutions that provide liquid-to-chip cooling, so we can support liquid-cooled accelerators alongside air-cooled network and storage infrastructure in the same facility. This cooling design lets us deliver performance and efficiency at the lowest cost.

It's simple to get started with P6e-GB200 UltraServers and P6-B200 instances through multiple deployment paths, so you can begin using Blackwell GPUs quickly while maintaining the operational model that works best for you.

If you're accelerating AI development and want to spend less time managing infrastructure and cluster operations, Amazon SageMaker HyperPod provides managed, resilient infrastructure that automatically handles provisioning and management of GPU clusters. SageMaker HyperPod will support both P6e-GB200 UltraServers and P6-B200 instances, with optimizations that maximize performance by keeping workloads within the same NVLink domain. A multi-layered recovery system is built in: SageMaker HyperPod will automatically replace faulty instances with preconfigured spares in the same NVLink domain. Built-in dashboards will give you visibility into everything from GPU utilization and memory usage to workload metrics and UltraServer health status.
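To make the managed path concrete, here is a sketch of the request payload a SageMaker HyperPod cluster creation might take. The cluster name, instance type string, S3 URI, and role ARN below are all placeholders or assumptions for illustration; check the SageMaker HyperPod documentation for the exact supported values before use.

```python
import json

# Sketch of a SageMaker HyperPod CreateCluster request for a
# Blackwell-based cluster. All names, the instance type string, the
# S3 URI, and the role ARN are hypothetical placeholders.

create_cluster_request = {
    "ClusterName": "blackwell-training",              # hypothetical name
    "InstanceGroups": [
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p6-b200.48xlarge",    # assumed type name
            "InstanceCount": 2,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle/",  # placeholder
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodRole",
        }
    ],
}

# The request would then be sent with boto3 (not executed here):
#   boto3.client("sagemaker").create_cluster(**create_cluster_request)
print(json.dumps(create_cluster_request, indent=2))
```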

For large-scale AI workloads, if you prefer to manage your infrastructure using Kubernetes, Amazon Elastic Kubernetes Service (Amazon EKS) provides the control plane. Amazon EKS will support both P6e-GB200 UltraServers and P6-B200 instances with automated provisioning and lifecycle management through managed node groups. For P6e-GB200 UltraServers, built-in topology awareness understands the GB200 NVL72 architecture, automatically labeling nodes with their UltraServer ID and network topology information to enable optimal workload placement. You will be able to span node groups across multiple UltraServers or dedicate them to individual UltraServers, giving you flexibility in organizing your training infrastructure. Amazon EKS monitors GPU and accelerator errors and relays them to the Kubernetes control plane for remediation.
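The topology-aware placement described above can be illustrated with a small grouping exercise: given node labels carrying an UltraServer ID, group nodes so a job can be pinned to a single NVLink domain. The label key below is a hypothetical stand-in, not the actual label Amazon EKS applies; consult the EKS documentation for the real label names.

```python
from collections import defaultdict

# Illustrative topology-aware grouping: collect Kubernetes nodes by a
# (hypothetical) UltraServer-ID label so a training job can be scheduled
# entirely within one NVLink domain.

ULTRASERVER_LABEL = "example.aws/ultraserver-id"  # hypothetical label key

nodes = [
    {"name": "node-a", "labels": {ULTRASERVER_LABEL: "us-01"}},
    {"name": "node-b", "labels": {ULTRASERVER_LABEL: "us-01"}},
    {"name": "node-c", "labels": {ULTRASERVER_LABEL: "us-02"}},
]

def group_by_ultraserver(nodes):
    """Map UltraServer ID -> node names sharing that NVLink domain."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node["labels"][ULTRASERVER_LABEL]].append(node["name"])
    return dict(groups)

placement = group_by_ultraserver(nodes)
# Pin a job to whichever NVLink domain currently has the most free nodes.
target = max(placement, key=lambda k: len(placement[k]))
assert target == "us-01"
```

In practice you would express the same intent declaratively, for example with a node selector or affinity rule keyed on the real topology labels, rather than grouping nodes by hand.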

P6e-GB200 UltraServers will also be available through NVIDIA DGX Cloud. DGX Cloud is an AI platform optimized at every layer with multi-node AI training and inference capabilities and NVIDIA’s AI software stack. It offers flexible term lengths along with NVIDIA expert support and services to help you accelerate your AI initiatives.

This launch is a significant milestone. As AI capabilities evolve, you need infrastructure built not only for today's demands but also for the possibilities ahead. With innovations across compute, networking, operations, and managed services, P6e-GB200 UltraServers and P6-B200 instances are ready to enable those possibilities.

David Brown is the Vice President of AWS Compute and Machine Learning (ML) Services. He is responsible for building all AWS Compute and ML services, including Amazon EC2, Amazon Container Services, AWS Lambda, Amazon Bedrock and Amazon SageMaker. He also leads solutions, such as AWS Outposts, that bring AWS services into customers’ private data centers.