Google unveils TPU v4 pods, 9 exaflop AI cluster
To support the next generation of fundamental advances in artificial intelligence (AI), the company announced the Google Cloud machine learning cluster with Cloud TPU v4 Pods in Preview — claimed as one of the fastest, most efficient, and most sustainable ML infrastructure hubs in the world. Powered by Cloud TPU v4 Pods, the ML cluster is designed to enable researchers and developers to make breakthroughs at the forefront of AI, allowing them to train increasingly sophisticated models to power workloads such as large-scale natural language processing (NLP), recommendation systems, and computer vision algorithms.
“At 9 exaflops of peak aggregate performance,” says Sachin Gupta, Vice President and GM, Infrastructure and Max Sapozhnikov, Product Manager, Cloud TPU, “we believe our cluster of Cloud TPU v4 Pods is the world’s largest publicly available ML hub in terms of cumulative computing power, while operating at 90% carbon-free energy.”
Building on its announcement of Cloud TPU v4 at Google I/O 2021, the company granted early access to Cloud TPU v4 Pods to several top AI research teams, including Cohere, LG AI Research, Meta AI, and Salesforce Research.
“Researchers liked the performance and scalability that TPU v4 provides with its fast interconnect and optimized software stack, the ability to set up their own interactive development environment with our new TPU VM architecture, and the flexibility to use their preferred frameworks, including JAX, PyTorch, or TensorFlow,” says the company. “These characteristics allow researchers to push the boundaries of AI, training large-scale, state-of-the-art ML models with high price-performance and carbon efficiency.”
In addition, says the company, TPU v4 has enabled breakthroughs at Google Research in the areas of language understanding, computer vision, speech recognition, and much more, including the recently announced Pathways Language Model (PaLM) trained across two TPU v4 Pods.
“In order to make advanced AI hardware more accessible, a few years ago we launched the TPU Research Cloud (TRC) program that has provided access at no charge to TPUs to thousands of ML enthusiasts around the world,” says Jeff Dean, SVP, Google Research and AI. “They have published hundreds of papers and open-source GitHub libraries on topics ranging from ‘Writing Persian poetry with AI’ to ‘Discriminating between sleep and exercise-induced fatigue using computer vision and behavioral genetics’. The Cloud TPU v4 launch is a major milestone for both Google Research and our TRC program, and we are very excited about our long-term collaboration with ML developers around the world to use AI for good.”
As part of the company’s commitment to sustainability, it has been matching 100% of its data centers’ and cloud regions’ annual energy consumption with renewable energy purchases since 2017. By 2030, says the company, its goal is to run its entire business on carbon-free energy (CFE) every hour of every day.
In addition to the direct clean energy supply, the data center has a Power Usage Effectiveness (PUE) rating of 1.10, making it one of the most energy-efficient data centers in the world. The TPU v4 chip itself is also highly energy efficient, delivering about 3x the peak FLOPs per watt of maximum power of TPU v3. With energy-efficient ML-specific hardware, in a highly efficient data center, supplied by exceptionally clean power, says the company, Cloud TPU v4 combines three key best practices that can significantly reduce energy use and carbon emissions.
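PUE is the ratio of total facility energy to the energy delivered to the computing equipment itself, so a rating of 1.10 means roughly 10% overhead for cooling, power distribution, and the like. A minimal sketch of the calculation (the kWh figures below are illustrative, not from the article):

```python
# PUE = total facility energy / IT equipment energy.
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: closer to 1.0 means less overhead."""
    return total_facility_kwh / it_equipment_kwh

# Illustrative: a facility drawing 110 kWh to deliver 100 kWh of compute
# has a PUE of 1.10, i.e. ~10% overhead beyond the IT load itself.
print(pue(total_facility_kwh=110.0, it_equipment_kwh=100.0))  # -> 1.1
```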
“In addition to sustainability,” says the company, “in our work with leading ML teams we have observed two other pain points: scale and price-performance. Our ML cluster in Oklahoma offers the capacity that researchers need to train their models, at compelling price-performance, on the cleanest cloud in the industry. Cloud TPU v4 is central to solving these challenges.”
Each Cloud TPU v4 Pod consists of 4,096 chips connected via an ultra-fast interconnect network with the equivalent of an industry-leading 6 terabits per second (Tbps) of bandwidth per host, enabling rapid training of the largest models. Each Cloud TPU v4 chip delivers ~2.2x more peak FLOPs than Cloud TPU v3, for ~1.4x more peak FLOPs per dollar. Cloud TPU v4 also achieves exceptionally high utilization of these FLOPs when training ML models at scales of up to thousands of chips.
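As a rough sanity check on the 9-exaflop aggregate figure, one can work backwards from the pod size. The per-chip peak throughput below (~275 bf16 TFLOPs) is an assumption not stated in this article, so treat the result as a back-of-the-envelope sketch:

```python
# Back-of-the-envelope check of the quoted aggregate peak performance.
# The per-chip figure is an assumption, not from the article.
PEAK_TFLOPS_PER_CHIP = 275    # assumed peak bf16 TFLOPs per TPU v4 chip
CHIPS_PER_POD = 4096          # pod size, from the article

pod_peak_exaflops = PEAK_TFLOPS_PER_CHIP * CHIPS_PER_POD / 1e6
print(f"Peak per pod: ~{pod_peak_exaflops:.2f} exaflops")

# Number of pods consistent with the quoted 9-exaflop aggregate figure.
pods_implied = 9 / pod_peak_exaflops
print(f"Pods implied by 9 exaflops: ~{pods_implied:.1f}")
```

Under that assumption, one pod peaks at just over one exaflop, which would put the 9-exaflop cluster at roughly eight pods.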
“While many quote peak FLOPs as the basis for comparing systems,” says the company, “it is actually sustained FLOPs at scale that determines model training efficiency, and Cloud TPU v4’s high FLOPs utilization (significantly better than other systems due to high network bandwidth and compiler optimizations) helps yield shorter training time and better cost efficiency.”
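The distinction between peak and sustained FLOPs is usually expressed as a utilization ratio (often called model FLOPs utilization, or MFU). The numbers in this sketch are illustrative, not measurements from the article:

```python
def flops_utilization(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of a chip's peak throughput actually sustained in training."""
    return achieved_tflops / peak_tflops

# Illustrative only: a chip with 275 peak TFLOPs sustaining 150 TFLOPs.
mfu = flops_utilization(achieved_tflops=150.0, peak_tflops=275.0)
print(f"Utilization: {mfu:.0%}")

# Two systems with identical peak FLOPs but different utilization differ in
# effective training throughput (and cost per training run) by the ratio of
# their utilizations -- which is the company's point about sustained FLOPs.
```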
Cloud TPU v4 Pod slices are available in configurations ranging from four chips (one TPU VM) to thousands of chips. While slices of previous-generation TPUs smaller than a full Pod lacked torus links (“wraparound connections”), all Cloud TPU v4 Pod slices of at least 64 chips have torus links on all three dimensions, providing higher bandwidth for collective communication operations.
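In practice, a slice is requested by its accelerator type when creating a TPU VM. A minimal sketch using the gcloud CLI; the zone, runtime version, and accelerator-type name here are assumptions and should be checked against current Cloud TPU documentation:

```shell
# Sketch only: request a Cloud TPU v4 slice as a TPU VM.
# The zone, --version value, and v4-32 type below are illustrative assumptions.
gcloud compute tpus tpu-vm create my-v4-slice \
  --zone=us-central2-b \
  --accelerator-type=v4-32 \
  --version=tpu-vm-v4-base
```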
Cloud TPU v4 also allows a full 32 GiB of memory to be accessed from a single device, up from 16 GiB in TPU v3, and offers two-times-faster embedding acceleration, improving performance when training large-scale recommendation models. Access to Cloud TPU v4 Pods is available in evaluation (on-demand), preemptible, and committed use discount (CUD) options.