Meta has completed the first phase of a new AI supercomputer. Once the AI Research SuperCluster (RSC) is fully built out later this year, the company believes it will be the fastest AI supercomputer on the planet, capable of “performing at nearly 5 exaflops of mixed-precision compute.”
The company says RSC will help researchers develop better AI models that can learn from trillions of examples. Among other things, the models will be able to build better augmented reality tools and “seamlessly analyze text, images and video together,” according to Meta. Much of this work is in service of its vision for the metaverse, in which it says AI-powered apps and products will have a key role.
“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” technical program manager Kevin Lee and software engineer Shubho Sengupta wrote in a blog post.
RSC currently has 760 Nvidia DGX A100 systems with a total of 6,080 GPUs. Meta believes the current iteration is already among the fastest AI supercomputers on the planet. Based on early benchmarks, it claims that, compared with the company’s older setup, RSC runs computer vision workflows up to 20 times faster and the NVIDIA Collective Communication Library (NCCL) more than nine times faster.
Meta says RSC can train large-scale natural language processing models three times faster as well. As such, AI models that determine whether “an action, sound or image is harmful or benign” (for example, to root out hate speech) can be trained more quickly. According to the company, that research will help protect people on current services like Facebook and Instagram, as well as in the metaverse.
Along with creating the physical infrastructure and systems to run RSC, Meta said it needed to ensure there were security and privacy controls in place to protect the real-world training data it uses. It says that by using real-world data from its production systems, instead of publicly available data sets, it can more effectively put its research to use by, for instance, identifying harmful content.
This year, Meta plans to increase the number of GPUs in RSC to 16,000. It says that will boost AI training performance by more than 2.5 times. The company, which started working on the project in early 2020, wanted RSC to be able to train AI models on data sets up to an exabyte in size (the equivalent of 36,000 years’ worth of high-quality video).
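The figures above can be sanity-checked with some quick arithmetic. The Python sketch below is only a back-of-the-envelope check, not Meta's own calculation. It assumes eight GPUs per DGX A100 system, a decimal exabyte (10^18 bytes) and a 365-day year, and all variable names are illustrative. Note that a ratio of GPU counts is not the same thing as the training-performance gain Meta quotes.

```python
# Back-of-the-envelope checks on figures quoted in the article.
# Assumptions (not from Meta): 8 GPUs per DGX A100 system,
# a decimal exabyte (10**18 bytes), and a 365-day year.

GPUS_PER_DGX_A100 = 8
EXABYTE_BYTES = 10**18
SECONDS_PER_YEAR = 365 * 24 * 3600

# GPU counts: current phase vs. the planned full build-out.
dgx_systems = 760
current_gpus = dgx_systems * GPUS_PER_DGX_A100
planned_gpus = 16_000
gpu_ratio = planned_gpus / current_gpus

# Average video bitrate implied by "1 exabyte = 36,000 years of video".
video_years = 36_000
video_seconds = video_years * SECONDS_PER_YEAR
implied_mbps = EXABYTE_BYTES * 8 / video_seconds / 1e6

print(f"Current GPUs: {current_gpus}")
print(f"Planned/current GPU ratio: {gpu_ratio:.2f}")
print(f"Implied video bitrate: {implied_mbps:.2f} Mbit/s")
```

Under those assumptions, the system count matches the 6,080 GPUs Meta reports, the planned build-out has roughly 2.6 times as many GPUs as the current one (in line with the "more than 2.5 times" claim), and the exabyte comparison implies video at roughly 7 megabits per second.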
“We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services but also to enable completely new user experiences, especially in the metaverse,” Lee and Sengupta wrote.
Other exascale systems are being built in the US. The delayed Aurora supercomputer at the Department of Energy’s Argonne National Laboratory is expected to hit 2 exaflops, while the El Capitan supercomputer, which will manage the country’s nuclear stockpile, is expected to top 2 exaflops when it arrives next year.