Company built new robots to help with massive lift-and-shift
Meta deployed a 129,000 Nvidia H100 GPU data center cluster across five adjacent data centers.
The company’s VP of engineering for its infrastructure division explained that the company moved thousands of racks to develop its then-largest supercomputer, emptying out live facilities.
“It’s extremely expensive to take down data centers that are in production, because these are expensive investments, and you want them to keep running and doing useful work,” Yee Jiun Song said at the AI Infra Summit, in comments first reported by Data Center Richness.
“In our case, these data centers were serving live workloads, and we have to take them down as quickly as possible, while not causing any user-perceived outages.”
He added: “In order to do this quickly, we had to redesign the loading docks in our data centers, build brand new robots to move these 1,000-pound racks, and even design crateless packaging for these racks to speed up the moves.
“We quadrupled the amount of networking in these buildings, which involved pulling out and replacing hundreds of meters of network fiber, and digging new trenches to connect the five buildings together. And we did all of this in a matter of months.”
Song said that the company made the decision because the existing sites were “able to produce the power we needed to build a large cluster.”
The company did not disclose the location of the AI cluster, but it has since deployed larger ones and even used tents to help it deploy at speed.
“Having built out infrastructure for years, we thought we’d learn everything that was to learn about scale,” Song said. “But honestly, AI is kicking our butts and teaching us that we know nothing.”
Meta plans to invest “hundreds of billions of dollars into compute,” with a 1GW cluster called Prometheus coming next year, and a 5GW Hyperion planned over the remaining decade.
https://www.datacenterdynamics.com/en/news/meta-emptied-five-live-data-centers-to-make-one-large-ai-cluster/

