Diving into the technology behind Google’s AI-era global network

The unprecedented growth and unique challenges of AI applications are driving fundamental architectural changes to Google’s next-generation global network.

The AI era brings an explosive surge in demand for network capacity, with novel traffic patterns characteristic of large-scale model training and inference. Simultaneously, the critical need for unwavering reliability has reached new heights; in an AI-driven world, outages are simply not an option. Furthermore, the requirement for enhanced security and fine-grained control, including data sovereignty considerations, is paramount. Finally, the operational cost and complexity associated with scaling traditional network architectures necessitate a more innovative approach, pushing us beyond basic automation towards true autonomy.

As we discussed in a previous blog post, we are meeting these challenges head-on by building the next generation of Google’s global network upon four key architectural principles: (1) exponential scalability, (2) beyond-9s reliability, (3) intent-driven programmability, and (4) autonomous networking.

In this blog, let’s peel back the layers and see how the underlying technology makes these four principles a reality.

Exponential scalability with a multi-shard network

We embrace elastic horizontal scaling as a core architectural principle for Google’s global network through our multi-shard network. Instead of one monolithic network, we’ve built multiple independent shards. This provides several benefits:

  • Horizontal scaling: When more capacity is needed, we can scale up by growing a shard, and scale out by adding more shards, overcoming the limits and complexity of vertical scale. This is akin to adding more independent networks, rather than trying to make a single network bigger and bigger.
  • Independent planes: The separation of control, data, and management planes within each shard significantly limits the impact radius of any potential issue. A software bug or operational error (such as an incorrect configuration push) in one shard is far less likely to impact others, enhancing the network’s overall stability.
https://storage.googleapis.com/gweb-cloudblog-publish/images/1_-_GGN_Scalability_Multi_Shard.max-2000x2000.png
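To make the scale-up versus scale-out distinction concrete, here is a minimal sketch in Python. It is an illustration only, not Google's implementation: the names Shard, MultiShardNetwork, scale_up, scale_out, and pick_shard, the capacity figures, and the hash-based flow placement are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shard:
    """One independent network shard (hypothetical model)."""
    name: str
    capacity_tbps: float
    healthy: bool = True


@dataclass
class MultiShardNetwork:
    """A set of independent shards rather than one monolithic network."""
    shards: List[Shard] = field(default_factory=list)

    def scale_out(self, name: str, capacity_tbps: float) -> None:
        # Scale out: add a whole new independent shard.
        self.shards.append(Shard(name, capacity_tbps))

    def scale_up(self, name: str, extra_tbps: float) -> None:
        # Scale up: grow the capacity of one existing shard.
        for shard in self.shards:
            if shard.name == name:
                shard.capacity_tbps += extra_tbps
                return
        raise KeyError(name)

    def usable_capacity(self) -> float:
        # Only healthy shards contribute capacity.
        return sum(s.capacity_tbps for s in self.shards if s.healthy)

    def pick_shard(self, flow_id: str) -> Shard:
        # Assumed placement policy: hash the flow onto a healthy shard.
        healthy = [s for s in self.shards if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy shard available")
        digest = hashlib.sha256(flow_id.encode("utf-8")).digest()
        return healthy[int.from_bytes(digest[:8], "big") % len(healthy)]
```

In this toy model, marking one shard unhealthy removes only that shard's capacity and steers flows to the remaining shards, which mirrors the limited impact radius described above. A simple modulo placement also remaps many flows whenever the shard count changes; a real system would likely use something more careful.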

In the AI era, the WAN is the new LAN and the continent is the data center. This horizontal scaling approach, inspired by the design of our massive data center fabrics, allows Google’s global network to handle the unprecedented bandwidth demands of today’s AI workloads. This multi-shard network has been a key enabler for us to accommodate the average 7X WAN traffic growth between 2020 and 2025 and, more importantly, an order of magnitude growth in peak traffic due to the bursty nature of ML traffic over the same period.

Beyond-9s reliability: Architecting for resilience

In a world of always-on services, reliability is paramount. Google’s global network incorporates several key innovations to achieve beyond-9s availability, emphasizing diversity and independence at every layer of the stack to avoid “shared fate” (cascading failures) and minimize impact during failures.

  • Multi-shard isolation: Each network shard has independent data, control, and management planes. We control what can enter and leave these shards to a cluster or edge. This prevents a bad state from a cluster poisoning all the shards at the same time. The sharded architecture inherently provides a degree of isolation. Furthermore, we apply a multi-vendor paradigm when deploying our network shards, thanks to years of development of open APIs and models (discussed later) that allow us to operationalize any vendor platform under the same network function. This multi-vendor approach protects our network shards from vulnerabilities introduced by third-party software or hardware.
  • Region isolation: With this approach, regional cores keep traffic within their domains, and regional gateways enforce policies for traffic that’s entering or leaving. This limits the impact of regional events, effectively shielding the rest of the network.
https://storage.googleapis.com/gweb-cloudblog-publish/images/2_-_Multi_Shard_and_Region_Isolation_1.max-2200x2200.jpg
  • Protective ReRoute: Google’s global network implements a unique transport technique for shortening user-visible outages that complements routing repair, and it marks a radical shift in how we think about network reliability. In the conventional network model, hosts send packets, and routers handle them. With Protective ReRoute, hosts actively shift traffic flows across network paths to improve reliability and performance. They detect network path anomalies and automatically reroute traffic to a healthy alternative path, which can be in the same shard or a different one. The host reroutes traffic in round-trip time scales, i.e., O(RTT), by changing a few bits in the packet header that are used to compute the hash function to select a specific path among many equally viable paths. This host-initiated re-routing protects customer traffic beyond what traditional routing and traffic engineering can achieve, and is independent of the type of network, scale of network, or type of failure, thereby providing robust and deterministic recovery and performance. With Protective ReRoute in our network, we have observed up to a 93% reduction in cumulative outage minutes.
https://storage.googleapis.com/gweb-cloudblog-publish/original_images/3_-_Protective_ReRoute_Single_Shard.gif
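The host-side mechanism can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the production algorithm: it treats the "few bits in the packet header" as a 20-bit flow label (an assumption made for this example), collapses all path selection into a single hash, and uses hypothetical names (ecmp_path, Host, send_with_reroute). Each failed attempt stands in for roughly one RTT of failure detection.

```python
import hashlib
import random
from typing import Set, Tuple

FiveTuple = Tuple[str, str, int, int, str]


def ecmp_path(five_tuple: FiveTuple, flow_label: int, num_paths: int) -> int:
    """Pick one of num_paths equal-cost paths by hashing header fields."""
    key = repr((five_tuple, flow_label)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


class Host:
    """A sender that can repath a flow by changing its flow label."""

    def __init__(self, five_tuple: FiveTuple, num_paths: int, seed: int = 0):
        self.five_tuple = five_tuple
        self.num_paths = num_paths
        self._rng = random.Random(seed)
        self.flow_label = self._rng.getrandbits(20)

    def current_path(self) -> int:
        return ecmp_path(self.five_tuple, self.flow_label, self.num_paths)

    def send(self, failed_paths: Set[int]) -> bool:
        # Delivery succeeds only if the hashed path is not a failed one.
        return self.current_path() not in failed_paths

    def send_with_reroute(self, failed_paths: Set[int],
                          max_attempts: int = 64) -> int:
        """Return how many repaths were needed before delivery succeeded."""
        for attempt in range(max_attempts):
            if self.send(failed_paths):
                return attempt
            # Suspected path failure: draw a new label so the flow
            # hashes onto a (probably) different path.
            self.flow_label = self._rng.getrandbits(20)
        raise RuntimeError("no healthy path found")
```

Note that the host never needs to know which path failed or why. It only observes that the flow is not making progress and changes the bits that feed the hash, which is why the technique does not depend on the type of failure.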

For a conceptual overview of these scalability and resilience innovations, check out this video:

https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault_Dia1gbi.max-1300x1300.jpg

Also, be sure to check out this demo to see the combined value of our multi-shard network and Protective ReRoute in action. Here, we emulate a network shard failure and show how the host promptly detects a path failure and routes the traffic over an alternative path in a different, healthy shard, providing near-instant recovery.

https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault-1_MjCf1N4.max-1300x1300.jpg

Intent-driven programmability for fine-grained network controls

To cater to our customers’ diverse and evolving needs, network agility and fine-grained programmability are crucial. Google’s global network allows for network controls to be precisely tailored to specific business requirements, encompassing regulatory compliance, digital sovereignty mandates, and unique application performance needs, down to the most granular network attributes. This programmability is made possible by:

  • Software-defined networking (SDN) controllers: Google’s global network is fully intent-driven, with SDN everywhere. We use SDN controllers to manage network behavior hierarchically. Orion, our hierarchical and federated SDN control plane platform, propagates top-level intent through layers of network control applications, which then react by updating their internal state and generating intermediate intent for each network switch. This hierarchical propagation results in changes to the programmed flow state in network switches.
  • Universal network model: Our universal network model, Multi-Abstraction-Layer Topology representation, or MALT, allows us to specify generic intent and business policy. Our control and management planes can then use these representations to implement these policies coherently across the network.
  • Standardized API: Because we rely on the OpenConfig software layer, we can use multiple routing vendors interchangeably, making the network more robust. With vendor diversity, a bug or an issue in one vendor’s software or hardware doesn’t impact the whole network, and we have options when scaling our network.
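As a rough illustration of hierarchical intent propagation, the sketch below compiles one top-level intent into intermediate per-region intent and then into per-switch rules. Everything here is hypothetical and chosen for this example: the Intent and FlowRule types, the TOPOLOGY map, and the forward/drop actions. It does not reflect the actual Orion, MALT, or OpenConfig interfaces.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Intent:
    """Top-level business intent: where a traffic class may travel."""
    traffic_class: str
    allowed_regions: FrozenSet[str]


@dataclass(frozen=True)
class FlowRule:
    """Per-switch state produced at the bottom of the hierarchy."""
    switch: str
    traffic_class: str
    action: str  # "forward" or "drop"


# Hypothetical topology model: region name -> switches in that region.
TOPOLOGY: Dict[str, List[str]] = {
    "eu": ["eu-sw1", "eu-sw2"],
    "us": ["us-sw1"],
}


def regional_intents(intent: Intent,
                     topology: Dict[str, List[str]]) -> Dict[str, str]:
    """Top layer: derive one intermediate intent (an action) per region."""
    return {
        region: "forward" if region in intent.allowed_regions else "drop"
        for region in topology
    }


def compile_rules(intent: Intent,
                  topology: Dict[str, List[str]]) -> List[FlowRule]:
    """Lower layer: expand each regional intent into per-switch rules."""
    rules: List[FlowRule] = []
    for region, action in sorted(regional_intents(intent, topology).items()):
        for switch in topology[region]:
            rules.append(FlowRule(switch, intent.traffic_class, action))
    return rules
```

For example, compile_rules(Intent("regulated-data", frozenset({"eu"})), TOPOLOGY) yields forward rules for the two EU switches and a drop rule for the US switch. The operator states the policy once at the top, and each layer derives its own state from it.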

This programmability enables us to implement business policies directly into the network fabric, offering granularity and the ability to isolate bandwidth for critical applications. Customers with specific regulatory requirements can also leverage this programmability to enforce their desired network path controls for their data in motion.

Autonomous networking for the network powering AI

The sheer scale and complexity of our global network demand a shift from traditional automation to a more intelligent, autonomous approach that requires minimal human intervention. This is especially critical to avoid the substantial increase in operational expenses that comes with network growth, and to flatten the cost curves for network planning, design, and operations. Below are some examples where we apply AI/ML techniques today. We see opportunities to expand into many more use cases:

  • Network incident response with a Gemini and Vertex AI agentic framework: We are using an agentic AI approach to shorten outage times by identifying and mitigating failures faster, and to perform more effective root-cause analysis. This is helping us reduce the mean time to detect and mean time to resolve network issues.
  • Demand forecasting and capacity planning: We are using AutoML for accurate demand forecasting, and employing graph optimization to optimize our network capacity planning.
  • Reinforcement learning for routing optimization: We tune routing metrics for specific objectives, such as network performance, with reinforcement learning.

Autonomous networking has allowed us to slash failure mitigation times from hours to minutes, improving our network’s resilience and customer experience. Check out this demo to see an example of our autonomous network in action!

https://storage.googleapis.com/gweb-cloudblog-publish/images/maxresdefault-2_SlQXIoe.max-1300x1300.jpg

Putting it all together

Google’s next-generation global network represents a paradigm shift in network architecture designed to power the AI era, embracing horizontal scalability through multi-sharding, architecting for resilience at every layer with regional isolation and Protective ReRoute, enabling fine-grained programmability with SDN, and adopting autonomous network operation powered by AI/ML. This helps Google’s global network provide the scale, reliability, performance, and security that today’s mission-critical services and AI/ML applications demand. This transformation of Google’s software-defined global backbone not only meets the formidable challenges of the AI era, but empowers our customers to innovate and thrive in this new landscape. Our next-generation network is designed to be the invisible, yet indispensable, force driving the future of technology and connectivity.

This deep dive only scratches the surface, but hopefully provides a glimpse into the innovative technologies that underpin Google’s global network. As we continue to navigate the exciting challenges and opportunities of the AI era, Google’s global network is the bedrock upon which we build and deliver transformative experiences for users and customers worldwide. Stay tuned for more updates as Google’s global network continues to evolve!

https://cloud.google.com/blog/products/networking/google-global-network-technology-deep-dive