Artificial intelligence is not magic. It’s sophisticated software and computation.
While the hype around AGI and superintelligence often overshadows reality, it’s essential to focus on the tangible progress of today’s technology rather than being distracted by sensationalized narratives.
That’s why one of the standout sessions at Nvidia’s GTC conference on March 18, 2026 was a candid and grounded conversation between Nvidia Chief Scientist Bill Dally and Google DeepMind Chief Scientist Jeff Dean.
What stood out most was their shared urgency about the near-term explosion of demand for agentic AI and the need for significantly better hardware to deliver faster, more capable inference. Inference is the process by which a trained AI model generates answers.
Their discussion covered novel memory technologies for AI chips, the advantages and disadvantages of Nvidia GPUs versus Google TPUs, and innovative new chip architectures.
A pivotal moment came when Dally declared: “I think it’s not just that inference is starting to become important. Inference is THE job now. It’s easily 90% of the power in data centers is going into inference.”
Both figures are industry legends. Dally has spearheaded Nvidia’s hardware and AI innovations through cutting-edge research, while Dean has built much of the early foundational software backbone for Google, co-led Gemini’s tech development, and helped create the TPU.
I loved the dynamic between these two computer scientists who are deeply optimistic about the future of AI. The dialogue was thoughtful and inspiring, sharing concrete insights for building the next generation of AI chips without a hint of competitive posturing.
Here are 12 additional key highlights from the session.
Dean on Agents: AI models are “way better at problems with verifiable rewards for things like math and coding. Particularly in the last year, we’ve accelerated how effectively these models can do mathematics. [recently] we’ve started to see agent-based workflows work extremely well for longer running kinds of things ... now you can actually give these models tasks that take hours or perhaps even days, and they will go off and independently do a bunch of things, correct themselves, do some more things. And I think that’s a pretty exciting transition.”
Dean on the importance of improving agentic inference: “we’re going to have a lot more of these agents kind of operating in the background. And that one of the things that’s going to be important is how do we get ultra low latency inference so that these systems can do their work autonomously faster because I think that’s a rapid driver of how effectively they can solve problems.”
Dally: “I can see us running relatively big models at 10,000 to 20,000 tokens per second per user.”
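To put Dally’s figure in perspective, here is a back-of-envelope calculation of the per-token time budget it implies. The ~100 tokens/second baseline for today’s interactive chat serving is an illustrative assumption, not a number from the talk.

```python
# Back-of-envelope: what 10,000-20,000 tokens/sec per user implies for
# per-token latency. The 100 tok/s "today" baseline is an assumption
# for comparison, not a figure quoted in the session.

def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time budget per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second

for rate in (100, 10_000, 20_000):
    print(f"{rate:>6} tok/s -> {per_token_latency_ms(rate):.2f} ms/token")
```

At 10,000 to 20,000 tokens per second, the system has only 0.05 to 0.1 milliseconds per token, roughly a hundredfold tighter budget than the assumed baseline, which is why both speakers tie agentic workloads to new memory and chip architectures.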
Dally and Dean on Google TPUs versus Nvidia GPUs. GPUs are good for ...
https://taekim.substack.com/p/nvidias-bill-dally-and-googles-jeff

