Rubin Pushes the GPU Off Its Pedestal
January 6, 2026
Instead of a single accelerator at the center, Nvidia reshuffles the roles of GPU, CPU, memory and networking to build an infrastructure designed for a world of large-scale inference and multi-agent AI
Above: The full Rubin platform. Source: Nvidia
At CES 2026 in Las Vegas, Nvidia unveiled Rubin, a platform it describes as “the next generation of AI infrastructure.” Rubin’s existence, and the fact that it is scheduled to reach the market in the second half of 2026, were already known. What this announcement revealed for the first time, however, was the idea behind it: not another generation of GPUs, but a deep conceptual shift in how large-scale AI systems are built and operated. Instead of a massive GPU at the center supported by surrounding components, Nvidia presented a complete architecture that functions as a single system, tightly integrating compute, memory, networking and security.
The recurring message is that Rubin is not a chip but a full rack-scale computing system, designed for a world in which AI is no longer a one-off chatbot but a constellation of agents operating over time, maintaining context, sharing memory and reasoning within a changing environment. In that sense, Rubin marks Nvidia’s transition from selling raw compute power to selling what is effectively a cognitive infrastructure.
Codesign as a principle, not a slogan
Nvidia has used the term “full stack” for years, but in practice it usually meant a collection of components built around the GPU. With Rubin, the concept of codesign takes on a very different meaning. This is not about tighter integration of existing parts, but about designing every element of the system—CPU, GPU, networking, interconnect, storage and security—together from the outset, as a single unit built to serve entirely new types of workloads.
The practical implication of this approach is that the GPU is no longer the architectural starting point. It remains a powerful and central component, but it is no longer the system’s unquestioned master. Rubin is designed around the assumption that the next AI bottleneck is not raw compute, but context management, persistent memory and orchestration across processes and agents. These are not problems solved by a faster GPU alone, but by redistributing responsibilities across the system.
In Rubin, architectural decisions are driven not by what the GPU needs, but by what the system as a whole must accomplish. This is a turning point for Nvidia, as it effectively moves away from the GPU-first mindset that has defined the company since the early CUDA era, replacing it with a system-level view in which compute is only one layer of a broader architecture.
The role of the CPU, and what it means for the x86 world
One of the most intriguing components in Rubin is the new Vera CPU. Unlike traditional data center CPUs, whose main role has been to host and schedule GPU workloads, Vera is designed from the ground up as an integral part of the inference and reasoning pipeline. It is not a passive host, but an active processor responsible for coordinating agents, managing multi-stage workflows and executing logic that is poorly suited to GPUs.
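The division of labor described here can be sketched in a few lines. The following is an illustrative toy, not Nvidia's software stack: every name in it is hypothetical. It shows a CPU-side loop that owns the branching, queueing and bookkeeping of a multi-agent workflow, while only the heavy inference step is handed off to an accelerator.

```python
# Illustrative sketch (not Nvidia's API): a host-side orchestration loop that
# coordinates a multi-stage agent workflow and dispatches only the heavy
# inference calls to an accelerator. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class AgentTask:
    agent_id: str
    prompt: str
    history: list = field(default_factory=list)

def gpu_inference(prompt: str) -> str:
    """Stand-in for a batched accelerator call."""
    return f"response:{prompt[:20]}"

def route(task: AgentTask, result: str) -> list:
    """CPU-resident branching logic: decide which follow-up tasks to spawn.
    Irregular control flow like this maps poorly onto a GPU."""
    task.history.append(result)
    if "delegate" in result:
        return [AgentTask(f"{task.agent_id}/sub", result)]
    return []

def orchestrate(tasks: list) -> dict:
    """Run tasks to completion, interleaving CPU logic with GPU calls."""
    transcripts = {}
    queue = list(tasks)
    while queue:
        task = queue.pop(0)
        result = gpu_inference(task.prompt)   # offloaded compute
        queue.extend(route(task, result))     # CPU-resident logic
        transcripts[task.agent_id] = task.history
    return transcripts

print(orchestrate([AgentTask("planner", "plan the report")]))
```

The point of the sketch is the shape of the loop: the scheduling, routing and state-keeping live on the CPU as first-class work, not as glue code around a single GPU call.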
In doing so, Nvidia signals a profound shift in how it views the CPU in the AI era. Where the CPU was once largely a bottleneck on the path to the GPU, it now reemerges as a meaningful compute element—one that operates in symbiosis with the GPU rather than beneath it. The choice of an Arm-based architecture, and the fact that the CPU was designed alongside the GPU and networking rather than as a standalone component, point to Nvidia’s ambition to control the orchestration and control layer, not just the compute layer.
More broadly, the decision to use Arm reflects the need for flexibility and deep control over CPU design. Unlike off-the-shelf general-purpose processors built to handle a wide variety of workloads, an Arm license allows Nvidia to tailor a processor precisely to the needs of modern AI systems, stripping away logic that is irrelevant to inference and agent orchestration. The implication is that the classic data center model—built around general-purpose x86 CPUs as the default foundation—is no longer a given for systems designed as AI-first from the ground up.
Memory, storage and the birth of a context layer
Perhaps the most significant architectural shift in Rubin lies in how inference context memory is handled. Nvidia introduced a new approach to managing the context memory of large models, particularly the KV cache generated during multi-step inference. In classical architectures, designed for short and isolated workloads, this memory had to reside in GPU HBM to maintain performance, making it expensive, scarce and ill-suited for long-running, multi-agent systems.
Rubin breaks this assumption by moving a substantial portion of context memory out of the GPU and into a dedicated layer that behaves like memory rather than traditional storage. This is also where the role of BlueField-4—the DPU derived from Mellanox networking technology—changes fundamentally. It no longer serves merely as an infrastructure offload engine, but becomes an active participant in managing context memory and coordinating access to it as part of the inference pipeline itself.
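The principle behind this offload can be illustrated with a minimal two-tier cache. This is a sketch of the general technique, not the Rubin implementation: a small "HBM" tier holds the hot entries, and least-recently-used entries are evicted into a larger external context tier instead of being discarded. The class name and capacities are invented for illustration.

```python
# A minimal sketch of tiered KV-cache management, not Nvidia's implementation.
# Hot entries stay in a scarce fast tier; cold ones spill to a larger context
# tier and can be pulled back on demand rather than being recomputed.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity: int):
        self.hbm = OrderedDict()   # scarce, fast on-package memory (LRU order)
        self.context_tier = {}     # abundant external context layer
        self.hbm_capacity = hbm_capacity

    def put(self, key, value):
        self.hbm[key] = value
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_capacity:
            cold_key, cold_val = self.hbm.popitem(last=False)
            self.context_tier[cold_key] = cold_val   # offload, don't discard

    def get(self, key):
        if key in self.hbm:
            self.hbm.move_to_end(key)   # keep hot entries resident
            return self.hbm[key]
        value = self.context_tier.pop(key)   # fetch back on demand
        self.put(key, value)
        return value

cache = TieredKVCache(hbm_capacity=2)
for turn in range(4):
    cache.put(f"turn{turn}", f"kv{turn}")
print(cache.get("turn0"))   # survived eviction via the context tier: "kv0"
```

The economic argument in the paragraph above falls out of the sketch: context that would previously have pinned expensive HBM for the lifetime of an agent can instead live in cheaper, larger memory and be paged back only when the inference path actually needs it.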
This shift reflects the gap between architectures built for training or one-off inference, and the needs of agent-based systems that operate continuously, preserve state and share context across components. In Rubin, memory and context management become integral to the inference performance path, not an external I/O layer—an adjustment that aligns closely with how modern AI systems are expected to function.
Connectivity also takes on a new role in Rubin. NVLink continues to serve as the high-speed internal interconnect between GPUs, but the Ethernet layer—embodied by Spectrum-6 and Spectrum-X—assumes a very different function than in traditional data centers. Instead of merely moving data between servers, the network becomes part of how the system manages compute and memory.
In this architecture, connectivity allows GPUs, CPUs and DPUs to access shared context memory, exchange state and operate as if they were part of a single continuous system, even when distributed across multiple servers or racks. Technologies such as RDMA enable direct memory access over the network without CPU involvement, turning the network into an active participant in the inference flow rather than a passive transport layer.
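A loose standard-library analogy can make the access pattern concrete. The following is emphatically not RDMA and not Nvidia's stack; it uses Python's named shared memory to show the shape of the idea: a context region that any attached party reads or writes directly by address, rather than asking a central process to copy bytes on its behalf. RDMA extends this pattern across the network fabric between machines.

```python
# A loose stdlib analogy (not RDMA, not Nvidia's stack): a named shared-memory
# region that any attached party can read or write directly, with no broker
# process copying bytes in between.
from multiprocessing import shared_memory

# One party creates a context region under a discoverable name and writes
# state directly into it.
region = shared_memory.SharedMemory(create=True, size=16)
region.buf[:7] = b"ctx:v42"

# Another party (here a second handle in the same process; in the RDMA case,
# a NIC on another machine) attaches by name and reads the bytes directly.
peer = shared_memory.SharedMemory(name=region.name)
snapshot = bytes(peer.buf[:7])
print(snapshot)   # b'ctx:v42'

peer.close()
region.close()
region.unlink()
```

What RDMA adds beyond this toy is exactly what the paragraph describes: the read happens over the network fabric without involving the remote CPU at all, which is why the network can participate in the inference flow rather than merely transporting its results.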
As a result, data movement, context management and inter-component coordination no longer happen “around” computation—they become part of computation itself. This is a prerequisite for distributed AI systems and long-running agents, where memory and state are as critical as raw compute.
This brings us back to the central theme of Nvidia’s announcement: the shift from training as the center of gravity to continuous, multi-agent inference. Rubin is designed primarily for a world in which most AI costs and business value reside in deployment, not training. In such a world, what matters is not only how fast you can compute, but how effectively you can remember, share and respond.
Rubin is, ultimately, Nvidia’s attempt to redefine the rules of AI infrastructure. No longer a race for TFLOPS alone, but a competition over who controls the entire architecture. If the strategy succeeds, Nvidia will not merely be an accelerator vendor, but a provider of full cognitive infrastructure.
