By Yohai Schweiger
Israeli startup doubleAI, founded by CEO Prof. Amnon Shashua and CTO Prof. Shai Shalev-Shwartz, announced that its AI system, WarpSpeed, has successfully rewritten and re-optimized CUDA kernels in NVIDIA’s cuGraph library — part of the RAPIDS software ecosystem for GPU-accelerated data science. The rewritten kernels achieve an average 3.6× speedup over versions that NVIDIA’s CUDA engineers have refined over the past decade.
According to the company, every tested kernel showed some degree of improvement, with more than half delivering speedups of over 2×. The optimized code has been published on GitHub, allowing users to deploy the accelerated version without modifying existing application code. The announcement was accompanied by a public post from Shashua on X and a detailed technical blog post.
cuGraph is a core component of NVIDIA’s RAPIDS suite and is widely regarded as one of the leading GPU libraries for graph analytics — a critical domain for network analysis, recommendation engines, cybersecurity, bioinformatics, and financial systems. Its kernels were developed over years by engineers specializing in hardware-level performance optimization, where decisions about memory layout, thread scheduling, warp structure, and cache behavior can dramatically affect results.
Unlike conventional application development, GPU performance engineering operates in a deeply contextual decision space with no single “correct” solution — only delicate trade-offs among competing physical and computational constraints.
Were LLM “Gold Medals” Misleading?
Beyond the engineering achievement itself, Shashua frames the milestone as part of a broader debate over the limits of modern AI — particularly whether large language models, scaled through massive training, can truly tackle deep, complex problems where data is limited, validation is difficult, and reasoning chains are long and context-dependent.
In his post, Shashua notes that AI systems have recently “won gold medals at the IMO” and “outperformed top programmers on CodeForces,” but argues that these victories rely on unusually favorable conditions. He describes what he calls “three hidden crutches: abundant training data, trivial verification, and short reasoning chains.”
“When all three are present,” he writes, “today’s AI excels. Remove even one — and it collapses.”
GPU performance engineering, he argues, is a stress test where none of those conditions hold. “Data is scarce. Correctness is hard to verify. And performance emerges from a long chain of interdependent decisions — memory layout, warp behavior, caching, scheduling, graph structure.” In such environments, there is no synthetic benchmark with a clear answer, but rather a vast and tightly coupled search space where each design choice influences many others.
Shashua further claims that even advanced coding agents struggle in this domain. “Even sophisticated agents like Claude Code, Codex, and Gemini CLI fail dramatically here,” he writes, “often producing incorrect implementations even when provided with cuGraph’s full test suite.” According to him, “scaling alone cannot break this barrier,” and new algorithmic ideas were required to address this level of complexity.
AEI Instead of AGI
Founded in late 2023, doubleAI has raised hundreds of millions of dollars at a valuation reportedly approaching $1 billion. The company focuses on building AI systems tailored to solving particularly complex engineering and scientific problems, where — it claims — expert-level or superhuman performance can be achieved through deep algorithmic search rather than brute-force scaling of language models.
doubleAI positions the current achievement as part of a broader vision it calls Artificial Expert Intelligence (AEI): systems that consistently outperform human experts in narrow but critical domains where expertise is scarce and expensive. Rather than pursuing generalized AGI, the company concentrates on solving deep optimization problems, combining learning from limited data, probabilistic validation methodologies, and agentic search structures that navigate complex decision spaces.
The approach resembles an advanced algorithmic search system more than a conventional one-shot language model — and, if the performance gains hold up under community scrutiny, may signal a shift in how AI tackles some of computing’s most demanding low-level challenges.