Israeli Researcher Uncovers Critical Software Infrastructure Flaws Using AI — on an $80 Budget

Next week, Simcha Kosman, a senior researcher at CyberArk Labs, will present a new study at Black Hat Europe in London, one of the world's most prestigious cybersecurity conferences, where only a small fraction of submissions are accepted. His research demonstrates how artificial intelligence can be leveraged to detect security flaws in widely used software systems at a fraction of the traditional cost and time, effectively rivaling the capabilities of industry giants like Google and OpenAI.

Kosman and his team set out to answer a deceptively simple question: can AI be used to uncover real vulnerabilities in massive software projects, such as the Linux kernel, Redis and FFmpeg, without huge budgets or large teams? Their findings point to an unequivocal yes. In just two days, and for less than $80 in total compute costs, their tool led to the discovery of dozens of vulnerabilities. Nine of them have already been assigned official CVE identifiers, spanning major projects including the Linux kernel, FFmpeg, Redis, RetroArch, Libretro, Bullet3 and Linenoise.

At the heart of the study is a new open-source tool called Vulnhalla. The system combines CodeQL, GitHub's industry-standard static analysis engine, with an AI model designed to dramatically reduce noise. On large repositories, CodeQL alone can generate tens of thousands of alerts, the vast majority of them false positives. Vulnhalla tackles this bottleneck directly: it analyzes CodeQL's findings, extracts relevant code context for each alert, and uses the AI model to determine which findings have genuine exploit potential.
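
The article does not reproduce Vulnhalla's actual code, but the flow it describes (run CodeQL, parse its alerts, attach surrounding source, let a model filter) can be sketched. In the Python sketch below, the parsing follows CodeQL's standard SARIF output format; `ask_model`, `CONTEXT_LINES` and the helper names are illustrative placeholders, not Vulnhalla's real API.

```python
import json
from pathlib import Path

CONTEXT_LINES = 20  # how much surrounding code to show the model (illustrative)

def load_alerts(sarif_path):
    """Yield (rule_id, file, line, message) for each CodeQL SARIF result."""
    sarif = json.loads(Path(sarif_path).read_text())
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations")
            if not locations:
                continue
            phys = locations[0]["physicalLocation"]
            yield (
                result.get("ruleId", "unknown"),
                phys["artifactLocation"]["uri"],
                phys["region"]["startLine"],
                result["message"]["text"],
            )

def extract_context(repo_root, rel_path, line):
    """Return the source lines surrounding the flagged line."""
    source = (Path(repo_root) / rel_path).read_text(errors="replace").splitlines()
    lo = max(0, line - 1 - CONTEXT_LINES)
    hi = min(len(source), line + CONTEXT_LINES)
    return "\n".join(source[lo:hi])

def triage(repo_root, sarif_path, ask_model):
    """Keep only alerts the model judges to have genuine exploit potential.

    ask_model(rule_id, message, snippet) -> bool is a stand-in for whatever
    LLM call the real tool makes; it is not part of Vulnhalla's actual API.
    """
    kept = []
    for rule, path, line, msg in load_alerts(sarif_path):
        snippet = extract_context(repo_root, path, line)
        if ask_model(rule, msg, snippet):
            kept.append((rule, path, line, msg))
    return kept
```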

Crucially, the researchers don’t simply ask the model broad questions like “Is this a vulnerability?” Instead, they guide it through a structured sequence of prompts that mirror the reasoning of an experienced security analyst: Where is the buffer defined? What is its size? Does it change? What is the target size? Is there a data flow that could lead to a memory-boundary violation? This step-by-step, logic-driven approach forces the model to perform genuine reasoning rather than relying on superficial pattern recognition. According to the study, this methodology reduces false positives by more than 90% for several vulnerability classes, and in some cases by up to 96%.
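
The study's actual prompts are not quoted in full, but the step-by-step interrogation it describes resembles a prompt chain. Here is a minimal sketch, assuming a generic `ask_model(prompt) -> str` completion call; the checklist wording and function names are invented for illustration, not Vulnhalla's prompts.

```python
# Hypothetical checklist mirroring the analyst-style questions described in
# the article; the exact wording and staging are assumptions.
BUFFER_CHECKLIST = [
    "Where is the buffer defined? Quote the declaration.",
    "What is its size? Is that size constant or computed at runtime?",
    "Does the buffer or its size change before the flagged operation?",
    "What is the size of the data being written (the target size)?",
    "Is there a data flow that could lead to a memory-boundary violation? "
    "Answer yes or no, and justify from the code.",
]

def structured_triage(snippet, ask_model):
    """Walk the model through one question at a time, feeding each answer
    back as context, then require a verdict grounded in that transcript."""
    transcript = f"Code under review:\n{snippet}\n"
    for question in BUFFER_CHECKLIST:
        answer = ask_model(f"{transcript}\n{question}")
        transcript += f"\nQ: {question}\nA: {answer}"
    return ask_model(
        f"{transcript}\n\nBased only on the answers above, is this a genuine, "
        "reachable memory-boundary violation? Reply VULNERABLE or FALSE_POSITIVE."
    )
```

The point of the chain, per the study, is that each answer constrains the next, so the final verdict must be consistent with concrete facts established about the code rather than with surface patterns.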

The result positions Vulnhalla as a compelling alternative to more advanced proprietary systems such as Google's Big Sleep and OpenAI's Aardvark. It delivers comparable vulnerability-detection performance while remaining fully open, transparent and community-driven. For development and security teams struggling under the weight of soaring alert volumes, this hybrid approach offers a way to focus resources on a far smaller set of findings with real-world impact.

As Kosman notes, the research marks another step toward using AI not just to detect weaknesses faster, but to help close widening security gaps in the software we all rely on every day.