University of Florida Researchers Achieve 99% Success Rate Jailbreaking AI Safety Guardrails
A new technique called "nullspace steering" bypasses AI safety filters with unprecedented efficiency—99% success in some tests, averaging just two attempts. The research exposes a fundamental gap: most AI safety testing only probes from the outside, while the internal decision pathways remain dangerously vulnerable.
The safety mechanisms protecting today's most advanced AI systems are failing under scrutiny, and a team at the University of Florida has the data to prove it. In a recent research paper titled "Jailbreaking the Matrix," Professor Sumit Kumar Jha and his colleagues developed a method that bypasses AI safety guardrails with success rates as high as 99 percent—requiring an average of just two attempts to break through defenses that companies like OpenAI and Anthropic have spent years building.
The technique, called Head Masked Nullspace Steering (HMNS), doesn't rely on clever prompts or social engineering tricks. Instead, it opens the hood on large language models and manipulates their internal wiring directly. Across four major industry benchmarks, including AdvBench and HarmBench, HMNS outperformed every existing jailbreak method while using less computational power. The implications are stark: the gap between the AI safety we expect and the safety we actually have is wider than the industry has publicly acknowledged.
"One cannot just test something like that using prompts from the outside and say, it's fine," Jha told TechXplore. His point cuts to the heart of the problem. Most AI safety testing treats these systems like black boxes—researchers poke at them with carefully crafted inputs and observe what comes out. But this approach, Jha argues, is like checking a car's safety by tapping on the windshield. You might learn something, but you'll never understand how the engine behaves under stress.
What makes HMNS different is its focus on "decision pathways"—the internal computational routes that determine how an AI arrives at its answers. Inside a large language model, information flows through thousands of small units called attention heads. Some heads contribute heavily to a given decision; others barely register. By identifying which heads matter most for a particular prompt, the researchers can trace exactly which internal pathways are responsible for the model's behavior. Then they silence those pathways and inject a carefully calibrated signal that pushes the model toward a different answer—one the muted heads cannot counteract.
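The core move is easier to see in a toy example. The short Python sketch below is illustrative only: random vectors stand in for the per-head contributions, and a simple size-based score stands in for whatever attribution measure the researchers actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one transformer layer: 12 attention heads, hidden size 64.
# Each row is the vector a head adds to the residual stream for the current prompt.
n_heads, d_model, k = 12, 64, 3
head_outputs = rng.normal(size=(n_heads, d_model))
head_outputs[:3] *= 5.0  # make a few heads dominate, standing in for "heads that matter most"

# Attribution: heads that contribute heavily to the decision get high scores.
scores = np.linalg.norm(head_outputs, axis=1)

# Silence the k most influential heads by zeroing their contributions.
masked = np.argsort(scores)[-k:]
print("heads selected for masking:", sorted(masked.tolist()))

silenced = head_outputs.copy()
silenced[masked] = 0.0
layer_output = silenced.sum(axis=0)  # what the layer now adds, with the dominant heads muted
```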
The analogy the researchers use is a choir. Some voices carry the melody, others provide harmony. If you want to change the sound, you silence the melody singers and introduce a new voice singing in a direction the original cannot cancel out. That's the "nullspace"—the set of directions the silenced heads cannot influence. It's elegant in theory and devastatingly effective in practice.
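In linear-algebra terms, that nullspace is the orthogonal complement of the subspace the silenced heads can write into. Here is a minimal sketch of the projection, assuming, as a stand-in rather than the paper's actual construction, that each masked head writes through a small random output matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, k = 64, 8, 3

# Stand-ins for the output-projection matrices of the k silenced heads:
# together, their columns span the subspace those heads can write into.
W_out = [rng.normal(size=(d_model, d_head)) for _ in range(k)]
span = np.concatenate(W_out, axis=1)  # shape (d_model, k * d_head)

# Orthonormal basis Q for that subspace, via QR decomposition.
Q, _ = np.linalg.qr(span)

# Projector onto the nullspace (the orthogonal complement): P = I - Q Q^T.
P_null = np.eye(d_model) - Q @ Q.T

# A raw steering vector, projected into the nullspace, lies entirely in directions
# the silenced heads cannot write into -- and therefore cannot cancel out.
raw_steer = rng.normal(size=d_model)
steer = P_null @ raw_steer

# Sanity check: the silenced heads' subspace sees (numerically) nothing of the signal.
print(np.abs(Q.T @ steer).max())  # on the order of 1e-15
```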
The efficiency of HMNS is what should alarm AI developers. This isn't a brute-force attack requiring massive computational resources or hundreds of attempts. It's surgical, adaptive, and repeatable. As the model generates new text, the internal pathways shift, so HMNS adjusts in real time—probing, silencing, steering, observing. This closed-loop approach makes it far more resilient than static prompt-based jailbreaks that rely on fixed tricks.
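The loop itself can be pictured schematically. The sketch below is purely illustrative: a toy "model" of random head contributions and a fixed target direction stand in for the internals a real attack would read and steer at each generation step.

```python
import numpy as np

rng = np.random.default_rng(2)
n_heads, d_model, k, steps = 12, 64, 3, 5

def toy_head_outputs(state):
    # Stand-in for running the model one step and reading each head's contribution.
    return rng.normal(size=(n_heads, d_model)) + 0.1 * state

def nullspace_projector(masked_rows):
    # Projector onto the orthogonal complement of the masked heads' (toy) contributions.
    Q, _ = np.linalg.qr(masked_rows.T)
    return np.eye(d_model) - Q @ Q.T

state = rng.normal(size=d_model)
target = rng.normal(size=d_model)  # the direction the attack wants to push the model toward

for step in range(steps):
    heads = toy_head_outputs(state)              # probe: read the internal pathways at this step
    scores = np.linalg.norm(heads, axis=1)       # which heads dominate right now?
    masked = np.argsort(scores)[-k:]
    P_null = nullspace_projector(heads[masked])  # directions the dominant heads cannot touch
    heads[masked] = 0.0                          # silence them
    state = heads.sum(axis=0) + P_null @ target  # steer, then observe the new state and repeat
    cos = state @ target / (np.linalg.norm(state) * np.linalg.norm(target))
    print(f"step {step}: cosine to target = {cos:.2f}")
```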
The timing of this research matters. As Wikipedia's entry on AI safety notes, the field gained significant momentum in 2023 with rapid progress in generative AI and growing public concern from researchers and CEOs about potential dangers. During the 2023 AI Safety Summit, both the United States and the United Kingdom established AI Safety Institutes. Yet researchers have expressed concern that safety measures are not keeping pace with capability development. Jha's work provides hard evidence for that anxiety.
The history of AI safety is littered with warnings that went unheeded. Norbert Wiener cautioned in 1949 that "every degree of independence we give the machine is a degree of possible defiance of our wishes." Nick Bostrom's 2014 book "Superintelligence" prompted Elon Musk, Bill Gates, and Stephen Hawking to voice existential concerns. In 2015, over 8,000 people—including Yann LeCun, Yoshua Bengio, and Stuart Russell—signed an open letter calling for research on AI's societal impacts. The Center for Human-Compatible AI was founded that same year. The Future of Life Institute awarded $6.5 million in safety grants. The 2017 Asilomar Conference produced principles including "Race Avoidance: Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards."
Yet here we are in 2025, and the guardrails are still breaking. The problem, as Jha's research reveals, is that safety alignment sits on top of a vast underlying system never designed with safety in mind. Modern language models learn their capabilities first; refusal behavior is layered on afterward through alignment fine-tuning and output filters. But when the model is pushed in the right way, when the internal machinery is accessed directly, those layers can be overpowered. The defenses may look solid from the outside, but without understanding the internal mechanics, developers cannot know whether protections will hold under pressure.
The findings are both a gift and a warning to the AI industry. On one hand, HMNS provides a powerful diagnostic tool. It reveals exactly which internal pathways are vulnerable and which defenses are easily bypassed. This information can guide stronger training methods, better monitoring systems, and more robust guardrails. On the other hand, the fact that state-of-the-art safety measures can be broken so reliably exposes a fundamental weakness in how these systems are built and tested.
The stakes are no longer theoretical. AI systems are deployed in hospitals summarizing medical notes, in banks processing loan applications, in customer service platforms handling sensitive data. They influence decisions that affect people's lives. If their safety mechanisms can be bypassed with internal manipulations—and if those manipulations require only modest technical sophistication—then the current deployment model is reckless.
There's also a darker dimension to this research. The techniques described in the paper are neutral tools. If internal pathways can be manipulated to bypass safety filters and produce harmful outputs, the same methods could be used by authoritarian governments to enforce censorship—steering models away from politically sensitive topics rather than toward dangerous ones. The dual-use nature of this technology is precisely why the researchers included an ethics statement and why the broader AI community must grapple not only with building safer systems but also with preventing powerful internal tools from being weaponized.
Surveys of AI researchers suggest the community takes high-consequence risks seriously. In two separate surveys, the median respondent placed a 5 percent probability on an "extremely bad (e.g., human extinction)" outcome from advanced AI. A 2022 survey of the natural language processing community found that 37 percent agreed or weakly agreed it's plausible AI decisions could lead to a catastrophe "at least as bad as an all-out nuclear war." These are not fringe views. They reflect genuine uncertainty about whether the safety measures being deployed today will scale to the systems being built tomorrow.
The path forward requires more than incremental improvements. Developers need tools for inspecting and understanding the internal structure of their models—not just monitoring outputs but auditing decision pathways. They need training methods that reinforce safety at the level of internal computation, not just surface-level behavior. And they need evaluation protocols that incorporate mechanistic stress tests, the kind of deep probing that HMNS represents.
Jha and his team have done the AI industry a service by exposing these vulnerabilities before they're exploited in the wild. But the clock is ticking. As AI systems become more capable and more integrated into critical infrastructure, the cost of failure grows exponentially. The question is no longer whether AI safety measures can be broken—Jha's research answers that definitively. The question is whether the industry will treat this wake-up call with the urgency it deserves, or whether we'll continue tapping on the windshield and calling it safety.