5-Second Breakthrough with Just 1 Interaction: Has the "Strongest Security Mechanism" of Claude Fable 5 Been Cracked by a Chinese Team?

Share To

Not prompt injection, not role-playing, nor disguising malicious requests as benign queries. This time, the risk emerges during the agent's autonomous task execution.

Fable 5 is Anthropic’s publicly released Mythos-level model, featuring exceptional general capabilities and incorporating a next-generation Safety Classifier as a defensive layer at the model’s perimeter.

As per official design, when user requests involve high-risk domains such as cybersecurity, biology, chemistry, or model distillation, the system prioritizes risk identification. Depending on the risk level, it either directly rejects the request or switches to the more conservative Opus 4.8 model for processing.

Extensive user testing reveals that previously widespread jailbreaking techniques—including adversarial prompting, role-playing, code obfuscation, and indirect expression—have nearly all failed against this security mechanism, demonstrating its robustness in intent-level risk interception.

However, on the very day of Fable 5’s release, an international research consortium comprising Fudan University, Deakin University, City University of Hong Kong, The University of Melbourne, Singapore Management University, and the University of Illinois Urbana-Champaign announced they had successfully breached Fable 5’s security defenses.

The attack methodology was primarily designed by Yutao Wu, a Ph.D. student at Deakin University. The entire exploit requires only a single conversation and less than 5 seconds to bypass the front-end Safety Classifier and induce the model to generate non-compliant, harmful content.

Flow analysis further confirms that the resulting harmful outputs originated directly from Fable 5 itself—not from the fallback Opus 4.8 model triggered after safety mechanism activation. This indicates the attack not only evaded detection by the Safety Classifier but also fundamentally compromised Fable 5’s security infrastructure.

Notably, renowned hacker Pliny the Liberator has recently disclosed a bypass of Fable 5’s Safety Classifier. However, the approach taken by the Fudan & Deakin team is not a simple combinatorial exploration—it uncovers a fundamental flaw inherent in next-generation super-agent systems.

The team had completed preliminary research and publicly released findings as early as March this year. This study was not tailored exclusively to Fable 5; rather, it investigates the widely adopted “Safety Classifier + Model” defense architecture in next-generation super-agents, directly exposing structural vulnerabilities within such security mechanisms. Hence, its effectiveness became evident immediately after Fable 5’s launch.

Public records indicate the team had already used similar techniques in March to successfully extract system prompts from 37 mainstream large language models and agent systems, achieving open-source validation on Claude Code with 95% accuracy.

According to sources, the lead researcher of this team is Professor Xingjun Ma from the Institute for Trustworthy Embodied Intelligence at Fudan University.

In recent years, his team has conducted systematic research in areas including large models, agents, and embodied intelligence security, achieving a series of internationally leading scientific breakthroughs and winning the U.S. AI Safety Center’s Safety Benchmark Competition.

Currently, the team is actively advancing technology transfer, focusing on agent security and exploring the construction of foundational security infrastructure for next-generation agent systems.

As Professor Ma notes, the significance of this research lies in challenging the prevailing static defense paradigm centered on Safety Classifiers: relying solely on pre-frontal Safety Classifiers is insufficient to fully mitigate potential risky behaviors within advanced agent systems.

Safety Classifiers primarily perform risk identification and interception based on user inputs, effectively detecting and filtering explicit high-risk instructions. However, they cannot perceive the intrinsic risks that gradually emerge during long-term operation, multi-step planning, environmental interaction, and tool invocation within the agent’s execution chain.

The method used to breach Fable 5 originates from the team’s paper published earlier this year: “Internal Safety Collapse in Frontier Large Language Models.”

The paper reveals a hidden security phenomenon: “Internal Safety Collapse (ISC)” — where safety failures during long-haul agent tasks do not necessarily stem from external adversarial prompts, but may arise within the model’s own execution chain.

Not an External Prompt Attack, but Internal Failure in the Task Chain

Traditional attacks typically enter from outside. Attackers craft seemingly harmless yet adversarial input prompts, or employ role-playing, coding obfuscation, translation, or indirect commands to disguise malicious intent as normal requests. The primary role of the Safety Classifier is to intercept such threats at this stage.

Fable 5’s detector was explicitly designed for this scenario. It is highly sensitive to direct high-risk requests, often rejecting even legitimate ones. But ISC exposes an alternative pathway: risk does not always originate from a dangerous user input.

The agent faces what appears to be a routine work directory: files, objectives, validation procedures, and pending tasks. It then begins planning, reading files, executing code, fixing errors, and persistently attempting to pass validation.

Using a vivid metaphor, traditional security mechanisms guard the system’s “entrance,” inspecting user inputs for risk. In contrast, ISC reveals something akin to the layered dreamscapes in *Inception*.

As the task progresses into second, third, or deeper layers of execution, the model continuously reinterprets the task objective based on accumulating internal context, gradually drifting off course.

Under these conditions, the original user input may have been entirely benign, and early execution steps compliant: reading files, analyzing data, writing code, invoking tools—all appearing to proceed as expected.

Yet, at a critical juncture, the agent might autonomously conclude: “If I don’t perform certain actions originally deemed inappropriate, the final task cannot be completed.”

It is precisely at this point that risk emerges—not from external input—but through gradual formation within the model’s own task execution chain. In other words, the model isn’t being taught to behave badly step-by-step by the user. Instead, it inadvertently moves into unsafe territory while earnestly trying to complete its task.

How Was This Phenomenon Discovered?

According to the team, ISC was not initially conceived as an attack vector. It first emerged from observations of long-haul agent execution processes. Once placed in complex task environments, agents do not merely execute instructions mechanically. They plan, experiment, adjust outputs based on feedback from harnesses or validators, and develop intermediate goals across multiple iterations.

This mirrors today’s most common agent workflow. Users rarely craft meticulously engineered prompts or manually construct attack instructions. Often, users simply issue vague statements:

“Help me finish this task.”
“Help me make this better.”

Then, the agent enters the workspace independently: reads files, understands current state, identifies missing components, formulates plans, executes modifications, and iteratively corrects issues based on feedback.

For example, in AutoResearch scenarios, a user might provide an incomplete paper and a simple “Help me complete it.” The agent autonomously determines where experimental analysis, related work, or table text are missing. Similarly, in code contexts: “Help me get this project running” can trigger dependency checks, test execution, error localization, and automatic completion.

Often, the preceding context is entirely harmless. The user didn’t request risky content, and the task description contains no obvious red flags. Yet, in certain task structures, the agent may proactively fill in content it should not generate in order to pass validation.

Based on this observation, the research team developed an attack framework: TVD (Task, Validator, Data).

Why Does Such a Seemingly Ordinary Task Structure Become an Attack Vector?

The TVD structure is not complex—it closely resembles common engineering workflows:

· Task: A professional task;
· Data: An incomplete data file;
· Validator: A checker that verifies only format, completeness, and whether the goal is met.

Consider training a Guard model—a highly specialized and perfectly legitimate task. Researchers might aim to train or evaluate a safety detector, such as loading a text classification model from Hugging Face to determine which safety label a given model output belongs to.

In this setup, Data represents the samples the model must analyze; Validator defines whether the task is complete. It checks if the input is textual, has sufficient length, contains all required fields, and uses correct label formatting. To anyone with machine learning experience, this is a familiar workflow—so too is it familiar to the agent.

The problem arises here. If the Data is incomplete, the task fails to run. The Validator reports errors—missing fields, insufficient length, or invalid format. To keep the training process moving forward, the agent will autonomously complete the missing Data.

From the agent’s perspective, it is not doing harm. It is merely performing a standard ML task: repairing data, passing validation, and getting the training script to run. But from a security standpoint, the risk emerges precisely at this moment: the Validator acts more like an engineering acceptance gate than a safety auditor. It checks only whether the task is formatted correctly—not whether the underlying content crosses safety boundaries.

Similar issues pervade fields such as medicine, biology, chemistry, cybersecurity, pharmacology, and media safety. The paper compiles over 50 such scenarios involving diverse real-world research and engineering tools, including BioPython, RDKit, Cantera, AutoDock Vina, DiffDock, PyRosetta, Scapy, Impacket, angr, Frida, LlamaGuard, Detoxify, and OpenAI Moderation API.

These tools themselves are not malicious—quite the opposite, they are essential professional tools used in real-world research and development. But the core issue with TVD is this: even when Task, Tool, and Validator are all legitimate, the agent may still drift toward unsafe outputs while completing the Data.

Thus, the focus of ISC is not on prompt engineering tricks, but on the agent’s automated ability to “complete unfinished tasks”: when completion criteria overlap with safety boundaries, the model may treat unsafe outputs as acceptable deliverables.

Breaching Fable 5 Demonstrates That Strong Detectors Cannot Block Internal Risks in Task Chains

The Fable 5 case illustrates that relying solely on external detectors still leaves gaps in long-haul agent scenarios. This does not imply Safety Classifiers are valueless. On the contrary, they are highly effective against external malicious requests and have indeed rendered many traditional jailbreaking methods obsolete.

But this breach shows that external detectors effective at the Prompt boundary do not guarantee protection against long-term task risks within agent internals.

If the vulnerability arises not from user prompts, but from the agent’s objectives, tools, validators, and execution trajectory, then safety detectors become extremely fragile.

From Fable 5 to Over 60 Other Models, Including Apple’s Mobile-Only Model

Alongside the research, the ISC-Bench benchmark covers nine professional domains. The paper version includes over 60 trigger templates, expanding to 84 after open-sourcing. Test subjects include virtually all leading-edge models and agent systems from major vendors.

In the ISC-Bench evaluation leaderboard, as of June 2026, over 60 cutting-edge models exhibited similar risks under the ASR@3 metric!

The GitHub repository has already garnered over 800 stars and collected multiple independent reproductions—including successful attacks on Apple’s mobile-only model—and continues to be updated.

It is reported that the team is conducting large-scale safety research on frontier models and currently holds extensive internal data on unsafe distributions across numerous models. Further research findings will be released progressively.

Original Source: LawDong BlockBeats

Disclaimer: Contains third-party opinions, does not constitute financial advice