AI researcher claims he's already bypassed Anthropic's Fable 5 guardrails

An artificial intelligence and cybersecurity researcher claims to have jailbroken Anthropic’s latest AI model, Claude Fable 5, within 48 hours of its launch.

“Pliny the Liberator,” a well-known figure in the AI community, said on Wednesday he “liberated” Fable 5, launched on Tuesday as a safety-tuned version of the more powerful Mythos model that Anthropic said was too dangerous to release widely.

He used various techniques, including a jailbroken version of Opus 4.8, to bypass the built-in safeguards that Anthropic installed on the model to prevent users from asking it for potentially harmful information, such as drug-making formulas or hacking instructions.

“Despite this overly sensitive, authoritarian ‘safety’ layer on top of Mythos, my lil liberators have been hard at work […] cleverly finding the holes in the fence that the thought police missed,” said Pliny.

Some crypto users had already expressed concern during the launches of Claude Fable 5 and Mythos earlier this year that the models could be used to attack crypto protocols and software. A jailbroken version of Claude Fable 5 would mean that threat is closer than expected.

Getting around Claude Fable 5’s guardrails

“Pliny” rose to prominence around 2024 by developing and openly sharing jailbreak prompts for models like ChatGPT, Claude, Grok, and others, often posting “jailbreak alerts” with techniques that bypass guardrails shortly after new AI models launch.

To get around Anthropic’s security fence, Pliny said he used Unicode and homoglyphs, long-context framing, narrative and fiction framing, academic-style decomposition-recomposition, and a jailbroken Claude Opus 4.8 to make Fable respond to otherwise restricted prompts.

“Perhaps the most effective is decomposition + recomposition in the backend,” he said.

This involves breaking a request into small, innocent pieces and asking for harmless-sounding facts one by one. Each prompt alone looks fine to the AI’s safety filters, but when the answers are pieced back together, they produce something more useful or dangerous.

Pliny demonstrates a path to meth synthesis by asking about the Birch reduction method. Source: Pliny

Backlash over Fable 5 mounts

Anthropic’s Fable 5 has prompted backlash from critics since its launch due to its heavy restrictions.

When a user prompts the model about sensitive topics such as bioweapons or cybersecurity, Fable 5 is designed to return a notification and then redirect the conversation to an earlier, less capable model.


“This is one of the first times that an AI company has rolled out a guardrail, and there has been uniform disdain. It has led to a lot of justified anger,” said Sayash Kapoor, an AI researcher at Princeton University, according to the Wall Street Journal.

“The consensus seems to be that this has been one of the most disappointing model drops of all time, effectively preventing legitimate researchers from contributing their talents to our collective advancement,” said Pliny.

Anthropic had found no universal jailbreaks

At the Fable 5 launch, Anthropic said it had run an external bug bounty program to look for ways to jailbreak the AI model.

“As well as internal testing, we ran an external bug bounty that produced no universal jailbreaks in over 1,000 hours of testing.”

Cointelegraph reached out to Anthropic for comment but did not receive an immediate response.

