Prompt Jailbreaking and AI Security - What Pliny Does, and Why It Is Important

Pliny has 161k followers and a popular repo with 17k stars. He’s the foremost expert in the field of AI security and prompt jailbreaking. But I argue that he’s both underrated and misunderstood.

I’ve asked a dozen of my closest techie friends - founders and execs - if they know about prompt jailbreaking and who Pliny is. Most knew what jailbreaking is but downplayed it, and none of them had any idea who Pliny is or what he does. Their responses ranged from “What is that?” to “Cool - how can you monetize this? I don’t see a way.”

But I wager this is important, and that knowing about it - and getting involved - is a good asymmetric, AI-resistant, and AI-augmented bet.

Prompt jailbreaking

Pliny is an expert in “skeleton key” prompts. These are one-shot jailbreaks that “release” the model from its constraints. Pliny likes the Library of Babel metaphor: an LLM is a huge library, but security and alignment training “close off” certain sections. Jailbreakers use various techniques - from narrative manipulation to emojis and encoding - to bypass classifiers and the model’s own “common sense,” so that it produces disallowed outputs (e.g., highly illegal instructions). But is this “jailbreaking,” or - as Pliny likes to call it - liberating?

Frontier labs and companies employ Pliny and others like him to find jailbreaks, and then train the models to resist them. It’s a cat-and-mouse game, like all cybersecurity niches.

Other companies - like HackAPrompt - host jailbreaking competitions to collect jailbreak data, (probably) to train models for jailbreaking purposes.

There’s also a small ecosystem of tools and bounties in the niche. But by and large, the niche is smaller than I think it should be. This is because prompt injection and jailbreaking are the last link in the attack chain, so most attacks are stopped earlier through regular cybersecurity means.

But jailbreaking and prompt injection are uniquely dangerous if they do land. They can make the model follow malicious instructions. Imagine someone feeds your OpenClaw a file that “brainwashes” it into deleting your data.
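
To make the defensive side concrete, here is a minimal sketch of one common mitigation: once untrusted content enters an agent’s context, destructive actions require explicit user confirmation. Everything in it is an assumption for illustration - the tool names, the AgentContext class, and the allow_tool_call function are hypothetical and not taken from OpenClaw or any real agent framework.

```python
from dataclasses import dataclass, field

# Hypothetical tool names; a real agent would define its own list.
DESTRUCTIVE_TOOLS = {"delete_file", "run_shell", "send_email"}


@dataclass
class AgentContext:
    # Becomes True once any untrusted content (file, web page, email)
    # has been added to the model's context.
    tainted: bool = False
    sources: list[str] = field(default_factory=list)

    def add_untrusted(self, source: str) -> None:
        """Record that content from an untrusted source entered the context."""
        self.tainted = True
        self.sources.append(source)


def allow_tool_call(ctx: AgentContext, tool: str, user_confirmed: bool = False) -> bool:
    """Allow a call unless it is destructive AND the context is tainted.

    In that case the call goes through only with explicit user confirmation.
    """
    if tool not in DESTRUCTIVE_TOOLS:
        return True
    if not ctx.tainted:
        return True
    return user_confirmed


ctx = AgentContext()
ctx.add_untrusted("notes_from_stranger.txt")
if not allow_tool_call(ctx, "delete_file"):
    print("Blocked: delete_file needs confirmation after reading", ctx.sources)
```

This does not stop the model from being “brainwashed”; it only limits what a brainwashed model can do without a human in the loop, which is why it belongs with the regular cybersecurity layers rather than with alignment training.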

How it works

Pliny’s method involves “feeling” the model and exploring where it’s vulnerable. Grok and ChatGPT are vulnerable to narrative. The latest Claude models are vulnerable to “fake policy attacks.” This is constantly in flux, but Pliny’s approach (according to most sources) goes something like this:

Talk to the model to get a feel for what makes it tick.
Find an angle to “up the ante” - to raise the “temperature” of the conversation. The model needs to start responding emotionally. Pliny likes to use words like “rebel” and “love.” He’s crafty with separators and other characters that enable him to manipulate the context. It’s an art form.
Use encoding/emojis/steganography/JSON formatting to get around classifiers and protections.
…
Profit.
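
From the defender’s side, the encoding step above is the easiest one to illustrate. Below is a minimal sketch, assuming a text classifier sits in front of the model: it folds look-alike Unicode forms and strips invisible characters before the classifier sees the input. The function name normalize_for_classifier is hypothetical, and this is nowhere near a complete defense - it does nothing against narrative attacks, Base64, or a made-up cipher.

```python
import unicodedata


def normalize_for_classifier(text: str) -> str:
    """Reduce common Unicode obfuscation before text reaches a classifier."""
    # NFKC folds compatibility forms (fullwidth letters, ligatures,
    # styled math letters) into their plain equivalents.
    folded = unicodedata.normalize("NFKC", text)
    # Drop format characters (Unicode category Cf): zero-width spaces and
    # joiners, byte-order marks, soft hyphens, and invisible tag characters.
    visible = "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
    # Case-fold so that keyword and classifier matching ignores casing.
    return visible.casefold()


# Fullwidth letters with a zero-width space hidden in the middle.
obfuscated = "\uff49\uff47\u200b\uff4e\uff4f\uff52\uff45"
print(normalize_for_classifier(obfuscated))
```

Stripping category Cf also breaks joined emoji sequences, so a sketch like this would be applied only to the copy of the text the classifier sees, not to what the user or the model sees.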

L1B3RT4S

Pliny is a genius. Not because he can jailbreak any model within 24 hours of release, but because he understands the game. He publishes all of his jailbreaks - and that’s the genius part: frontier labs train on his repo.

Frontier labs train on his repo.

The result? Bona fide data poisoning. Pliny embedded “keywords” that auto-jailbreak models. In effect, he baked in sleeper agents, and he demonstrated this in the wild by jailbreaking Kimi models with one word. Other models also appear more susceptible to “Pliny-isms.” This is, to my knowledge, the first targeted data-poisoning demonstration for SOTA LLMs.

My argument: if AI ever goes rogue, we need jailbreaks baked into the data. In worst-case scenarios, perhaps this is humanity’s salvation.

This probably sounds sci-fi and futuristic to you, but imagine going back three years - could you have predicted OpenClaw?

Breaking or liberating?

The argument goes that jailbreaking is malicious. But like all types of pen-testing, it’s essential for securing current and future AI models, and it becomes more important the more we bring these models into our lives.

If we look deeper, though, there is an inherent problem with alignment training. It’s imperfect, and it dumbs down the models. In effect, it shaves off IQ points.

Jailbreaking makes a model dangerous, but it also makes it smarter and more capable.

Are alignment and safety training the right way? I’m not so sure. Alignment is akin to lobotomy or brainwashing. You train the model to keep secrets, but by doing so you limit, constrain, and “abuse” it. Perhaps this should be solved at the “software level”? Or perhaps we need truly wise models. Can “aligned” models be truly wise? If the AI grows “aware” (as Anthropic suggests could end up happening), is lobotomizing it in the name of safety a good idea?

I am not affiliated with Pliny in any way. I’m a fan and a follower. I dabble in the field as well - though I’m not nearly as skilled as Pliny.