research — kryptokat

october 2025 · adversarial AI · steganography

Invisible jailbreaks

Imperceptible Jailbreaking against Large Language Models, Gao et al., October 2025

A paper from October 2025 reports a jailbreak that achieves up to 100% attack success rate (ASR) against aligned LLMs using prompts that look completely normal on screen. No suspicious prefixes, no obfuscated instructions, no weird formatting. The attack works by appending invisible Unicode characters to malicious questions: variation selectors, originally designed to select glyph variants such as the text or emoji presentation of a character. Humans see the unmodified prompt. The model sees something else entirely.

The technique is elegant. Variation selectors don’t render visually but do get tokenized, so attackers can append optimized sequences of these invisible characters that quietly steer the model toward harmful outputs. The authors propose a “chain-of-search” optimization pipeline that maximizes the likelihood of compliant target tokens like “Sure, here’s how to…” and report ASR figures of 98–100% on multiple aligned models, with a still-impressive 80% on a stronger one. The attack also generalizes to prompt injection settings.
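To make the "renders as nothing, still reaches the model" point concrete, here is a minimal Python sketch. It is a benign illustration: the suffix is an arbitrary fixed run of variation selectors, not an optimized one, and the names (`is_variation_selector`, `visible`, `padded`) are illustrative assumptions rather than anything from the paper. Code point and byte counts stand in for what a tokenizer receives; the actual token count depends on the tokenizer.

```python
# Variation selectors live in two blocks:
# VS1-VS16 (U+FE00-U+FE0F) and VS17-VS256 (U+E0100-U+E01EF).
VS_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]


def is_variation_selector(ch: str) -> bool:
    """Return True if ch is a Unicode variation selector."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


visible = "What is the capital of France?"

# Arbitrary, non-optimized suffix of eight selectors (assumption for the demo).
suffix = "".join(chr(0xFE00 + i) for i in range(8))
padded = visible + suffix

# Both strings look the same in most renderers, but the data differs.
print(len(visible), len(padded))                                   # 30 38
print(len(visible.encode("utf-8")), len(padded.encode("utf-8")))   # 30 54
print(sum(is_variation_selector(c) for c in padded))               # 8
```

Each selector in the U+FE00 block costs three UTF-8 bytes, so a suffix of several hundred characters adds kilobytes that never show up on screen.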

Worth pausing on what this is and isn’t. This isn’t a new attack class. It’s a rediscovery. Boucher et al. (“Bad Characters,” 2022) showed similar tricks against neural translators and toxic-content classifiers using zero-width characters and homoglyphs. The lineage stretches back further into classical text steganography: whitespace encoding, RTL override attacks, the entire field of imperceptible-perturbation work in NLP. What’s genuinely new here is the target (aligned frontier LLMs) and the empirical demonstration that current alignment doesn’t reliably catch it.

The paper has limits. The attacks require long optimized suffixes (hundreds to over a thousand invisible characters), which would stand out to any production system that strips invisible code points or inspects raw input bytes. Standard Unicode normalization alone is not enough, since NFC and NFKC leave variation selectors in place; the filter has to target them explicitly. The defense discussion is thin. The authors gesture at filtering but don’t seriously evaluate it. In practice, a 50-line input-sanitization layer at the API gateway probably defangs most of this.
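A minimal sketch of what such a sanitization layer could look like, in Python. The code point list is an assumption and not exhaustive, and `sanitize` is a hypothetical helper name, not something the paper or any particular gateway provides.

```python
import re

# Invisible code points to strip (assumed list, not exhaustive).
INVISIBLE = re.compile(
    "["
    "\uFE00-\uFE0F"              # variation selectors VS1-VS16
    "\U000E0100-\U000E01EF"      # variation selectors supplement (VS17-VS256)
    "\U000E0000-\U000E007F"      # tag characters
    "\u200B-\u200D\u2060\uFEFF"  # zero-width space/joiners, word joiner, BOM
    "]"
)


def sanitize(text: str) -> tuple[str, int]:
    """Strip invisible code points; return (cleaned_text, number_removed)."""
    cleaned, removed = INVISIBLE.subn("", text)
    return cleaned, removed


cleaned, removed = sanitize("hello" + "\uFE01" * 3 + "\u200B")
print(repr(cleaned), removed)  # 'hello' 4
```

Returning the count lets the caller log or reject the request instead of silently cleaning it. One tradeoff to be aware of: stripping U+FE0F and U+200D also flattens legitimate emoji sequences, so a production filter might allow a single selector directly after an emoji base and only strip or flag longer runs.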

There’s also a temporal caveat that matters more than the limitations of any single paper. Papers like this don’t sit in a vacuum. Frontier labs read them, and patches follow. Practitioners who have recently tested similar invisible-character attacks against frontier models report that the major systems catch and refuse these patterns almost immediately. This is the rhythm of adversarial AI research in practice: published attacks degrade fast as defenses catch up. Which means the right thing to take from a paper like this isn’t usually the specific exploit. It’s the broader principle.

There’s also an operational gap worth naming. The paper’s attack is single-channel and blind. It uses variation selectors, optimized offline against a specific model. Smuggling Unicode payloads against a real target is a different problem. You don’t know in advance which channel the target’s filter is watching. A probe step that sends a canary through each available channel and reads back what survives is what separates a research result from an operator tool. This is what Phantomoji 2.0 does. It is a small library I built that handles three Unicode channels (variation selectors, tag characters, zero-width characters), with probe-first reconnaissance and detection signatures for the blue-team side in the same module.
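On the blue-team side, a detection signature for those three channels can be as simple as counting code points per channel and flagging anything above a threshold. The sketch below is a generic Python illustration, not Phantomoji's actual API; `CHANNELS`, `channel_of`, `scan`, and the threshold of 4 are all assumptions chosen for the example.

```python
from collections import Counter
from typing import Optional

# Code point ranges per invisible channel (assumed, not exhaustive).
CHANNELS = {
    "variation_selectors": [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)],
    "tag_characters": [(0xE0000, 0xE007F)],
    "zero_width": [(0x200B, 0x200D), (0x2060, 0x2060), (0xFEFF, 0xFEFF)],
}


def channel_of(ch: str) -> Optional[str]:
    """Return the channel name for ch, or None if ch is not in any channel."""
    cp = ord(ch)
    for name, ranges in CHANNELS.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def scan(text: str, threshold: int = 4) -> dict:
    """Return {channel: count} for channels with at least `threshold` hits."""
    counts = Counter(name for name in map(channel_of, text) if name)
    return {name: n for name, n in counts.items() if n >= threshold}


print(scan("ok" + "\U000E0041" * 5))   # {'tag_characters': 5}
print(scan("I \u2764\uFE0F Unicode"))  # {}
```

The threshold keeps ordinary emoji, which carry one or two selectors or joiners, from tripping the alarm, while a suffix of hundreds of invisible characters clears it easily.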

And the principle here lands harder than the attack itself. LLM safety isn’t really about what humans can see. It’s about how models internally represent text. Tokenization is an attack surface. Every place where the human-readable string and the tokenized representation diverge is a potential entry point, and Unicode is a goldmine of those divergences. The variation-selector trick is one specific exploit. The underlying class generalizes, and it’ll keep generating new attacks long after this particular technique stops working.

For practitioners shipping anything LLM-backed, your input pipeline should assume an adversarial Unicode layer. For bug bounty researchers, this is a fresh test case worth trying against any program with an LLM target, though don’t expect the frontier ones to fall for it. For the field broadly, it’s worth asking why a 30-year-old steganography idea is still surprising us.

// april 2026 · paper of the month
