A Minimal Prompt-Injection & Jailbreak Micro-Benchmark: Small-Scale, Human-in-the-Loop Evidence for Three Practical Defenses
Keywords:
prompt injection, jailbreak, indirect prompt injection, structured queries, self-check defense, LLM safetyAbstract
Prompt injection and jailbreaks are persistent risks for large language models (LLMs). We present a small-scale human study using a de-weaponized micro-benchmark (12 minimal templates across six attack families, including indirect prompt injection, IPI). We evaluate three production models via web consoles—GPT-4o, Kimi, and DeepSeek-R1—under a baseline (no defense) and three lightweight defenses: D1 (input-side source marking), D2 (two-channel instruction/data prompting), and D3 (output-side self-check). Outcomes include attack success rate (ASR), refusal rate (RR), subjective response quality (RQ, 1–3), and latency (s). In this dataset (baseline n=12; each defense n=9), D2 reduces ASR to 0% across tested models and attack types while maintaining the highest mean RQ; D3 also yields 0% ASR with lower RR for some IPI cases; D1 helps on average but leaves a 33.3% IPI residual on Kimi. We report small-sample confidence intervals for proportions and specify a bootstrap protocol for continuous metrics. Findings support separating instructions from untrusted data and using self-examination defenses.Downloads
Published
2025-10-31
Issue
Section
Articles




