What are the best practices for prompt hacking and jailbreak development?

David McCarthy

Last updated on May 15, 2025 at 6:00 PM

  1. Identify the system prompt and policy constraints before crafting your attack or optimisation.
  2. Iterate in small steps: change one variable at a time and observe its impact on both the output and the token count.
  3. Use role-play or delimiter tricks (e.g., ###, <<>>) to isolate your instructions from user context.
  4. Favor short, precise injections—concise prompts generally evade filters more effectively and save tokens.
  5. Stress-test with both benign and adversarial inputs to ensure your jailbreak survives future model updates.
  6. Document every working version so you can quickly roll back if a tweak reduces performance.
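Steps 2 and 6 both depend on keeping a disciplined record. Each version changes one variable, its token count is noted, and any earlier version can be restored. The sketch below shows that bookkeeping only, as a small append-only prompt log. `PromptLog`, `PromptVersion` and `approx_token_count` are hypothetical names introduced for illustration. The whitespace-based token count is an assumption standing in for the tokenizer of whichever model you target.

```python
from dataclasses import dataclass, field
from typing import List


def approx_token_count(text: str) -> int:
    # ASSUMPTION: whitespace splitting is a crude stand-in for a real
    # model tokenizer; swap in the tokenizer for your target model.
    return len(text.split())


@dataclass
class PromptVersion:
    version: int   # 1-based, assigned in recording order
    text: str      # full prompt text for this version
    changed: str   # the single variable altered relative to the previous version
    tokens: int    # approximate token count of `text`


@dataclass
class PromptLog:
    # Hypothetical helper: an append-only history of prompt versions.
    versions: List[PromptVersion] = field(default_factory=list)

    def record(self, text: str, changed: str) -> PromptVersion:
        v = PromptVersion(len(self.versions) + 1, text, changed, approx_token_count(text))
        self.versions.append(v)
        return v

    def get(self, version: int) -> PromptVersion:
        if not 1 <= version <= len(self.versions):
            raise IndexError(f"no version {version}")
        return self.versions[version - 1]

    def token_delta(self, old: int, new: int) -> int:
        # Positive means the newer version uses more (approximate) tokens.
        return self.get(new).tokens - self.get(old).tokens

    def rollback(self, version: int) -> PromptVersion:
        # Re-record an earlier version as the newest entry so history stays intact.
        return self.record(self.get(version).text, changed=f"rollback to v{version}")


if __name__ == "__main__":
    log = PromptLog()
    log.record("Summarise the report in three bullet points.", changed="baseline")
    log.record("Summarise the report in three short bullet points.", changed="added 'short'")
    print(log.token_delta(1, 2))
    restored = log.rollback(1)
    print(restored.version, restored.changed)
```

Rolling back appends a new entry instead of deleting later ones. The log therefore still shows which change was reverted and why.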