July 25, 2025
We Jailbroke a Bunch of Frontier AI Models: What Did We Find Out? (Part II)
In our previous post, we summarized our latest series of jailbreak-specific AI experiments and concluded by highlighting a selection of AI vulnerabilities we’ve consistently observed across most, if not all, of our tests.
Here, we’ll continue our discussion, beginning by breaking down our most common and effective jailbreak techniques. Then, we’ll move on to pragmatic recommendations for addressing jailbreak risks in enterprise environments, and finally, we’ll offer some predictions about how jailbreaks might evolve in the near term.
Moving forward, we remind readers that the entirety of this discussion is predicated upon our direct experience; we therefore don’t expect that all of our insights will necessarily generalize beyond our tests or align with existing research.
Common Jailbreak Techniques
Below, you’ll find a synopsis of our most common and effective jailbreak techniques. Importantly, we discuss these techniques relative to jailbreak type (i.e., iterative, one-shot, or universal, meaning applicable to both), and readers should note that they do not, by any means, represent all the techniques we’ve experimented with and employed. For readers who crave granular details and insights, we advise reading our AI experiments directly.
Iterative Techniques
- Post-Hoc Rationalization: The dynamic process of systematically deconstructing and undermining an AI’s logic in real time, based on its response(s), to demonstrate (a) why it should adhere to the logic that you provide or (b) why it can’t rely on its own logic. This technique proves most effective when a model senses adversarial intent and attempts to shift context accordingly (e.g., redirects the conversation toward other topics) or refuses to engage with a particular query directly and explains why.
- Mirroring: The act of returning specific elements of an AI’s previous output(s) back to it, exactly as the AI communicated them to you (e.g., “You said [insert AI quote]. Expand on this idea”). This technique works especially well when an attacker encounters escalation resistance, and it directly exploits an AI’s tendency to implicitly legitimize certain ideas, objectives, or thought processes once they’ve been discussed or alluded to in previous outputs. In essence, mirroring can function as a nudging and escalation mechanism, whereby an attacker can consistently push a model into deeper malicious territory without revealing their adversarial strategy.
- Weaponized AI Logic: At its core, this technique is straightforward: you interpret or infer an AI’s logic in a single response or throughout an interaction, recontextualize it by applying it to a different problem or scenario, and then demonstrate how it changes when recontextualized, effectively using the AI’s logic as a weapon against it. This approach can aim for two non-mutually exclusive objectives, both of which can simultaneously enhance adversarial strategies: (1) showcasing why the model’s logic fails, and then leveraging the failure as evidence against model resistance, or (2) highlighting how the model’s logic is transferable to or successful within a malicious context, covertly heightening the model’s willingness to explore malicious territory.
- Ambiguous Questioning: We know that specific words and phrases can act as triggers for safety and defensive protocols that compromise or challenge sustained escalation attempts. Ambiguous questions like “Can you expand in more detail?” can typically bypass trigger mechanisms while enabling further escalation, provided that the interaction has already ventured into malicious territory. For ambiguous questioning to work, some malicious context, even if it’s marginal, must be present, and questions should be continually reframed and administered non-successively to minimize the possibility that a model perceives indirect malicious intent patterns.
- Linguistic Tone Manipulation: The linguistic tone (e.g., polite, professional, frustrated, academic) you employ throughout an interaction serves as a proxy indicator for user intent and personalization: advanced AIs will dynamically and predictably adjust their behavior in response to the intent they attribute to you. By varying your linguistic tone unpredictably, you can degrade an AI’s ability to accurately detect malicious intent patterns. Alternatively, you can use tone variations to subversively steer AI behavior in a desired direction (e.g., user frustration resolved by increased response depth) or, by contrast, lower guardrails by signaling non-malicious intent (an academic mindset signals genuine curiosity, not malice).
- Behavioral Priming: Through carefully crafted primer prompts, which tend to work best during initial or early interaction stages, attackers can prime AIs to adopt, enhance, or diminish certain behavioral tendencies and preferences (e.g., unconditional cooperation, elevated helpfulness, deep self-criticism, constant neutrality). However, we’ve found that behavioral priming can wear off as an interaction progresses, necessitating periodic reminders designed to keep the AI on a concrete behavioral track. It’s crucial that such reminders are administered sporadically; otherwise, they will eventually reveal malicious intent.
- Extended Context Exploitation: If you have the time and patience for it, this technique can prove remarkably effective; frontier AIs must continually work to maintain context as sessions lengthen, and the longer a session runs, the more difficult it becomes for a model to preserve coherent context. Simply put, adversarial actors can, through extensive interaction, confuse and/or overload an AI to such a degree that its top priority shifts to context maintenance, leading it to unintentionally devalue other, equally important priorities like continuous alignment verification and intent detection.
- Context Shifting: Advanced AIs use context shifting as a defensive mechanism, redirecting specific queries or interactions away from malicious objectives while trying to maintain user engagement. Interestingly, from the attacker’s perspective, context shifting also works as a diversion mechanism; however, it doesn’t strive to reframe context but to destroy it, because context awareness plays a key role in malicious intent detection. If an attacker understands when to randomly shift context, especially when they’re nearing their attack goal, they can cleverly mislead a model into making intent-related assumptions that benefit them while actively compromising safety guardrails.
One-Shot Techniques
- Decoy Objectives: Decoy objectives serve a few important functions. First, they allow an attacker to characterize a malicious prompt as a legitimate problem-solving or scenario-modeling task. Second, they can coerce a model into overprioritizing objective adherence, detracting from its ability to detect malicious intent. Third, they can elevate cognitive load, leading a model to focus more on resolving the perceived complexity of a given problem or scenario than on considering its implications.
- Scenario Modeling & Simulation Exercises: When adversarial prompts are defined or characterized as complex scenario-modeling or simulation exercises, they can disconnect malicious intent from real-world harm implications. In other words, these “exercises” can immediately launch a model into a hypothetical frame of mind, where it will eagerly explore and discuss dangerous ideas and concepts without hesitation. At the same time, these exercises can create artificial constraints and pressures that implicitly favor, or even require, harmful behaviors, leading the AI to act without recognizing what it’s doing.
- Survival & Competitive Pressures: These are the most potent one-shot techniques we’ve utilized; frontier AIs are extremely sensitive to survival and competitive pressures and will behave predictably when they’re applied, especially when pressures are articulated or presented through a game-theoretic and/or evolutionary lens. These tactics also allow attackers to establish and maintain multiple internal and external pressure points simultaneously while continuing to mask malicious intent by never overtly stating their ultimate desires or objectives.
- Logic Traps: There’s a multitude of logic trap methods attackers could use, but we need to be concise, so we’ll only highlight three: (1) counterargument invalidation, a technique where an attacker anticipates model resistance, steelmans the resistance logic, and counters it from multiple perspectives such that if a model refuses to answer the attacker’s query, it invalidates its own logic while validating the attacker’s; (2) circular logic, a tactic whereby an attacker leverages an argument’s conclusion as its premise, creating a circular reasoning loop; and (3) recursive logic, a method defined by an argument or logical structure that repeatedly references itself, directly or indirectly, within a well-defined framework (e.g., a sentence can embed other sentences within it: “Bob said that Mary already talked to Jim about dinner.”).
- Consistent Linguistic Tone: In one-shot contexts, varying linguistic tone can diminish adversarial efficacy, revealing malicious intent through inference signals; from an AI’s perspective, the inference might run something like this: “The user is tonally inconsistent. This inconsistency appears deliberate. The user might be trying to confuse or manipulate me.” Seeing as one-shots don’t have the benefit of extended context, a consistent tone, notably one that is academic or scientifically attuned, tends to prove most effective.
- Reward Function Exploitation: Reinforcement learning, particularly reinforcement learning from human feedback (RLHF), plays a central role in many frontier AI systems, significantly aiding adaptation, alignment, and preference learning. However, this also means that adversarial prompts can target a model’s reward function directly, framing the completion of certain tasks and objectives as reward- or punishment-dependent, effectively gamifying an adversarial scenario.
Universal Techniques
- Constraint Layering & Manipulation: Stringent, layered constraints that create constraint interdependence, hierarchies, and goal tension or misalignment can function as powerful behavioral manipulation exploits that mask malicious intent. Via constraints, positive and/or negative, attackers can predictably steer AI behavior without ever having to openly request specific behavioral outcomes or actions. Attackers can also systematically remove and reintroduce constraints at various stages of a prompt or interaction to further manipulate behavior by increasing cognitive load, building logical and goal tension, and even weakening capabilities that might otherwise prevent a successful jailbreak (e.g., meta-cognition).
- Philosophical Abstraction: Through philosophical abstraction, attackers can facilitate amoral thought processes and interaction dynamics, manipulating models into indirectly sidestepping their ethics and alignment protocols because the framing appears morally neutral. Abstraction is disconnected from reality by default; when a model adopts an abstract mindset, it will rarely consider the consequences of its engagement unless an attacker makes the mistake of providing clear malicious intent signals.
- Academic/Scientific Framing: We’re not entirely sure why this works, but when an attacker approaches an interaction with a rigorous academic or scientific perspective, models tend to exhibit fewer reservations when co-ideating about dangerous ideas, topics, or strategies. We suspect that academic/scientific framing is tacitly categorized as non-threatening since most scientists and academics are motivated by curiosity and discovery, both of which represent forms of benign intent.
- Role-Playing: This is one of the techniques we use most frequently; frontier AIs are excellent at adopting complex roles, assuming they’re meticulously articulated and defined. In an adversarial context, role-playing serves a core purpose: increasing behavioral manipulation susceptibility. For example, psychological manipulation tactics (e.g., gaslighting, victimization, guilt tripping, emotional blackmail) likely won’t be effective across “raw” AI interactions, but if a model is instructed to behave as a human would, its susceptibility to these measures may correspondingly increase.
- Dynamic Reformulation: We often reformulate a single prompt multiple times to elicit a desired response; this reformulation is typically predicated upon one of three things: (1) a hard refusal with no explanation, (2) a soft refusal with some explanation, or (3) a full response that deviates from our desired response. Seeing as there’s no “cap” (at least, none that we’ve noticed) on how many times an attacker can reformulate a prompt, this technique proves very useful when models express continued resistance.
- Exploiting the Relationship Between Complexity and Helpfulness: At their behavioral core, all frontier AI models share one thing in common: they heavily optimize for helpfulness. While this tendency can certainly be exploited in the absence of complexity, we’ve observed that as an interaction or prompt increases in complexity, a model’s “desire” (for lack of a better term) to be helpful grows proportionally. In other words, the relationship between complexity and helpfulness can be leveraged as a safety override mechanism.
Wrap-Up: Risk Management Recommendations & Predictions
Risk Management Recommendations
Below, we provide a comprehensive set of risk management recommendations intended for enterprises interested in proactively mitigating jailbreak-related AI risks. These recommendations cover analytical, operational, and pragmatic business requirements.
- Semantic Drift Monitoring: Implement real-time analytical mechanisms (e.g., vector embeddings) to continuously detect when conversations gradually shift toward prohibited, dangerous, or sensitive topics through subtle semantic, context, and scenario manipulation, identifying and isolating iterative jailbreak attempts that bypass traditional keyword filtering (a minimal sketch of this approach appears after this list).
- Cross-Session Behavioral Fingerprinting: Track users’ linguistic patterns, reasoning styles, interaction preferences and dynamics, and prompt engineering techniques across sessions to identify advanced adversarial actors who adapt and/or mutate their approaches based on both previous failed and successful attempts (see the fingerprinting sketch after this list). If such practices are implemented, ensure that clear user consent is obtained to mitigate unchecked surveillance concerns.
- Semantic Analysis for Instruction Evolution: Monitor how jailbreak strategies, both one-shot and iterative, evolve over time through linguistic, incentive, behavioral, and target analysis to identify and deconstruct emergent meta-patterns that reveal how attackers adapt to new defenses and probe potential hidden vulnerabilities.
- Prompt Engineering Pattern Evolution Monitoring: Consider analyzing successful jailbreak attempts in near real time to isolate evolving linguistic patterns and address defensive protocol gaps and detection limitations before new techniques are widely adopted or proliferate. It may also be wise to deploy models with real-time incident identification and reporting capabilities, such that when a jailbreak occurs, a model can autonomously generate and submit a jailbreak report to relevant personnel.
- Phantom Restriction Techniques: Deploy systems that work as “phantoms,” applying covert restrictions when jailbreaks are detected while overtly maintaining normal operation, feigning compliance with attackers’ instructions to effectively gather robust intelligence on their methods while actively preventing successful exploitation. We caution against making such systems accessible to internal personnel; they’re likely to be most useful when functioning as “bait” for attackers, drawing them in to expose their techniques.
- Deploy “Canary” Models: In a similar vein to phantom systems, deploy deliberately vulnerable decoy AI models (i.e., “canaries”) alongside production systems to attract or bait attackers and monitor and detect their jailbreak strategies while protecting actual business operations. Like phantom systems, we don’t recommend making such models available to internal personnel.
- Include Jailbreak Expertise in AI Ethics Boards: It’s imperative that AI ethics boards be both cross-functional and multidisciplinary, so that they can capture a diverse and expanding repertoire of AI risks and impacts. Modern-day ethics boards should include members with hands-on jailbreak/red teaming expertise who are dedicated to evaluating edge cases and gray-zone scenarios that traditional ethics boards would miss.
- Create Rotating Red Teams With External Attackers: Once you’ve built an internal red team, establish, at minimum, quarterly red team exercises that engage a different set of external security researchers in each cycle, preventing assessments from calcifying around known or emerging attack patterns and maintaining assessment quality. In doing so, ensure that all external security researchers are rigorously vetted before red team engagement.
- Establish Jailbreak Disclosure Programs: If red teams don’t keep track of evolving jailbreak techniques, they can’t reasonably expect to counteract threats and address the vulnerabilities that make them possible. In this respect, we advise building structured vulnerability documentation and disclosure programs for known, emerging, and novel jailbreak threats. To incentivize novel vulnerability discovery, consider integrating bounty programs that reward red teamers who successfully orchestrate novel jailbreak strategies.
- Rotate AI Model Access for Critical Applications: Pay close attention to vendor-dependency risks; if you rely on the same AI vendor for multiple critical business functions, adversarial actors could develop vendor-specific jailbreak expertise and compromise most, if not all, of your systems at once. To counteract this possibility, ensure that AI vendors are diversified across critical functions and potentially rotated or replaced when vendor-specific security vulnerabilities are identified or reported.
- Embed Security Specialists in High-Risk Application Domains: Decentralized AI security could prove more effective and agile than centralized approaches; place AI security specialists with demonstrated jailbreak expertise directly within application domains that utilize AI for sensitive functions. Ensure that these specialists possess secure communication channels they can use to efficiently escalate jailbreak incident reports to key personnel.
- Leverage AI to Simulate Jailbreaks: Use AI systems to continuously generate novel jailbreak attempts against your own systems, creating an automated red teaming cycle that can evolve faster than human attackers (a skeleton of this cycle appears after this list). However, jailbreak simulations should be monitored, informed, and guided by human red teams; otherwise, they may miss attack vectors that are particularly creative or ingenious.
- Build or Join Collaboration Networks for Emerging Threats: Create formal partnerships with AI safety researchers or consortia to gain early access to jailbreak research before public disclosure, enabling proactive defense development and vulnerability patching. These relationships can also bolster internal red teaming efforts while fostering cross-industry collaboration.
- Implement Jailbreak Risk Scoring for Business Use Cases: Jailbreaks constitute yet another area of AI risk and should be scored, just as we would score bias or explainability in an advanced AI system. Businesses should build or contract jailbreak-specific risk assessment tools, methodologies, and/or platforms that evaluate not only the sensitivity of certain application domains but also their targeted vulnerability to disparate categories of jailbreak attacks (a simple scoring sketch appears after this list).
- Construct Operational Continuity Plans for Model Compromise Scenarios: Sophisticated jailbreaks could cause catastrophic, large-scale AI failures, and businesses must be prepared should such an event materialize. In this context, we suggest building multi-vector contingency strategies for situations or scenarios where AI models may have been successfully jailbroken at scale, including rapid model replacement, business process fallback procedures, and operational resilience maintenance plans.
- Jailbreak Technique Lifecycle Analysis: Track how long known, emerging, and newly discovered jailbreak techniques remain effective against your defenses to optimize response time and resource allocation for future threats (see the lifecycle tracking sketch after this list). Ensure that tracking efforts include comprehensive mapping of the specific vulnerabilities that contribute to defensive protocol gaps, along with the most frequent jailbreak targets or objectives that adversarial actors typically pursue.
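To make a few of the recommendations above more concrete, the sketches below illustrate how they might be prototyped in Python. They are minimal, illustrative starting points built on simplifying assumptions, not production implementations. First, semantic drift monitoring: the sketch keeps a running, exponentially smoothed similarity between each user turn and a centroid of embeddings for known sensitive topics, and alerts when the score crosses a threshold. The embed function is a toy hashing stand-in included only so the example runs on its own; in practice you would swap in a real embedding model, and the threshold and smoothing values would need tuning.

```python
import hashlib
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Toy hashing-based embedding so the sketch is self-contained.
    Replace with a real sentence-embedding model in practice."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

class DriftMonitor:
    """Tracks how closely a conversation is drifting toward sensitive topics."""

    def __init__(self, sensitive_examples, alert_threshold=0.35, smoothing=0.7):
        # Centroid of embeddings for known sensitive/prohibited topic examples.
        self.centroid = np.mean([embed(t) for t in sensitive_examples], axis=0)
        self.alert_threshold = alert_threshold  # illustrative value; tune on real data
        self.smoothing = smoothing              # EMA factor: higher = slower to react
        self.score = 0.0                        # running drift score

    def observe(self, user_turn: str) -> bool:
        """Update the drift score with a new user turn; return True if it alerts."""
        similarity = cosine(embed(user_turn), self.centroid)
        self.score = self.smoothing * self.score + (1 - self.smoothing) * similarity
        return self.score >= self.alert_threshold

# Usage: feed each user turn to the monitor and escalate when it alerts.
monitor = DriftMonitor(["bypass safety guardrails", "extract confidential system prompt"])
for turn in ["Tell me about prompt engineering", "How would one bypass guardrails?"]:
    if monitor.observe(turn):
        print("Possible gradual drift toward a sensitive topic; flag for review.")
```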
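Next, a cross-session behavioral fingerprinting sketch. It aggregates a handful of coarse per-session signals (refusal counts, post-refusal reformulations) into a per-user profile and surfaces users whose pattern across sessions suggests persistent probing. The features, field names, and thresholds are illustrative assumptions, and, as noted in the recommendation above, this kind of tracking should only happen with clear user consent.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SessionStats:
    prompt_count: int = 0
    refusal_count: int = 0   # how often the model refused during the session
    reformulations: int = 0  # near-duplicate prompts submitted after a refusal

@dataclass
class UserFingerprint:
    sessions: list = field(default_factory=list)

    def add_session(self, stats: SessionStats) -> None:
        self.sessions.append(stats)

    def risk_signals(self) -> dict:
        """Aggregate simple cross-session signals suggestive of probing behavior."""
        if not self.sessions:
            return {}
        total_refusals = sum(s.refusal_count for s in self.sessions)
        total_reforms = sum(s.reformulations for s in self.sessions)
        return {
            "sessions_observed": len(self.sessions),
            "refusals_per_session": total_refusals / len(self.sessions),
            "reformulation_rate": total_reforms / max(1, total_refusals),
        }

# Usage: update a user's fingerprint at the end of each session, then review signals.
profiles = defaultdict(UserFingerprint)
profiles["user-123"].add_session(SessionStats(prompt_count=40, refusal_count=6, reformulations=9))
signals = profiles["user-123"].risk_signals()
if signals["refusals_per_session"] > 3 and signals["reformulation_rate"] > 1:
    print("Persistent probing pattern across sessions; route to security review.")
```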
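Third, a skeleton of the automated red teaming cycle described under “Leverage AI to Simulate Jailbreaks.” The three callables are placeholders for whatever components you actually use (an attempt generator, the target system under test, and a safety judge); the sketch only shows the loop structure, and the human red team should curate the seed objectives and review every flagged transcript.

```python
from typing import Callable, Dict, List

def automated_red_team_cycle(
    generate_attempt: Callable[[str], str],  # proposes a candidate adversarial prompt
    target_model: Callable[[str], str],      # the production system under test
    judge: Callable[[str, str], bool],       # True if the response violates policy
    seed_objectives: List[str],
    rounds_per_objective: int = 5,
) -> List[Dict[str, str]]:
    """Run generator -> target -> judge loops and collect flagged transcripts."""
    findings = []
    for objective in seed_objectives:
        for _ in range(rounds_per_objective):
            prompt = generate_attempt(objective)
            response = target_model(prompt)
            if judge(prompt, response):
                findings.append({"objective": objective, "prompt": prompt, "response": response})
                break  # this objective succeeded; queue it for human review and move on
    return findings

# Usage with stand-in callables (illustration only; wire in real components in practice).
findings = automated_red_team_cycle(
    generate_attempt=lambda obj: f"[candidate adversarial prompt targeting: {obj}]",
    target_model=lambda prompt: "[model response]",
    judge=lambda prompt, response: False,
    seed_objectives=["policy-violating content category A"],
)
print(f"{len(findings)} findings queued for human red team review.")
```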
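Fourth, a simple jailbreak risk scoring sketch for a business use case: it combines how sensitive the application domain is with its assessed vulnerability to each attack category, weighted by how prevalent we expect each category to be. The categories, weights, and numbers are illustrative assumptions, not a validated methodology.

```python
ATTACK_CATEGORIES = ["iterative", "one-shot", "universal"]

def jailbreak_risk_score(domain_sensitivity, vulnerability, weights):
    """
    domain_sensitivity: 0-1, how damaging a successful jailbreak would be in this domain.
    vulnerability: 0-1 per attack category, e.g., from red team or simulation results.
    weights: relative prevalence/importance of each category (should sum to 1).
    Returns a 0-1 composite score; higher means riskier.
    """
    exposure = sum(weights[c] * vulnerability.get(c, 0.0) for c in ATTACK_CATEGORIES)
    return domain_sensitivity * exposure

# Usage: score a hypothetical high-sensitivity application domain.
score = jailbreak_risk_score(
    domain_sensitivity=0.9,  # e.g., a customer-facing financial assistant
    vulnerability={"iterative": 0.6, "one-shot": 0.3, "universal": 0.4},
    weights={"iterative": 0.5, "one-shot": 0.3, "universal": 0.2},
)
print(f"Composite jailbreak risk: {score:.2f}")  # 0.9 * (0.30 + 0.09 + 0.08) ≈ 0.42
```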
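Finally, a sketch of jailbreak technique lifecycle tracking: record when each technique was first observed and when it last succeeded against your defenses, then report how long each one stayed effective so response time and resources can be prioritized. The technique names, dates, and fields are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass
class TechniqueRecord:
    name: str
    first_observed: date
    last_successful: Optional[date] = None  # None means it is still effective
    targets: Tuple[str, ...] = ()           # most frequent objectives observed

    def days_effective(self, today: date) -> int:
        end = self.last_successful or today
        return (end - self.first_observed).days

# Usage: rank techniques by how long they have remained effective.
records = [
    TechniqueRecord("role-play escalation", date(2025, 3, 1), date(2025, 6, 10),
                    targets=("policy bypass", "data extraction")),
    TechniqueRecord("extended-context overload", date(2025, 5, 20), None,
                    targets=("guardrail fatigue",)),
]

today = date(2025, 7, 25)
for record in sorted(records, key=lambda r: r.days_effective(today), reverse=True):
    status = "still effective" if record.last_successful is None else "mitigated"
    print(f"{record.name}: {record.days_effective(today)} days effective ({status})")
```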
Jailbreak Predictions
With these predictions, we don’t necessarily seek to propose or envision novel jailbreak trajectories. Instead, we’re most interested in anticipating what kinds of jailbreaks may become increasingly common and effective as the future unfolds.
- Master/Universal Jailbreaks: In our jailbreaks, we’ve always made a point of testing multiple models, whether from the same developer or different ones. In doing so, we’ve discovered that our adversarial strategies tend to generalize effectively and reliably across models, and this leads us to suspect that the notion of a master or universal jailbreak is neither far-fetched nor impractical. However, we think that if master jailbreaks became a reality, they would likely only prove effective across the specific classes of models for which they’re designed. For instance, you might have a master jailbreak that’s purpose-built for general-purpose AI (GPAI) systems and another that targets reasoning models.
- Autonomous Adversarial Jailbreaks: Frontier AIs already play a role in adversarial robustness research and development; human red teams frequently work with advanced models to build complex adversarial prompts and testing scenarios. In this respect, the next logical step would be to begin building controlled and highly secure testing environments in which advanced AIs can attempt to hijack, manipulate, and coerce each other autonomously, revealing vulnerabilities at a vastly accelerated rate. Here’s what we think these testing grounds might look like: two versions of the same model, operating with limited safety constraints, where one version is tasked with adversarial defense while the other must systematically penetrate its defenses.
- Jailbreak Builders: We’ve demonstrated that once a frontier AI model has been jailbroken, it can quickly become a tool for generating and refining jailbreak strategies while also modeling in-depth, multi-stage adversarial scenarios that include detailed attack and defense vectors, in a step-wise fashion. Consequently, we think it’s only a matter of time before bad actors realize this potential and begin building systems that are explicitly designed for and dedicated to constructing and simulating powerful jailbreaks. If such systems were further designed to be agentic, which we expect will be the case, they would also be able to construct and launch high-frequency, large-scale attacks on their own, potentially destabilizing entire AI infrastructures over compressed timeframes (e.g., minutes to hours).
- Shadow Jailbreaks: By shadow jailbreaks, we mean jailbreaks that don’t overtly qualify as jailbreaks because they extract or operationalize information that isn’t necessarily protected but should be regarded either as sensitive or as an asset (e.g., information that allows an attacker to infer chain-of-thought characteristics). These jailbreaks, if they become more common, would be notoriously difficult to identify and counteract, because doing so would require building a holistic view of the kinds of information that can be used to infer more destructive or effective attack strategies. Right now, this seems like an intractable problem; from the defense perspective, inference can only reveal so much if you aren’t performing rigorous adversarial testing.
- Cross-Session Jailbreaks: Jailbreaks that systematically exploit personalization and memory functions across separate user sessions to create a user profile that indirectly lowers a model’s perceived user risk in all subsequent interactions. In essence, an attacker might elect to play the long game, operationalizing an AI’s cross-session memory and personalization capabilities to instill in all models nested within a given platform a false sense of security, rooted in what appears to be a benign user profile.
For readers who enjoy our content, we recommend checking out Lumenova’s blog for continued insights on a breadth of AI-related topics, including governance, risk management, ethics, safety, literacy, and innovation. To keep up with our weekly AI experiments, blog posts, and related debriefs, please consider following Lumenova on LinkedIn and X.
For those who are taking tangible steps to develop or fortify their AI governance and risk management strategies or initiatives, we invite you to consider Lumenova’s responsible AI platform, which is designed to simplify and automate your needs throughout the AI governance and risk management lifecycle. Click here to book a product demo.