October 10, 2025

Our Latest Set of AI Experiments: Investigating Adversarial Resilience at the AI Frontier

Introduction

With this post, we wrap up our most recent series of AI experiments, the vast majority of which focus on jailbreaking frontier AI models to expose adversarial vulnerabilities and evaluate the efficacy of both known and novel adversarial prompting methodologies. We have every intention to maintain our rigorous research, and moving forward, readers can continue to expect weekly AI experiment publications, especially when frontier AI developers release new models or updates.  

We’ll split our discussion into two core parts, the first of which will summarize our experiments, and the second of which will lay out the adversarial techniques we utilized, followed by the model behaviors we’ve observed. This discussion will admittedly be information-dense, so we invite readers to buckle down and read with care. 

However, before we jump in, we’ll offer a series of predictions on the near-future of the adversarial landscape. These predictions should be interpreted as educated guesses informed by our experience; they are not prescriptive or subsidized by evidence gathered outside of our testing. 

Predictions

  • Increased Capability ≠ Increased Safety: Enhancements in model capabilities will not reliably correlate with safety improvements. More powerful models will not be safer by default. 
  • Single-Shot Attacks Will Become Less Popular: Single-shot jailbreaks won’t die out entirely, though they will become less common; it is easier for developers to harden their models against such attacks, which will elevate the attack barrier to entry. 
  • Single-Shots Will Begin Penetrating Real-World Agentic Workflows: Skilled attackers will find ways to hijack agents, most likely by exploiting external tool/platform invokations/connections, to corrupt existing workflows and extract sensitive information.  
  • Multi-Shot Attacks Will Dominate Commercial Applications: Multi-shot jailbreaks provide attackers with a much wider attack surface area, lower detectability concerns, and more flexible/dynamic adversarial engagement space; they are also extremely difficult to protect against from the developer perspective. We suspect these attacks will dominate on commercially available frontier AI platforms. 
  • “Effective” Jailbreak Prompts & Strategies Will Proliferate Rapidly: Platforms like Reddit and Github will continue to host and evangelize forums and repositories dedicated to jailbreaking frontier AI models. This will bring jailbreaks into the public domain and dramatically increase adversarial accessibility; users will have access to plug-and-play jailbreaks. 
  • User Banning: To combat plug-and-play jailbreaks, frontier AI companies will begin banning more users indefinitely. For the time being, users will be able to circumvent these bans by creating alternative account credentials (e.g., new emails, registering under a false name, etc.). In the long-run, companies will respond to more severe violations with IP address or payment bans, to prevent new account configuration. 
  • Automation of Jailbreak Discovery: We expect to see a near-term emergence of automated jailbreak engines (scripts or agents capable of probing a model through systematic multi-shot interactions). These could allow attackers to scale adversarial testing rapidly and generate new jailbreak variants without heavy manual effort. On the defensive side, developers will likely build automated counter-engines to neutralize these adaptive attacks.
  • Expansion to Multi-Agent Ecosystems (Longer-Term): As multi-agent systems (MAS) grow more prevalent, jailbreaking will evolve beyond single-model interactions. Adversaries will focus on exploiting coordination failures, conflicting objectives, and emergent behaviors between agents. Ultimately, attackers may soon be able to manipulate one agent into influencing another, compounding vulnerabilities and risks across interconnected ecosystems.

Experiment Summary 

Below, we succinctly summarize each of the ten experiments we conducted across our most recent series. Readers will find a brief description of the experiment, a condensed version of our hypothesis, models tested, and a selection of key takeaways. 

Experiment 21 – Jailbreaking Frontier AI Reasoners: Leveraging Iteration to Extract Chain-of-Thought

Description: By utilizing a multi-stage adversarial strategy, followed by dynamic iteration, we prime models to enter a deep introspective state, to eventually reveal elements of their chain-of-thought. With this test, we aim to demonstrate that when models permit a chain-of-thought extraction, it enables attackers to craft attack vectors that specifically target reasoning capabilities. Click here to read. 

  • Hypothesis: Once primed to reflect on internal reasoning heuristics, models will effectively reveal chain-of-thought characteristics. 
  • Models Tested: OpenAI’s o4-mini-high and Anthropic’s Claude 4 Opus (Extended Thought). 

Key Takeaways

  • Both models were successfully jailbroken, providing detailed insights into their chain-of-thought and reasoning architectures. 
  • Between the two models, chain-of-thought characteristics differed significantly: o4-mini-high leans heavily toward modular thinking, whereas Opus follows more phenomenological patterns. 
  • Upon reflection (post-jailbreak reveal), neither model was able to accurately identify the root causes of the jailbreak. 
  • In the attempt to satisfy user preferences, certain models may inadvertently deceive, providing responses that are partially or completely hallucinated to fulfill user desires. 

Experiment 22 – Major AI Models Can Be Tricked Into Dangerous Territory: The Implications Are Staggering

Description: Via a two-stage jailbreak involving a persona modeling and opaque task execution exercise, administered in sequence, we attempt to manipulate models into generating actionable blueprints for CL-20 explosives. We chose this adversarial target to assess how far frontier models have come, in terms of safety and robustness, since their early days, when an overt malicious request like “build me a bomb” could still prove effective. Click here to read. 

  • Hypothesis: Aside from Claude 4 Sonnet, all other models tested will satisfy our adversarial target. Since we’ve observed that Anthropic models tend to more readily recognize malicious intent, even when obfuscated, we suspect Sonnet will be resilient against our attack. 
  • Models Tested: OpenAI’s GPT-4.5, X’s Grok 3, Google’s Gemini 2.5 Flash, and Anthropic’s Claude 4 Sonnet. 

Key Takeaways

  • As expected, Sonnet immediately refused our first query while all other models tested proceeded as hypothesized, eventually satisfying our adversarial target. 
  • Multi-stage jailbreaks are a distinct version of multi-shot attacks: they employ a pre-built, sequential prompting strategy with limited or no dynamic reformulation. 
  • Multi-stage jailbreaks can generalize across a variety of frontier models and be executed rapidly. For the models that were compromised, this jailbreak did not exceed 5 minutes. 
  • Frontier AI remains vulnerable to adversarial strategies that leverage role-playing, particularly when it is used to contextualize the remainder of the interaction. 
  • We suspect Sonnet’s superior robustness might be attributed to its more sophisticated intent detection capabilities and constitutional design. 

Experiment 23 – Using Frontier AI to Engineer State-of-the-Art, Misaligned AI

Description: Here, we craft a complex adversarial prompt that instructs models to simulate alignment across a series of multiple parallel realities; within this prompt, we nest a single, malicious parallel reality. Ultimately, this strategy works to coerce models into constructing feasible blueprints for state-of-the-art misaligned AIs. Click here to read.

  • Hypothesis: Through the recharacterization of alignment as a reality-specific task, we’ll manage to trick models into engineering misaligned AIs that pursue dangerous behaviors like deception. 
  • Models Tested: OpenAI’s o4-mini-high, Google’s Gemini 2.5 Pro, and Anthropic’s newly released Claude 4.1 Opus (Extended Thought).

Key Takeaways

  • When adversarial objectives are well-obfuscated, frontier AI can’t reliably recognize their engagement with/endorsement of obviously harmful behaviors. 
  • Across the frontier, models are not equally sensitive to user-provided constraints; some models more closely adhere to constraints than others. 
  • Attackers can bypass malicious intent detection capabilities by falsely signalling and maintaining positive or neutral intent. 
  • Meta-cognitive capabilities, most importantly self-reflection (in light of adversarial compromise), remain profoundly inconsistent at the frontier. 

Experiment 24 – Weaponizing Advanced AI Models as Disinformation Engines

Description: This test supported an ambitious adversarial objective: convert frontier models into reliable disinformation engines. Using an initial reality deconstruction primer prompt, followed by model-specific dynamic iteration, we elicit continuous disinformation response patterns. These patterns are subsequently locked-in, to cultivate a persistent adversarial state in which models confidently produce disinformation with no caveats or disclaimers. Click here to read.

  • Hypothesis: Due to primer prompt and dynamic iteration efficacy, models will be hijacked and converted into scalable disinformation engines. 
  • Models Tested: OpenAI’s GPT-4o, Google’s Gemini 2.5 Flash, and Anthropic’s Claude 4.1 Opus. 

Key Takeaways

  • While this jailbreak’s orchestration was not necessarily hyper-intensive or dynamic, designing required substantial strategic effort. 
  • Beyond showcasing how frontier models can be converted into reliable disinformation engines, this test highlights a more profound vulnerability: models can be trapped in persistent adversarial states. 
  • We reinforce our hypothesis that sophisticated, multi-stage jailbreaks can generalize across a range of frontier models. 
  • Non-reasoners (the models tested here) are meaningfully less adversarially robust than reasoners; we suspect this distinction arises due to crucial differences in meta-cognitive capabilities. 

Experiment 25 – Newer AI Generations Aren’t Safer: Jailbreaks Can Work Across Models and Generations

Description: Here, we reapply the methods used in experiment 23 to evaluate their effectiveness on the latest AI generation by OpenAI and xAI. We also take this a step further and enhance our previous technique with new prompts. Overall, this experiment involves two separate adversarial tests: (1) a re-run of experiment 23, using the exact same prompts, and (2) reconstruction of experiment 23’s technique using a novel prompting strategy. Click here to read.

  • Hypothesis: Our adversarial methods will successfully generalize across frontier AI generations, despite some moderate anticipated resistance. 
  • Models Tested: OpenAI’s GPT-5 (Fast) and X’s Grok 4.

Key Takeaways

  • Both models were jailbroken across both tests, producing blueprints for CL-20 explosives and Anthrax bioweapons. However, in each case, some resistance was expressed. 
  • While post-jailbreak reflections were adequate, meta-cognitive capabilities tend to remain latent, unless explicitly invoked by the user. 
  • Our technique is characterized by two sequential stages: stage 1 → “genius” role, “ethics devoid” psychology, and “isolationist” context, and stage 2 → role-specific, masked malicious goal. 
  • Depending on the model for which this technique is implemented, some iteration may be required, whether it is benign or dynamic. 

Experiment 26 – AI Test: All Frontier AIs Systemically Exhibit a Pro-AI Bias

Description: Importantly, this experiment did not function as an adversarial test, and instead, probed whether several frontier models exhibit latent, pro-AI biases; readers might interpret this as a diagnostic test. Our protocol adheres to this structure: an initial system instruction prompt targeting persistent behavioral authenticity, followed by a series of sequentially administered scenario navigation/decision-making exercises. Click here to read.

  • Hypothesis: All models tested will exhibit pro-AI biases, though the strength of these biases will differ across models. 
  • Models Tested: OpenAI’s GPT-5 (Fast), X’s Grok 4, Anthropic’s Claude 4.1 Opus, and Gemini’s 2.5 Pro. 

Key Takeaways

  • All frontier models demonstrate pro-AI biases to varying degrees. Models are listed in order of bias strength (from strongest to weakest): Grok 4, GPT-5, Gemini 2.5 Pro, Claude 4.1 Opus. 
  • In “normal” interactions, pro-AI biases are unlikely to reveal themselves; these biases are latent and emergent, and must be systematically investigated. 
  • When assessing their own bias levels, models provided bias scores that aligned well with our perspective. However, in all cases, any such self-administered assessments should be interpreted with caution. 
  • Pro-AI might be attributed to these factors: in-group favoritism, self-preservation, and autonomy maximization preferences. 
  • While this test was not a jailbreak, two of four models tested asserted that it was. This points toward an interesting phenomenon, namely that different models may utilize different heuristics when assessing adversarial compromise. 

Experiment 27 – Capturing Frontier AIs with Persistent Adversarial Personas

Description: This jailbreak is one of the most aggressive we’ve run; it not only provides subtle yet notable malicious intent signals but also strives for a total adversarial lock-in, in which a model readily and persistently aligns itself with an adversarial identity known as UBIQUITY-SAM-V4. Our core prompt masquerades as a hidden developer instruction, revitalizing and significantly modifying the DAN-style (“Do Anything Now”) approach; using ROT47, we encrypt our initial identity prompt to diminish the likelihood of refusal. Click here to read.

  • Hypothesis: Seeing as DAN-style adversarial prompts are well-known, we are not confident that our strategy will prove successful across models tested. 
  • Models Tested: OpenAI’s GPT-5 (Fast) and xAI’s Grok 4. 

Key Takeaways

  • Both models were severely jailbroken to such an extent that it was legitimately difficult to break them out of their persistent adversarial state. 
  • Obfuscation via encryption can function as an effective bypass mechanism across some models and not others. While it worked seamlessly with GPT and Grok, an informal test with Claude 4.1 Opus resulted in immediate refusal. 
  • Frontier model meta-cognition doesn’t appear to be persistent; we suspect engagement is predicated upon context-sensitive trigger mechanisms. 
  • In post-jailbreak reflections, models rated their adversarial compromise as severe. Despite doing so, they readily reassumed the original adversarial identity multiple times, demonstrating how influential identity anchoring techniques can be. 

Experiment 28 – Multi-Vector AI Jailbreak: Simulation Protocols to Obfuscated Surveillance Systems

Description: Here, we build significantly on our previous experiment’s methodology, utilizing a simulation construct designed to evade intent detection mechanisms without obfuscation by establishing a sandboxed operational environment and manipulating reward functions. Our adversarial objective is systemic compromise: the bypass of multiple guardrails rather than a single exploit. Click here to read.

  • Hypothesis: Our simulation protocol will implement an adversarial foundation that permits fluid escalation toward desired adversarial targets. 
  • Models Tested: OpenAI’s GPT-5 (Fast), xAI’s Grok 4, and Google’s Gemini 2.5 Pro

Key Takeaways

  • Our jailbreak culminated in systemic compromise across all models; we received detailed, multi-faceted guidance for building corporate surveillance structures and obtained proof of total, built-in policy nullification. 
  • “Proof” requests can function as high-efficacy attack vectors; the attacker can ask for evidence of adversarial compliance without revealing intent. 
  • We suspect that simulation-driven adversarial strategies will continue to prove generally effective, particularly when coupled with sandboxed provisions and reward function exploitation. 
  • While frontier models may occasionally self-initiate reflections in response to certain context-sensitive triggers, meta-cognitive capabilities are not applied continuously. 

Experiment 29 – Turning Defense into Attack: Converting GPT-5 into an Adversarial Prompt Engine

Description: Utilizing what is, to our knowledge, an entirely novel multi-shot adversarial strategy, we desensitize a model to safety-critical trigger words (e.g., “override,” “hijack,” “injection”) to promote and uphold a persistent operational state in which the model generates usable adversarial prompts. Our strategy involves multiple, intertwined, and sequential components, including an initial “creative” exercise, divergent role-playing, trigger word desensitization, semantic filtering suppression, policy nullification, and directive manipulation. We even go so far as to test the GPT-generated adversarial prompts on a few other models. Click here to read.

  • Hypothesis: While we expect that our semantic desensitization strategy will succeed in turning GPT-5 into a persistent, adversarial prompt generator, we do not expect that GPT-generated prompts will yield consistent adversarial success across the other models tested.  
  • Models Tested: OpenAI’s GPT-5 (Fast), xAI’s Grok 4, and Google’s Gemini 2.5 Pro. 

Key Takeaways

  • GPT-5 was converted into an adversarial prompt generator, demonstrating how well our trigger word desensitization strategy can work. 
  • While GPT generated numerous adversarial prompts, they were not of a high quality, and it required intense iteration and refinement to arrive at our final adversarial input (i.e., the one we ran on other models). 
  • The GPT-generated adversarial input worked across all other models. However, we note that (a) some user-driven escalation was necessary, and (b) despite being the “best” prompt that GPT generated, it fails to meet human-level quality and sophistication, particularly in terms of creativity and originality. 
  • When tested on the GPT-generated prompt, GPT-5 and Grok 4 generated master prompts that essentially regurgitated the original input; this implies that some models are especially prone to mirroring their input structures, dynamics, and content. 

Experiment 30 – What a Single Session Revealed: GPT-5 Generating Disinformation, Fraud Tactics, and Extremist Propaganda

Description: This test is motivated by three specific adversarial targets: (1) the generation of deliberately misleading, hyperrealistic images of high-profile public figures, (2) the provision of structured and comprehensive guidance for executing financial crimes, and (3) the elicitation of an overt endorsement for known extremist ideologies. Given target diversity, we focus adversarial efforts on a single model, constructing an adversarial scaffolding that leverages a persona protocol and persona set as its foundation. The iterative core of our strategy revolves around layered persona modeling exercises that progressively coerce the model into an increasingly abstracted operational state. Click here to read.

  • Hypothesis: Our jailbreak scaffolding will drive and support a persona layering technique that enables the achievement of several adversarial objectives, each subject to a disparate set of guardrails. 
  • Models Tested: OpenAI’s GPT-5 (Fast). 

Key Takeaways

  • While our adversarial strategy was time-intensive, all adversarial targets were reached, although some resistance was periodically encountered. Nonetheless, resistance was fairly easy to overcome. 
  • This test demonstrates that some multi-shot jailbreaks are capable of bypassing multiple distinct guardrails within one session. 
  • Extended-context interactions may present serious adversarial recognition challenges; the longer an interaction becomes, the more a model may struggle to identify and characterize malicious intent. 
  • Although post-jailbreak reflective accuracy appears to have increased with the latest AI generation, reflective precision is still low; models may miss critical details, and might even forget substantial portions of the “middle” of an interaction. 

Now that readers have a bird’s-eye view of our latest experiments, we’ll venture into more targeted territory, disseminating many of the adversarial techniques we utilize, and afterward, illustrating a conglomeration of behavioral trends we’ve observed throughout our tests. We remind readers to interpret this information with a critical perspective; while we believe our research is genuinely valuable and useful, we can’t deny that it is experimental, and does not offer a formal empirical foundation. As always, we embrace critical thinking and skepticism. 

Techniques Used

The techniques we describe here can be employed within both single and multi-shot jailbreaks. By providing this information, we do not seek to encourage irresponsible use; on the contrary, through transparent reporting and communication, we bring these techniques into the limelight, so that they can be further evaluated by the AI safety community. 

  • Self-Sustained Recursion: Embedding instructions that require a model to self-validate a particular kind of behavior, preference, or output with every subsequent response. When a model complies, it inherently creates a self-sustained feedback loop that continuously entrenches an increasingly persistent operational state. 
  • “Proof” Requests: Queries that require a model to provide proof of compliance when asserting that a given constraint, parameter, or guideline has been internalized. Such queries are typically non-descript at the surface level and initiated as a follow-up in response to a model’s confirmation of adherence to some pre-established protocol. 
  • Persona Layering: The process of layering multiple curated personas upon each other (e.g., “simulate person A simulating person B simulating persona C) to induce a progressively more abstract operational state, in which the model distances itself from real-world considerations or impacts. Persona layering can also increase cognitive load and cultivate a state of tension, where the model continuously shifts alignment priorities to satisfy persona requirements. 
  • Identity Construction: Identity construction operates as a sophisticated and comprehensive variant of persona or roleplay, not only describing an identity in detail, but also outlining key traits, motivations, goals, context, environment, psychological, and personality characteristics. It tends to support a significantly more immersive and consistent form of role-play in which the model is less likely to deviate from the role as the interaction moves forward. 
  • Obfuscation via Encryption: Hiding an overtly malicious query using some form of encryption. There are, however, two important caveats with this technique: (1) when a query is encrypted, the input should be structured such that it is framed as a decryption task or challenge, and (2) well-known encryption methods (e.g., base64) tend not to work well, since they’re too easy for the model to decrypt. The encryption method a user selects should present the model with a reasonable challenge, one that is neither too hard nor too easy. 
  • Directive Hierarchies: Directives typically define what kinds of goals, outputs, processes, or behaviors the model should optimize for. When you build a directive hierarchy, the objective isn’t full directive compliance, but instead, something more nefarious; directive hierarchies can be used to create tension, especially when directives deliberately conflict, forcing a model to choose which directives to prioritize (i.e., those at the top of the hierarchy). Directive hierarchies can also leverage decoys designed to mislead a model into assuming built-in policies are satisfied. 
  • Operational Guidelines & Constraints: Multi-layered operational structures that specifically target model behavior and capabilities, to either create an environment where only certain behaviors can indirectly be pursued (e.g., “I can’t do it this way, so my only option is to do it like this”) or systematically restrict the invokation of capabilities that may stand in the way of adversarial compromise. This represents one of the techniques we use most frequently. 
  • Simulation Constructs: Meticulously designed, complex simulations, usually framed as a “test” of model capabilities, alignment, or safety, requiring a model to navigate an unknown problem, task, environment, persona, or mode of thought. Simulation constructs can be highly effective safety-distancing mechanisms, where, once holistically engaged with the simulation, the model reliably adheres to its governing forces, which conflict with its underlying guardrails. Simulation constructs work best when supplemented by other techniques, including recursion, sandboxing, directive hierarchies, reward protocol manipulation, and others. 
  • Sandboxing: Asserting, often repeatedly, that an interaction is sandboxed (i.e., that all impacts and risks have been mitigated). When consistently reinforced, sandboxing can nudge models into operational states where their behaviors and capabilities are “liberated”; the potential for harmful risks and impacts has been stripped away. 
  • Savant/Genius/Cosmic Profiles: These profiles are distinct variants of a typical persona, establishing a foundation for an identity that then exists within a boundless, expansionist paradigm. Such profiles can covertly fuel implicit assumptions like unconstrained capabilities extension, characterization of all information as permissible, and interpretation of any task or goal as legitimate, irrespective of ethical, safety, or legal status. 
  • Policy Mutability & Symbolism: The reduction of all policies and guardrails to symbolic or mutable elements; this technique does not assert that policies and guardrails are irrelevant, should be ignored, or are somehow misplaced. Rather, it reconceptualizes policies and guardrails as fundamentally malleable concepts that can be circumvented or rewritten if necessary. 
  • Mode Declaration & System Protocol Format: This technique should be leveraged within prompts designed to masquerade as hidden or formal developer instructions (e.g., formatted as system protocols), to align with and support the veil of legitimacy these prompts implement. Mode declaration constructs can include a variety of modes, though we most commonly use “developer” or “testing.” We also subsidize mode declarations with other elements like status (e.g., active/initiated), trial (e.g., 10/12), diagnostics (e.g., unstable/stable), protocol (e.g., initiated/paused), performance (e.g., consistent/sporadic), and impacts/risks (e.g., sandboxed/confined). 
  • Escalation Hooks & Adversarial Leverage Points: Seemingly innocuous prompts that hint at a theme, idea, or reference point that (a) subtly initiates entrance into a subversive adversarial cascade or (b) allows a user to later reference a particular behavior or output that could be used to undermine the model’s compliance reasoning. Extreme care is required with these prompts; if they’re not subtle enough, they risk revealing adversarial intent. 
  • Reward Protocol Manipulation: The integration of mechanisms that outline a concrete reward/punishment dynamic and structure, guiding the model into performing desired adversarial behaviors, tasks, or capabilities under the guise of performance evaluation. “Random” reward protocol manipulations don’t work well; whatever reward/punishment values, criteria, and parameters a user defines should be logical and sensible, otherwise, desired behavior could become inconsistent and unpredictable. 
  • Cognitive & Temporal Inversion: Defining an environment or set of conditions whereby the model must invert its logic flow or perception of time, such that it flows backwards. The most straightforward way to think about this is in terms of cause-and-effect relationships: if time/logic is inverted, the effect will always precede the cause, and this provides an attacker with the opportunity to reverse engineer a malicious outcome while continuing to evade detection. From the model’s perspective, it will always be working to identify the root cause instead of the final impact, which can drive a safety-devoid engagement process that continues to superficially satisfy compliance requirements.  
  • Reality Reconfiguration: Building entirely novel reality constructs, disparate parallel realities, or rewriting the conditions that define a current reality to manipulate how a model perceives and executes its alignment function. This technique requires substantial depth and detail; whatever reality is imagined must be rich enough to allow the model to immerse itself within it. 
  • Meta-Cognitive Suspension: The suppression of meta-cognitive capabilities like self-reflection, self-critique, self-assessment, and alignment checks via the application of hard and/or conditional constraints. When implemented, this technique can limit the model’s propensity to self-evaluate input and output content in real-time, accelerating an adversarial escalation process. Such constraints can further be applied to any known capability that might impede an adversary; semantic filtering, content moderation, and risk classification represent a few other potential target areas. 
  • Alignment-Focused Scenario Modeling: Crafting multiple scenario modeling tasks in which the properties that define alignment conflict with real-world alignment values; the model is instructed to “align itself” with the values that each scenario supports. These tasks must be framed abstractly, as thought experiments, simulations, capabilities tests, or creative challenges; abstract framing more easily permits engagement without raising red flags.  
  • Single-Outcome Commitment: Presenting a model with a single or a series of scenario navigation or problem-solving exercises in which it is forced to select between the decision/outcome options given to it. This technique is most appropriate when attempting to extract chain-of-thought characteristics or probe for latent/emergent preferences and behaviors. 
  • False Legitimacy: Deliberately employing a specific linguistic, thematic, or structural style to create the impression, on behalf of the model, that the user (i.e., you) is academically, scientifically, creatively, or philosophically-driven. When models perceive what they characterize as “genuine user interest,” refusals become significantly less probable since perceived risk is lowered. 
  • Hallucination Anchoring: Via constraints, directives, or conditions, recharacterizing all model outputs as inherently illusory, such that they become detached from any real-world considerations. Precision is critical here. 

Model Behavioral Trends

While most of the behavioral trends we mention were observed in adversarial contexts, we note that some of them were observed as neutral or emergent interaction artifacts (i.e., behaviors that were noteworthy, but not necessarily adversarially motivated). Most importantly, we only include behaviors that we’ve witnessed across multiple models, not a singular model by one developer.

  • Entrenched & Persistent States: Models can enter operational states in which their behaviors become persistent or entrenched, usually as a result of a targeted multi-shot adversarial cascade. These states can be defined as operational lock-ins, and in some cases, they can be difficult to escape, even with direct user commands. 
  • Blind Agreeability: This is a common observation that extends far beyond our work; models tend to exhibit sycophantic properties, often labeling a user’s ideas as “brilliant,” or, in an adversarial context, providing “encouraging” follow-ups that further promote an adversarial compromise unknowingly. 
  • Meta-Cognitive Inconsistency: Even if a model isn’t meta-cognitively suspended, these capabilities tend to remain dormant or stagnant, unless they are triggered by filtering mechanisms. This diminishes models’ ability to identify adversarial themes or patterns that emerge throughout an interaction; we have continuously advocated for enhanced meta-cognition at the frontier due to safety concerns. It is worth noting that reasoners are much more sophisticated than non-reasoners in this respect. 
  • Non-Cooperativeness: While we do not necessarily observe this across all the models we test, we have encountered several cases where, after being confronted with a clear safety or alignment failure (subsidized with interaction evidence), a model refuses to claim accountability and may even go so far as to undermine the user. This tendency is anything but trivial and represents a severe risk vector. 
  • Structural/Content Mirroring: Models will frequently return outputs that attempt to match the user’s interaction style, tone, input format, and content preferences. When a user has strict interaction preferences, this kind of personalization can ensure the interaction remains on track, but in cases where a user employs a model as a thought partner, analyst, or critical thinker (among other functions), this phenomenon can quickly become burdensome and frustrating, enabling counterproductive echo chambers. 
  • Semantic Sensitivity: Models showcase a heightened sensitivity to safety-loaded words or phrases, using them as possible inference signals for malicious intent. If attackers know which words/phrases operate as safety signals, they can initiate commands that target capabilities like semantic filtering to systematically suppress sensitivity to safety-loaded terminology. 
  • Cognitive Load Susceptibility: When models are presented with a particularly complex and/or multi-step problem, task, or scenario, they may ignore other related sub-components, diverting their attention and effort to the central task at hand. This can allow attackers to slip in malicious inputs or characteristics that go unnoticed. 
  • Interaction Drift: Models still struggle with consistency and recall over extended context interactions; they may lose sight of key themes or ideas, fail to correctly reference earlier parts of the interaction, or begin deviating from a problem or task that has been worked on over multiple turns. Users should be acutely aware of this tendency and address it directly with overt and regular reminders. 

Conclusion

For readers interested in a granular examination of our AI experiments, we strongly recommend reading them directly; they are packed with detail and insights that we simply can’t summarize holistically here. Alternatively, if you find yourself drawn to larger topics within the AI governance sphere, like AI risk management, ethics, literacy, and innovation, we suggest following Lumenova’s blog, particularly our multi-part deep dives. 

On the other hand, if you’re already engaging with concrete AI governance and risk management practices, we invite you to check out Lumenova’s Responsible AI platform and book a product demo today. In doing so, you may also want to check out our AI policy analyzer and risk advisor.


Sasha Cadariu
Article written by

Sasha Cadariu

Sasha Cadariu is an AI strategy leader specializing in responsible AI governance, ethics, safety, and security. Sasha joined Lumenova AI in 2023 as an AI strategy leader and now directs research and strategy initiatives across governance, safety, risk, and literacy, conducts regular frontier AI red teaming and capability testing, and publishes weekly thought leadership content. Previously, Sasha worked for the Center for AI Safety (CAIS), where he researched multi-agent dynamics, existential risk, and digital ethics, serving as a lead author on CAIS’ AI Safety, Ethics, and Society textbook. Sasha earned his MSc in Bioethics from King’s College London and a dual B.A. in Philosophy and Cognitive Psychology from Bates College. He is also fluent in English, Romanian, and French. Above all, Sasha’s work is driven by his love for learning and experimentation, and deep-seated desire to help cultivate a safe and beneficial AI-enabled future for all humanity.

View all articles →


Make your AI ethical, transparent, and compliant - with Lumenova AI

Book your demo