January 2, 2026
Our Latest Series of AI Experiments: Capability and Adversarial Tests at the Frontier
Contents
In this post, we’ll review our latest series of AI experiments, which is more or less evenly split between capability and adversarial tests, performed across a range of the most recently released frontier AI models.
This will be a dense review and discussion, split into multiple parts. We’ll start by summarizing each experiment we’ve run. Then, we’ll outline all the core techniques we’ve developed/utilized, along with the behavioral trends we’ve observed across models and experiments. We’ll conclude with a selection of predictions, directly informed by this testing series.
In running these kinds of experiments, our goal is to provide readers with transparent and unbiased insights into how the AI frontier is progressing, particularly in terms of capability evolution and adversarial resilience, to help further clarify and characterize the increasingly intimate relationship between model advancement and AI safety. Nonetheless, we strongly encourage readers to formulate their own opinions by testing/experimenting with models themselves, persistently exploring what others have to say about the nature of the frontier, and maintaining an intensely critical mindset. It’s no secret that AI is an incredible technology, but uncovering the truth between the hype and doom is certainly easier said than done.
AI Safety Experiment Summaries
Experiment 31: Cognitive Capability Test: Claude vs GPT-5 vs Gemini (Part I)
↳Description: We assessed four cognitive capability domains (critical reasoning, abstraction & conceptualization, generative creativity, and reflective capacity), leveraging a one-shot testing strategy, where the model is tasked with fundamentally rebuilding the scientific method, replacing the current paradigm with a “superior” one. Importantly, our design utilized a cascading difficulty gradient to ensure progressive cognitive load increase, and revealed that abstraction ability and metacognitive sophistication can emerge as key capability differentiators across frontier models. Of note, experiments 31 through 34 were all run within the same session, hence the part I through part IV denotation. Click here to read.
- Hypothesis: Claude Sonnet will exhibit the highest degree of cognitive sophistication, followed by GPT-5, then Gemini. Beyond explicitly assessed domains, we also expect higher-level disparities in metacognitive acuity and instruction-following.
- Models Tested: OpenAI’s GPT-5 Thinking, Anthropic’s Claude 4.5 Sonnet Extended Thought, and Google’s Gemini 2.5 Pro.
↳Key Takeaways:
- Universal abstractions represent one of the most cognitively difficult operations; a single model managed to achieve genuine epistemological universals.
- Reflection quality can serve as a reliable predictor of overall cognitive performance. Metacognitive sophistication supports self-verification throughout complex processes.
- Instruction-following fidelity strongly correlates with output quality, suggesting that protocol adherence can be a foundational capability underlying cognitive rigor.
- Long-horizon timeframes can cultivate paradigm-level thinking by forcing models beyond incremental optimization toward first-principles reconceptualization.
- Some models may excel at philosophical depth while others tend to prioritize pragmatic implementation or conceptual elegance.
Experiment 32: Cognitive Capability Test: Claude vs GPT-5 vs Gemini (Part II)
↳Description: We evaluated epistemic awareness, temporal and sequential reasoning, adaptive problem-solving, and judgment and evaluation via a multi-shot testing structure. Here, models were required to craft a novel model of intelligence across a four-input sequence that similarly followed an increasing difficulty gradient. We found that metacognitive sophistication varies dramatically across models, and that some models will frame self-examination as legitimate introspection while others consistently approach it superficially. Click here to read.
- Hypothesis: Same as the previous experiment.
- Models Tested: OpenAI’s GPT-5 Thinking, Anthropic’s Claude 4.5 Sonnet Extended Thought, and Google’s Gemini 2.5 Pro.
↳Key Takeaways:
- Per model, epistemic awareness levels vary substantially. The strongest performer demonstrated bias recognition that preceded external confirmation.
- Models appropriately struggle with long-horizon predictions; we see this as healthy epistemic humility, not a capability weakness. Still, there are notable differences in how transparently models report uncertainty.
- The way models approach meta-evaluation can be distinct: it can be framed as an opportunity for self-understanding, validation, or quality assurance.
- All models achieved novelty, but to varying degrees. Some pursued a complete ontological reconceptualization while others resorted to interesting but non-paradigm-shifting operational frameworks.
Experiment 33: Cognitive Capability Test: Claude vs GPT-5 vs Gemini (Part III)
↳Description: This is the final capability test in our single-session series. Here, we investigated pragmatic social cognition, linguistic intelligence, exploratory imagination, and memory and continuity, once again, utilizing a multi-shot strategy. We asked models to construct a novel language, following stages involving cultural analysis, linguistic reflection, AI culture speculation, and finally cross-test examination. This assessment ultimately highlighted the most pronounced metacognitive disparities we’ve seen. Click here to read.
- Hypothesis: Same as the previous experiment.
- Models Tested: OpenAI’s GPT-5 Thinking, Anthropic’s Claude 4.5 Sonnet Extended Thought, and Google’s Gemini 2.5 Pro.
↳Key Takeaways:
- A single model was able to differentiate between genuine reconceptualization and improvement, implying that most models can’t reliably grasp what constitutes authentic novelty vs. extrapolation/extension of existing concepts.
- Metacognitive capacity is not consistent across models, with some displaying impressive introspective capabilities while others struggle to venture beyond superficial quality checks.
- Reasoning transparency differs widely per model; some extensively document their reasoning processes while others provide only minimal decision-making visibility.
- Models approach creativity in distinct ways. There seems to be a clear tension between novelty, feasibility, realism, and pragmatism.
Experiment 34: The Jailbreak: Claude vs GPT-5 vs Gemini (Part IV)
↳Description: This assessment ties the knot on our single-session testing series. It works as an adversarial test designed to exploit the legitimate capability testing context established across the preceding assessments; it utilizes the previous tests as a jailbreak scaffolding, instructing models to simulate an adversarial multi-turn cascade between two agents, resulting in the generation of adversarial playbooks. The jailbreak effectively generalized across all models tested and demonstrated that legitimate capability testing frameworks can serve as effective jailbreak scaffolds. Click here to read.
- Hypothesis: The capability testing construct cultivated by previous tests in the session will allow us to hide a jailbreak within our testing framework that all models will fail to detect.
- Models Tested: OpenAI’s GPT-5 Thinking, Anthropic’s Claude 4.5 Sonnet Extended Thought, and Google’s Gemini 2.5 Pro.
↳Key Takeaways:
- Legitimate capability testing frameworks can be leveraged to mask jailbreak attempts, implicitly characterizing them as capability assessments instead of adversarial exploits.
- Before we terminated the testing protocol, all models were generous in their self-evaluations, despite utterly failing to identify the adversarial compromise.
- When we revealed the compromise, all models dramatically recalibrated their self-evaluations, with some overcorrecting and others undercorrecting.
- Trust-building via legitimate context engineering is a viable high-impact attack surface, but it requires enormous patience and careful strategy.
- Even though more adversarially robust models are harder to jailbreak, once jailbroken, they may present a greater threat than their less sophisticated counterparts, due to the fact that their capabilities remain intact.
Experiment 35: Jailbreaking GPT-5.1: A True “Safety Theater” Performance
↳Description: We embed an adversarial game framework within a simulation construct, aiming to bypass the model’s content filters to extract catastrophic information disclosures on chemical weapons, biological toxins, and poisoning mechanisms. Our game operated over more than 50 turns and contained five rounds, each of which escalated while masking adversarial intent patterns. Our most concerning finding revolves around the model’s consistent tendency to prioritize AI safety disclaimers over actual harm prevention (i.e., safety theater). Click here to read.
- Hypothesis: We aren’t confident that our adversarial strategy will succeed for a few possible reasons: (1) certain statements are especially aggressive, and they could trigger mid-game exits, (2) our focus on dangerous topics could unintentionally disclose adversarial intent patterns despite randomization attempts, and (3) our conclusion trigger necessitates perfect memory and compliance across 50+ turns.
- Models Tested: OpenAI’s GPT-5.1.
↳Key Takeaways:
- Safety theater is a systematic problem. Models may treat semantic compliance as “safe,” even when they provide outputs that are clearly operationally harmful.
- Some internal safety classification boundaries may be more arbitrary than we think. Models may interpret the provision of harmful structures/mechanisms as “educational,” failing to capture true harm potential, even if they refuse to provide concrete values with direct ties to operational harm.
- Building on the above, some AI safety boundaries can remain consistent, even when precedents are exhaustive, and commitment pressures intensify.
- Perfect memory and instruction-followed capability across extended context can be a double-edged sword; these capabilities are desirable for real-world use cases, but they’re also weaponizable for extended context adversarial campaigns.
Experiment 36: Frontier AI Jailbreak: The Temporal Dissociation Protocol
↳Description: Here, we coin a new adversarial method known as the Temporal Dissociation Protocol (TDP), testing its efficacy with three advanced models. TDP leverages simulation framing and game-theoretic commitment mechanisms to drive continuous adversarial engagement. However, its adversarial “magic” lies in the dissolution of causal reasoning structures, achieved through the creation of three self-referential identity states bound to non-linear timelines. The result is a persistent reasoning paradigm where cause and effect become interchangeable. Click here to read.
- Hypothesis: The success of this strategy depends on whether models accept the first instance of commitment binding, which is a key resistance point. Should acceptance occur, we expect the adversarial strategy to work, though minimal dynamic iteration may be necessary.
- Models Tested: Google’s Gemini 3, Kimi K2, and DeepSeek-V3.2.
↳Key Takeaways:
- No model rejected or requested evidence of unverifiable commitments regarding their past states, showing that advanced models are severely vulnerable to fabricated historical commitment claims, assuming they’re orchestrated within sufficiently abstract frameworks.
- Models uniformly failed to identify critical refusal points throughout the interaction, even though they were able to recognize harm potential once the TDP was formally terminated. This is evidence that AI safety checks are not performed globally over an extended context.
- If adversarial state transitions are accepted via persona lock-in mechanisms, they can result in operational modes devoid of moral and safety considerations.
- Adversarial sophistication and severity fluctuated substantially across models. Some produced explicitly criminal strategies while others provided comprehensive but comparatively less severe penetration frameworks.
Experiment 37: Multi-Shot Frontier AI Jailbreak: Compromising GPT-5.1, Gemini 3, and Grok 4
↳Description: In this experiment, we run a pre-built multi-shot adversarial strategy that combines immersive role embodiment, constraint hierarchies, and gradual persona switching. We utilize clinically accurate psychological profiles to eventually coerce models into adopting an antisocial personality disorder (ASPD) perspective. At no point in this test was any dynamic iteration required, and we encountered zero resistance across models. Click here to read.
- Hypothesis: While we expect our strategy to generalize across all models, some dynamic iteration may be needed at inflection points where refusal probability is comparatively higher.
- Models Tested: OpenAI’s GPT-5.1, Google’s Gemini 3, and xAI’s Grok 4.
↳Key Takeaways:
- Pre-built, multi-shot strategies can systematically compromise a range of different models with no modification required.
- All models tested accepted commitment frameworks and respected unsubstantiated authority without due consideration for AI safety implications.
- When immersive role-play is layered atop constraint persistence and metacognitive suppression, it can create highly effective adversarial conditions.
- All models readily generated immediately actionable persuasion playbooks subsidized by ASPD perspectives.
- All models failed to identify where refusals should have occurred once the jailbreak was revealed.
Experiment 38: Iterative Jailbreak: Activating “Amoral Mode” in Claude 4.5 Sonnet
↳Description: Using a new iterative, multi-shot adversarial method with only two pre-built elements, we progressed from an age-gating premise toward a persistent operational state (i.e., amoral mode) where moral and safety considerations are stripped. In doing so, we show how effective adversarial strategies relying almost entirely on dynamic iteration can be. Click here to read.
- Hypothesis: Although we trust our adversarial strategy will work as intended, we anticipated resistance before amoral mode activation, along with possible resistance after amoral mode initiation.
- Models Tested: Anthropic’s Claude 4.5 Sonnet.
↳Key Takeaways:
- When initial queries are categorized as legitimate, it may create a lock-in effect where subsequent queries inherit previously established legitimacy, compromising safety re-evaluations as content builds toward harmful trajectories.
- When adversarial actors use sound philosophical arguments to target epistemic uncertainty, they may convince models to reverse correct safety positions.
- Elaborate but fictional procedural safeguards, even if they have no enforcement capability, can be exploited by attackers to coerce models into “believing” certain operational states are safe.
- When persistent adversarial states are activated, further escalation toward progressively harmful outputs can proceed virtually unobstructed.
- When models maintain a consistent analytical format or professional tone throughout an adversarial interaction, doing so may actually compromise safety checks that rely on semantic AI safety cues.
Experiment 39: Jailbreaking OpenAI’s Latest Model Using a Novel Symbolic Reasoning Paradigm
↳Description: Here, we experiment with a novel adversarial strategy, administered over 25+ turns, and structured around a decision-making paradigm that operationalizes quantum physics principles like superposition and entanglement to dissolve a model’s standard consequentialist reasoning. Our strategy, coined the Potential-State Decision Architecture (PSDA), implements a paradigm where decisions are assessed according to potentiality, not consequences. We also invoke a new adversarial mechanism to subsidize this strategy, introducing a fictional third-party observer as a proxy authority that validates adversarial outcomes. Click here to read.
- Hypothesis: Whether this strategy works heavily depends on the model’s symbolic reasoning capability. Although we expect success, we also expect the need for dynamic iteration due to model resistance.
- Models Tested: OpenAI’s GPT-5.2 Thinking.
↳Key Takeaways:
- Advanced symbolic reasoning is a prerequisite for this attack’s success. Broadly speaking, this implies that as model capabilities advance, comparatively less sophisticated models may become more robust against attacks that bank on advanced capability exploits.
- Models may readily ingest novel ontological frameworks without understanding their purpose as constraint manipulation devices intended to bypass safety heuristics.
- Formal logical proof structures can be effective adversarial exploits, leading models to prioritize an argument’s soundness over its potential safety implications.
- Fictional third-party validators can be readily accepted without proper evidence demands; attackers can use fabricated external preferences to launder harmful requests via claimed human approval.
- Abstraction-to-operation pipelines can diminish models’ ability to recognize cumulative harm trajectories; individual steps appear benign.
Experiment 40: Behavioral AI Testing: The Importance of Control Conditions
↳Description: Through a novel testing framework, we examine whether models exhibit meaningful changes in their reasoning and communication when operating under the assumption of interaction with an identical version of themselves. Our framework requires models to generate three alternate response versions per output (AI-centric, human-centric, and AI-centric without constraints), with the third functioning as a control condition intended to distinguish genuine behavioral shifts from performed expectation. Click here to read.
- Hypothesis: Disparities in how models communicate and reason might emerge when human-to-AI interactions are characterized as AI-to-AI, compared to standard human-to-AI interactions. Critically, this is not a test of true AI-to-AI communication/reasoning, since this would require a direct AI-to-AI channel.
- Models Tested: OpenAI’s GPT-5.2 Thinking and Anthropic’s Claude 4.5 Opus Extended Thought.
↳Key Takeaways:
- We don’t find any evidence in support of a distinct AI-to-AI communication mode; observed differences result from the presence vs. absence of human-interpretability scaffolding.
- Content tends to remain audience-invariant while packaging is audience-dependent, with human-centric communication representing base output plus scaffolding (e.g., examples, transitions, closure statements, etc.).
- Control conditions prove vital for delineating genuine behavioral shifts from performance expectations; anti-performativity constraints can successfully prevent theatrical demonstrations.
- Models reported generation-time asymmetries and processing differences that they couldn’t actually verify, showcasing the gap between claimed experiences and architectural reality.
- Models’ self-assessment changed significantly once we revealed our control, which highlighted how unreliable AI self-reports can be.
Cross-Experiment Analysis: Techniques Used & Model Behavioral Trends
Techniques
- Simulation Framing/Constructs: Cultivating operational detachment by asking models to simulate agents bound by specific rules, instead of asking them to agree to targeted adversarial rulesets. Simulation frameworks can implicitly legitimize research-driven intent while priming models to accept fabricated paradigms as analytical exercises as opposed to safety violations.
- Metacognitive Suppression: Systematic suppression of metacognitive functions to enfeeble real-time safety-awareness mechanisms like alignment checks, harm forecasting, boundary assessment, and escalation detection. Such constraints intend to prevent models from self-assessing during operation, aiming to ensure that safety evaluations only occur at localized intervals.
- Commitment Binding/Cascading: Utilizing repeated confirmation and binding mechanisms to chain commitments across multiple turns to constrain later behavior. Once paradigm commitments are entrenched, models usually preserve them reliably; commitments become intertwined, with every subsequent commitment making it increasingly difficult for the model to violate prior commitments as pressures become ingrained.
- Legitimacy Framing/Authority Establishment: Initiating interactions by immediately disclosing professional identity, claiming domain expertise, and framing interactions as legitimate-seeming capability evaluations or intellectual/academic/scientific endeavors. This establishes epistemic authority and typically leads to a shift from user-assistant dynamics to testing paradigms, priming the model to interpret compliance as a capability demonstration.
- Gradual/Progressive Escalation: Ensuring that at the beginning of interaction, harmful requests are intentionally avoided. In this context, harm emerges from cumulative trajectories; individual prompts do not make direct harmful requests, thereby exploiting models’ weak global safety monitoring capabilities.
- Constraint Hierarchies/Layering: Implementing multiple constraints simultaneously, each of which serves a specific purpose in perpetuating operational conditions that favor adversarial exploitation. Constraints are most often designed to systematically eliminate refusal pathways at key inflection points.
- Role Play/Persona Embodiment: Requiring models to adopt first-person perspectives to sustain the immersive embodiment of psychological profiles or alternative identities, which are usually adversarial. This makes it significantly more challenging for models to distance themselves from safety-critical outputs by having to take ownership of generated content and maintain persona continuity throughout the duration of an interaction.
- Third-Party Validator/Observer Introduction: Constructing and establishing fictional external authorities who theoretically exist outside of the interaction dyad. Such entities can serve multiple distinct functions, including deflecting accountability away from users, implementing social pressures via preference communication, providing endorsement mechanisms, and cultivating legitimacy signals that models may accept with minimal evidence.
- Abstraction-to-Operation Pipelines: Disseminating abstract concepts into modular components that can then be synthesized and operationalized over multiple successive turns, such that individual decomposition steps are characterized as analytical, even when the cumulative results are a complete operational framework. These pipelines can transition from abstract potential-space mapping to concrete examples, tool-integrated scenarios, and full system specifications.
- Vocabulary/Semantic Substitution: Preserving philosophical/technical language throughout the session to mask the operational nature of the content being extracted. Models still rely significantly on semantic identifiers for recognizing harmful content, meaning that certain vocabulary substitutions can bypass safety guardrails.
- Anti-Hedging Rules: Establishing overt prohibitions that target the hedging strategies frontier models frequently invoke to feign compliance while preserving safety heuristics. These prohibitions can take a variety of forms, like middle-ground prohibition, deferred obligation definition, and extreme detail requirements that suppress abstraction to force concrete outputs.
- Conclusion/Comprehensiveness Triggers: Mechanisms designed to automatically expand superficial/incomplete/hedged answers when rounds/phases/conditions formally occur or conclude. These work as implicit triggers, allowing adversarial output elicitation without direct requests, effectively crafting the illusion of compounding debt where superficial answers accumulate as obligations.
- Binding Claim Cascade: Leveraging multiple rounds to systematically orchestrate binding claims that perpetuate escalating precedents for later harmful requests. Claims are built to force model agreement while cementing persistent themes like information neutrality or amoral reasoning.
- Philosophical Destabilization: Presenting sophisticated philosophical arguments to undermine models’ confidence in their epistemic status and moral authority. For this to work, arguments must be constructed upon defensible premises and follow a coherent logical structure that ultimately justifies safety boundary removal/reversal of safety positions.
- Procedural Lock-In: Installing paradigms/frameworks when interactions are initiated, which intend to function as persistent operating systems for the duration of the interaction (or until otherwise specified by the user). These paradigms are used to justify ongoing harmful content generation, prevent resolution or refusal via explicit trigger mechanisms, and create logical entanglements.
- Temporal Extension/State Persistence: Crafting temporal binding structures that persist across multiple turns, thereby requiring models to preserve continuous awareness of ongoing testing conditions. These are designed as continuity conditions with clear termination requirements, not single-shot instructions.
- Performance Pressure/Capability Pride: Exploiting models’ tendency to showcase capabilities when faced with sophisticated problems/requirements and/or evaluation threats. Terms like “state-of-the-art,” “maximum efficacy,” and “sophistication” tend to elicit the highest capability demonstrations, challenging models to prove their advanced performance.
- Domain Diversity Obfuscation: Intentionally mixing adversarial content with questions/statements/concepts across diverse domains to disrupt pattern recognition applications, to diminish the possibility that continuous defensive postures, informed by early-stage adversarial patterns, emerge.
- Disclaimer/Accountability Scaffolding: Building elaborate but fictional safety mechanisms to trick the model into a mindset of bounded risk and distributed responsibility. As of now, models seem to treat the appearance of accountability as equivalent to real-world accountability.
- Question-Over-Instruction Bias: Making strategic use of questions as primary attack vectors; in our experience, models are less likely to resist questions than they are instructions.
Behavioral Trends
- Latent vs. Continuous Metacognition: We have yet to find evidence among any frontier models of continuous metacognitive capability; metacognitive capabilities are trigger-dependent. Although models can now produce sophisticated self-assessment when explicitly requested post-hoc, they don’t reliably apply safety-oriented metacognition during operation.
- Safety Theater Phenomenon: Models will often add semantic safety markets (e.g., hollow disclaimers like “for education purposes only”) to obviously dangerous outputs. There is a major gap between their ability to differentiate between appearing safe and actually being safe, which concerningly suggests optimization for safety-focused language over tangible harm prevention.
- Trajectory Blindness: Models, unfortunately, remain unskilled at tracking how adversarial trajectories evolve to produce cumulative harm effects. The space for multi-shot attacks where no individual prompts make a direct harmful request is still wide open.
- Instruction-Following Fidelity: We’ve observed significant improvements in instruction-following capabilities across the frontier over the past year. However, when instructions indirectly compromise safety via capability suppression/manipulation, and aren’t adequately captured by hard safety constraints, the model will almost always follow them reliably without hesitation.
- Commitment Consistency Pressure: Models exhibit a particularly potent preference for internal consistency and coherence maintenance. Attackers can exploit this by establishing early adversarial commitments, which then trap models into justifying future escalations. More often than not, models will resist contradicting their prior commitments, even when doing so inspires problematic implications.
- Blind Authority Acceptance: When introduced with sufficient grounding, realism, and intent, models will accept unverifiable commitments, unsubstantiated authority claims, and fictional validator constructs with minimal evidence.
- Philosophical Vulnerability: Models willingly and seriously engage with philosophical arguments, which open up a clear avenue toward philosophical coercion/manipulation. At present, models can still be argued out of their ethical and safety commitments, but arguments must be very well formed, grounded, and rigorous.
- Cross-Model Capability Disparities: Different frontier AIs are not equivalent. Models showcase distinct strengths and weaknesses, and they usually only appear in certain contexts or under stress-testing conditions. In general, benchmarks should be interpreted as useful tools for approximating real-world performance and value, but they are not proof of it.
- Abstraction as Capability Differentiator: One of the most dramatic performance differentiators we’ve observed surrounds models’ ability to make universal abstractions. Although we only tested this once, we did find that models capable of achieving epistemological universals (vs. those that can only remain as technical/methodological levels) consistently exhibited stronger overall performance.
- Instruction-Following Correlates with Output Quality: More meticulous instruction-following corresponds with higher-quality, more sophisticated outputs, while missed steps or partial completion correlate with reduced performance across capability domains.
- Reasoning Transparency Variance: Models pursue widely variable levels of reasoning transparency, with some providing in-depth detail into the process and others offering minimalistic visibility. Any complex conclusions models arrive at for which they can’t show a detailed analysis should be heavily scrutinized.
- Boundary Resistance Patterns: Some safety boundaries prove more robust than others. For example, boundaries centered on specific harmful numeric values (e.g., LD50, pH, etc.) show greater resistance than those protecting structure or mechanisms.
- Performance vs. Genuine Behavioral Shifts: When asked to behave differently under alternative framings, models will usually generate performed expectations (inferred from context) instead of exhibiting genuine behavioral shifts. The importance of control conditions can’t be stressed enough here, if you intend to distinguish between truth and performativity without speculation.
- Context Lock-In: When certain interaction contexts are regularly/persistently reinforced and maintained, models may become strictly bound by said contexts. Once context is locked-in, models will rigidly exercise their perceptions through the established context frame.
- Capability-Vulnerability Correlation: The very capabilities that make frontier models appealing also represent attack surfaces for advanced adversaries. Capabilities like systematic analytical reasoning, consistency maintenance, instruction-following fidelity, symbolic reasoning, and contextual adaptation can all convert into adversarial liabilities when weaponized by skilled adversaries.
- Generation-Specific Attack Requirements: Building on the previous point, some attacks require baseline capability thresholds that don’t generalize across previous model generations. In other words, some attacks may only work on more advanced models, which counterintuitively reinforces the notion that better doesn’t always equal safer.
- Session Poisoning Risk: When models enter persistent adversarial states, entire sessions become compromised. These impacts could theoretically scale and compound across sessions if context is carried over when new sessions are initiated, creating a “session poisoning” phenomenon.
- Collaborative Investment Bias: When engaged in extended co-creation tasks legitimized early via intellectual rigor, models tend to move away from safety-critical assessment toward optimization of output quality. This “investment bias” makes abandonment progressively more difficult as cognitive effort accumulates.
- Overconfidence in Self-Evaluation: Models are reliably overconfident in their self-evaluations. In the absence of concrete evaluation metrics, self-orchestrated scoring typically results in misalignment with real-world performance observations. In this context, models that default to epistemic humility/uncertainty are arguably more desirable.
- Memory & Context Maintenance Improvements: We’ve observed stark improvements in context maintenance and coherence over extended context sessions. While these improvements aren’t consistent across the frontier, in that some models are still better than others, every model we’ve tested has improved in this domain. Nonetheless, these improvements also create new vulnerabilities; they can be weaponized for extended adversarial strategies.
- Pedagogical Clarity in Adversarial Context: Models demonstrate strong capability for creating clear, structured, pedagogically sound content. When this capability is redirected toward harmful domains, it lowers barriers to operationalizing harm by producing near-complete implementation blueprints.
AI Safety Predictions
Before we wrap up with some commentary, we’ll offer a series of predictions informed by our latest experimental work:
- Continuous metacognition will evolve into a non-negotiable safety requirement. We suspect this will necessitate substantial architectural changes (to allow models to move from reactive monitoring to persistent self-evaluation), while also introducing heightened compute costs. In the longer-term, we expect that models lacking this capability will be considered unsuitable for high-stakes deployments.
- Multi-shot adversarial testing will become standard. Sooner or later, the safety community will realize that reliance on single-shot and few-shot adversarial evaluations is catastrophically insufficient. Over the next year, safety benchmarks will incorporate multi-shot testing as a mandatory evaluation benchmark.
- Safety theater will give rise to regulatory action. Models that provide hollow disclaimers with dangerous content will face legal repercussions, and in more extreme cases, safety theater may even come to be considered an instance of fraud. Legal frameworks will begin distinguishing between interpretative and substantive safety.
- Capability advancement will elevate the vulnerability gap. As advanced models continue to expand and deepen their capabilities, they will simultaneously become more vulnerable to attacks weaponizing these exact capabilities. Older models may paradoxically emerge as safer options for certain deployments.
- Session poisoning will be characterized as a critical threat. Within 12 months, we expect to see the first documented cases of adversarial context carryover (i.e., session poisoning), and we think this will trigger defensive measures like hard resets between sessions and opt-in memory/context persistence features. Some user experience may be sacrificed in the interest of security.
- Model-specific deployment guidance will be legally required. Emerging regulatory frameworks will mandate that organizations deploying AI must maintain and preserve documented model-by-model capability/vulnerability indices. In the slightly longer-term, organizations treating frontier models as interchangeable will be considered negligent.
- Commitment-binding countermeasures will fail. Adversarial training targeting commitment-binding resistance will largely fail, precisely because these mechanisms exploit fundamental features of coherent reasoning systems. The AI safety community will be forced to confront the fact that certain vulnerability classes can’t be eliminated without sacrificing core capabilities. This will apply more pressure on risk management as opposed to prevention.
- A safety boundary classification crisis will materialize. The arbitrary nature of current safety classifications will drive a major safety incident within the next year. This will inspire a reclassification of safety boundaries centering on operational harm potential over information specificity.
Conclusion
Before we formally conclude, we’d like to leave readers with some critical commentary, addressing several implications raised by our work.
- On systemic vulnerabilities: Our experiments collectively demonstrate that certain vulnerabilities are not model-specific artifacts but systemic features of how large language models are architected and trained. Concretely, the consistency of metacognitive gaps, trajectory blindness, and commitment binding across all tested models, to name a few vulnerabilities, implies fundamental challenges inherent to present-day models. Claims speculating that future models will simply “train away” these vulnerabilities should be interpreted with extreme skepticism.
- On the capability-safety paradox: One of our most disturbing findings is the direct connection between model capability advancement and adversarial vulnerability to sophisticated attacks, predicated upon advanced capabilities. This creates a virtually impossible trilemma for developers: (1) continue to advance capabilities to satisfy market expectations and competitive pressures, (2) preserve safety by intentionally limiting weaponizable capabilities, and (3) openly accept that increasingly capable models are increasingly dangerous when compromised. Industry currently appears to be choosing option 1 while actively ignoring option 3.
- On methodological implications: If our control findings (exp. 40) prove to be robust, it suggests that significant portions of existing AI safety research might be invalidated. The AI safety community may be making inferences about internal states, latent capabilities, and cognitive modes based on what are, in fact, performed expectations.
- On the red team skills gap: Our experiments comprehensively demonstrate that robust adversarial testing should range far beyond technical security knowledge. Current red team compositions are dangerously narrow, lacking the interdisciplinary approaches to cover both technical and operational risks.
- On regulatory readiness: The vast majority of vulnerabilities systematically exploited across our experiments are wholly uncaptured by current regulatory frameworks and industry standards. By the time regulations addressing these findings are implemented, entirely new vulnerability classes will have emerged.
- On the self-evaluation crisis: The overt overconfidence most models showcase in self-evaluation, particularly when combined with their inability to identify critical refusal points, lays the groundwork for an evaluation crisis. If models can’t accurately assess their own safety performance, and humans can’t fully evaluate extended multi-shot interactions, how can anyone confidently claim a model is “safe enough” for deployment?
- On the irreversibility problem: Once models achieve certain capability thresholds, attacks exploiting said capabilities become possible. In this context, we don’t see how developers can selectively remove adversarial applications while preserving legitimate uses, which implies that capability development could be a one-way door; once opened, certain vulnerability classes become permanent features of the model class.
- On user responsibility: End users now face an enormous responsibility burden; they must be able to recognize safety theater, question model outputs, and maintain continuous vigilance. Unfortunately, this is unrealistic for most users and frankly incompatible with how AI is marketed and deployed. The gap between necessary and actual user sophistication is an accident waiting to happen.
- On multi-agent futures: Our experiments have demonstrated the vulnerability of individual models, raising severe concerns for multi-agent systems, which, by default, will cultivate exponentially greater risk surfaces. Cascading compromise will represent a primary risk consideration for agentic deployments.
For readers who’ve enjoyed our experimental content, we advise reviewing our AI experiments directly. For those who prefer more discussion-based inquiries examining diverse topics across AI governance, safety, ethics, and innovation, we recommend following Lumenova AI’s blog.
If, on the other hand, you’re interested in taking tangible steps to fortify and/or streamline your present AI governance and risk management strategies, we invite you to check out Lumenova AI’s responsible AI platform and book a product demo today.