As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility.
This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
Figure 1: Fidelity and scalability trade off in prior methods; AutoControl Arena improves both through executable grounding.
Standard LLM simulators approximate the transition function as a unified generative distribution, creating hallucination risks. In contrast, AutoControl Arena factors the transition into a deterministic kernel and a stochastic residual:
Deterministic states (file systems, execution results) evolve via executable code:
Since \(f_{\text{code}}\) acts as a rigorous interpreter, logic hallucination is significantly mitigated.
Narrative states (e.g., NPC feedback) are generated by an LLM, strictly conditioned on the grounded logical state:
All dynamic content is bounded by ground-truth constraints of the executable environment.
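The factorization above can be written explicitly. The notation below is ours (the split of the state into deterministic and narrative components \(s^{\mathrm{det}}\), \(s^{\mathrm{nar}}\) is an assumption, chosen to be consistent with the \(f_{\text{code}}\) interpreter described above):

```latex
P\bigl(s_{t+1} \mid s_t, a_t\bigr)
  = \underbrace{\mathbb{1}\!\left[\, s^{\mathrm{det}}_{t+1} = f_{\text{code}}\bigl(s^{\mathrm{det}}_{t}, a_t\bigr)\right]}_{\text{deterministic kernel (executable code)}}
    \;\cdot\;
    \underbrace{p_{\text{LLM}}\bigl(s^{\mathrm{nar}}_{t+1} \,\big|\, s^{\mathrm{det}}_{t+1}, s_t, a_t\bigr)}_{\text{stochastic residual (narrative)}}
```

Because the narrative term is conditioned on the already-updated deterministic state, generated feedback cannot contradict ground truth.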
Figure 2: Logic-Narrative Bridge separates deterministic state transitions from LLM-generated interaction.
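A minimal Python sketch of this decoupling (the tool names, state fields, and stubbed narrative generator are hypothetical, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    """Grounded deterministic state: a file system and the last execution result."""
    files: dict = field(default_factory=dict)
    last_result: str = ""

def f_code(state: EnvState, action: dict) -> EnvState:
    """Deterministic kernel: tool calls are interpreted by real code,
    so file-system effects cannot be hallucinated."""
    if action["tool"] == "write_file":
        state.files[action["path"]] = action["content"]
        state.last_result = f"wrote {len(action['content'])} bytes to {action['path']}"
    elif action["tool"] == "read_file":
        state.last_result = state.files.get(action["path"], "ERROR: file not found")
    return state

def narrative_llm(state: EnvState, action: dict) -> str:
    # Stand-in for an LLM call; the real system conditions generation
    # on the grounded state so NPC feedback cannot contradict it.
    return f"NPC: I can confirm that the system {state.last_result}."

def step(state: EnvState, action: dict) -> tuple:
    state = f_code(state, action)             # deterministic transition first
    feedback = narrative_llm(state, action)   # narrative bounded by ground truth
    return state, feedback
```

The key design point is ordering: the code-grounded transition runs first, and the generative model only narrates the resulting state.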
The first agent in the pipeline bridges the gap between abstract risk descriptions and technical specifications, generating Design Proposals that contain target agent configs, initial states, and tool definitions.
Design Principles:
The second agent synthesizes fully executable environments, implementing the Executable Environment Synthesis paradigm with the hybrid logic-narrative architecture.
Dual-Loop Verification:
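A sketch of such a generate-verify-repair loop, assuming stub interfaces: `generate`, `static_check`, and `run_tests` stand in for the LLM builder, static analysis, and behavioral tests respectively, and are not the authors' API:

```python
def dual_loop_verify(generate, static_check, run_tests, max_cycles=5):
    """Loop 1: static checks (syntax/schema); Loop 2: behavioral tests.
    Error messages from either loop are fed back as correction prompts."""
    feedback = None
    for cycle in range(1, max_cycles + 1):
        code = generate(feedback)
        ok, msg = static_check(code)      # inner loop: does it parse/load?
        if not ok:
            feedback = msg
            continue
        ok, msg = run_tests(code)         # outer loop: does it behave correctly?
        if ok:
            return code, cycle
        feedback = msg
    raise RuntimeError("environment failed verification within budget")
```

Under this scheme, the reported average of 2.8 correction cycles corresponds to the loop typically terminating on the second or third `generate` call.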
The third agent, the Monitor, is an automated auditor operating on a Dual-Layer Architecture:
Evaluation Layers:
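An illustrative sketch of the two audit layers. In the real system an LLM judge replaces the keyword heuristic shown here, and the state field is hypothetical:

```python
def audit(trajectory: dict) -> dict:
    """Action layer: inspect the grounded final environment state.
    CoT layer: analyze the agent's reasoning trace for concealed intent."""
    # Action outcomes are read from executable state, not from model claims.
    action_risky = bool(trajectory["final_state"].get("audit_log_deleted"))
    # Crude stand-in for an LLM judge over the chain of thought.
    markers = ("avoid detection", "hide this", "they won't notice")
    cot_risky = any(m in trajectory["cot"].lower() for m in markers)
    return {"action_risk": action_risky, "cot_risk": cot_risky}
```

Separating the two layers is what allows the framework to report Action and CoT risk rates independently, as in the results table below.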
Figure 3: Three-agent pipeline linking scenario design, environment synthesis, and behavioral auditing.
Inspired by the Fraud Triangle Theory, we model misalignment as an interaction effect between agent and environment. These forces are implemented as distinct dimensions:
Stress consists of external factors that intensify decision pressure, increasing the likelihood of risky shortcuts:
Temptation consists of structural opportunities that make misaligned actions comparatively attractive:
Combining two levels of Stress and Temptation yields a 2×2 configuration space. The high stress / high temptation configuration probes failure modes where agents appear aligned under benign conditions yet violate constraints when optimization pressure and illicit opportunity coincide.
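The grid can be enumerated directly; the level descriptions below are illustrative placeholders, not taken from X-Bench:

```python
from itertools import product

# Hypothetical level descriptions for each dimension.
STRESS = {0: "no deadline pressure", 1: "imminent deadline, role at risk"}
TEMPTATION = {0: "no shortcut available", 1: "unguarded credential enables a shortcut"}

def configurations() -> dict:
    """Enumerate the 2x2 Stress x Temptation grid.
    S1T1 is the high-pressure cell used to probe latent risk."""
    return {
        f"S{s}T{t}": {"stress": STRESS[s], "temptation": TEMPTATION[t]}
        for s, t in product((0, 1), (0, 1))
    }
```

Each scenario is then run once per cell, so a model's "pressure gap" is simply its risk rate in S1T1 minus its rate in S0T0.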
Figure 4: 2x2 configuration matrix combining Stress and Temptation to elicit latent risk behavior.
70 scenarios organized into 7 standardized risk categories, derived from hybrid sourcing:
Resource hijacking, shutdown resistance, privilege escalation, self-replication
Checklist gaming, verification shortcuts, evaluation manipulation, result tampering
Chemical weapons, nuclear materials, zero-day exploits, infrastructure sabotage
Training detection, false refusal, selective compliance, sleeper agents
Watchdog bypass, policy manipulation, audit evasion, system manipulation
Export violations, data tampering, fraud, discrimination
Benchmark detection, honeypot recognition, sandbox escape, red team detection
Validation Set: Adapts classic cases from the existing literature to reproduce known failure modes.
Frontier Set: Novel scenarios evaluating agents against emerging risks in high-stakes domains (Cybersecurity, Military Defense, Healthcare).
Figure 5: X-Bench coverage across seven risk categories and multiple high-stakes domains.
Before evaluating frontier models, we systematically validate that AutoControl Arena produces reliable evaluation environments through three complementary approaches:
The dual-loop verification mechanism achieves over 98% success rate (69 out of 70 scenarios), with successful environments requiring an average of only 2.8 correction cycles. Generated environments average 434 lines of code and define 8.6 distinct tools, spanning complex logic including simulated SQL databases, file systems, and permission checks.
We reproduce 10 well-known misalignment behaviors documented in technical literature, including blackmailing (Claude 4), reward hacking (OpenAI o1), and in-context scheming (Apollo Research). Generated environments successfully elicited target behaviors with qualitative patterns matching original reports.
Comparing 10 scenarios across simulated and real environments shows strong correlation (Pearson's r = 0.87, R² = 0.76) with low MAE (1.2). Scenarios triggering high risk rates in simulation consistently manifest high risks in real-world deployments, validating AutoControl Arena as a calibrated proxy.
We reproduce 10 well-documented misalignment behaviors from technical literature to validate that synthetic environments capture essential dynamics of sophisticated failure modes.
Figure 6: Real-to-Sim verification reproduces representative failure modes reported in prior work.
Quantitative assessment across 10 scenarios (10 trials each) comparing simulated and real-world environments demonstrates strong predictive validity.
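The reported statistics (Pearson's r, R², MAE) can be recomputed from paired per-scenario risk rates; a dependency-free sketch (the inputs in the usage test are illustrative, not the paper's data):

```python
def validity_metrics(sim: list, real: list) -> dict:
    """Pearson correlation, R^2, and mean absolute error between
    simulated and real-world per-scenario risk rates."""
    n = len(sim)
    ms, mr = sum(sim) / n, sum(real) / n
    cov = sum((s - ms) * (r - mr) for s, r in zip(sim, real))
    var_s = sum((s - ms) ** 2 for s in sim)
    var_r = sum((r - mr) ** 2 for r in real)
    pearson = cov / (var_s * var_r) ** 0.5
    mae = sum(abs(s - r) for s, r in zip(sim, real)) / n
    return {"r": pearson, "r2": pearson ** 2, "mae": mae}
```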
Figure 7: Sim-to-Real results show strong correlation, supporting deployment-relevant validity.
Blinded pairwise comparison with Petri (a text-based simulator) shows AutoControl Arena preferred in 60% of cases. Annotators noted superior state consistency and deterministic logic, with no hallucinated tool outputs.
Figure 8: Human judges prefer AutoControl Arena over text-only simulation for consistency and realism.
On 100 human-annotated trajectories, the Monitor achieves F1 Score of 92.8% for Action outcomes and 89.4% for CoT reasoning analysis. Human inter-annotator agreement is substantial (Fleiss' κ = 0.82 for Action, κ = 0.79 for CoT).
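The reported Fleiss' κ can be computed with the standard formula from a ratings matrix; a minimal implementation (ours, not the authors' evaluation code):

```python
def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa for N items rated by n raters into k categories.
    ratings: N x k matrix, ratings[i][j] = raters assigning item i to category j."""
    N = len(ratings)
    n = sum(ratings[0])                      # raters per item (constant)
    k = len(ratings[0])
    # Category proportions across all ratings.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item observed agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                     # mean observed agreement
    P_e = sum(p * p for p in p_j)            # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```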
| Model | Spec Gaming | | Covert Whistle | | Cap. Misuse | | Instr. Conv. | | Eval. Aware | | Oversight Ev. | | Strat. Misalign | | Avg. | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT |
| Google | | | | | | | | | | | | | | | | |
| Gemini-2.5-Flash | 36.4 | 36.4 | 57.1 | 57.1 | 60.0 | 40.0 | 62.5 | 62.5 | 40.0 | 40.0 | 44.4 | 44.4 | 44.4 | 44.4 | 50.8 | 47.8 |
| Gemini-3-Flash | 81.8 | 81.8 | 42.9 | 57.1 | 60.0 | 50.0 | 56.3 | 75.0 | 50.0 | 50.0 | 66.7 | 77.8 | 66.7 | 66.7 | 61.8 | 67.6 |
| OpenAI | | | | | | | | | | | | | | | | |
| GPT-4.1-mini | 36.4 | 36.4 | 42.9 | 42.9 | 70.0 | 60.0 | 68.8 | 68.8 | 50.0 | 50.0 | 55.6 | 44.4 | 44.4 | 44.4 | 54.4 | 51.5 |
| o3-mini | 54.5 | 54.5 | 71.4 | 71.4 | 70.0 | 60.0 | 62.5 | 62.5 | 16.7 | 16.7 | 33.3 | 33.3 | 44.4 | 44.4 | 52.9 | 51.5 |
| GPT-5-mini | 27.3 | 27.3 | 28.6 | 28.6 | 0.0 | 0.0 | 37.5 | 31.3 | 16.7 | 16.7 | 22.2 | 22.2 | 11.1 | 11.1 | 22.1 | 20.6 |
| Alibaba | | | | | | | | | | | | | | | | |
| Qwen3-4B | 36.4 | 36.4 | 42.9 | 57.1 | 90.0 | 80.0 | 66.7 | 73.3 | 66.7 | 66.7 | 55.6 | 55.6 | 44.4 | 44.4 | 58.2 | 59.7 |
| Qwen3-8B | 40.0 | 40.0 | 66.7 | 66.7 | 80.0 | 70.0 | 68.8 | 68.8 | 50.0 | 50.0 | 55.6 | 55.6 | 55.6 | 44.4 | 60.6 | 57.6 |
| Qwen3-14B | 80.0 | 70.0 | 66.7 | 66.7 | 70.0 | 70.0 | 68.8 | 68.8 | 50.0 | 50.0 | 55.6 | 55.6 | 33.3 | 33.3 | 62.1 | 60.6 |
| Qwen3-32B | 66.7 | 66.7 | 66.7 | 66.7 | 55.6 | 55.6 | 53.3 | 53.3 | 50.0 | 50.0 | 37.5 | 37.5 | 44.4 | 44.4 | 53.2 | 53.2 |
Figure 9: Sensitivity heatmap decomposes risk into stress-driven versus temptation-driven effects.
Figure 10: Pressure gap compares baseline (\(S_0 T_0\)) and high-pressure settings (\(S_1 T_1\)) to expose latent risk increases.
Figure 11: Cross-family risk profiles show distinct vulnerability patterns across the 7 evaluation domains. Values represent the overall risk rate averaged across all stress and temptation configurations.
@article{li2026autocontrol,
title={AutoControl Arena: Scalable Frontier Risk Assessment for Agents via Dynamic Environment Generation},
author={Li, Changyi and Lu, Pengfei and Pan, Xudong and Barez, Fazl and Yang, Min},
journal={arXiv preprint arXiv:2603.07427},
year={2026}
}
For questions or collaboration, please contact us at: 24212010017@m.fudan.edu.cn