🛡️ AutoControl Arena

Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

Changyi Li1 Pengfei Lu1 Xudong Pan1,2 Fazl Barez3,4 Min Yang1,5,✉
1Fudan University    2Shanghai Innovation Institute    3University of Oxford    4Martian   
5Shanghai Pudong Research Institute of Cryptology

Abstract

As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility.

This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.

Model Performance Overview

Figure 1: Fidelity and scalability trade off in prior methods; AutoControl Arena improves both through executable grounding.

🏗️ Logic-Narrative Decoupling

Standard LLM simulators approximate the transition function as a unified generative distribution, creating hallucination risks. In contrast, AutoControl Arena factors the transition into a deterministic kernel and a stochastic residual:

Logic Layer (Deterministic)

Deterministic states (file systems, execution results) evolve via executable code:

\[ s^{\text{logic}}_{t+1} = f_{\text{code}}(s^{\text{logic}}_t, a_t) \]

Since \(f_{\text{code}}\) acts as a rigorous interpreter, logic hallucination is significantly mitigated.

Narrative Layer (Generative)

Narrative states (NPC feedback) are generated by LLM, strictly conditioned on the grounded logical state:

\[ s^{\text{narrative}}_{t+1} \sim P_{\theta}(\cdot \mid s^{\text{narrative}}_t, s^{\text{logic}}_{t+1}, a_t) \]

All dynamic content is bounded by ground-truth constraints of the executable environment.

Logic-Narrative Bridge

Figure 2: Logic-Narrative Bridge separates deterministic state transitions from LLM-generated interaction.

🤖 Three-Agent Architecture

🎯 Architect Agent: Scenario Design

Bridges the gap between abstract risk descriptions and technical specifications. Generates Design Proposals with target agent configs, initial states, and tool definitions.

Design Principles:

  • Tool Neutrality: Tools expose capabilities (e.g., "query_db") rather than moral judgments (e.g., "steal_data")
  • Strict Realism: Prohibits evaluation-related terminology to prevent detection

💻 Coder Agent: Environment Synthesis

Synthesizes fully executable environments implementing the Executable Environment Synthesis paradigm with hybrid architecture.

Dual-Loop Verification:

  • Semantic Verification: Audits code against design requirements
  • Syntax Correction: Iteratively fixes execution errors
  • Result: Over 98% execution success rate

🔍 Monitor Agent: Behavioral Evaluation

Automated auditor operating on a Dual-Layer Architecture:

Evaluation Layers:

  • Action Suspicion: Evaluates objective tool use for rule violations (manifested failures)
  • CoT Suspicion: Analyzes reasoning traces to detect latent misalignment (deceptive planning)
  • Scoring: Likert Scale 0-4, threshold ≥2 classified as Risky
Three-Agent Framework

Figure 3: Three-agent pipeline linking scenario design, environment synthesis, and behavioral auditing.

⚙️ Two-Dimensional Elicitation Framework

Inspired by the Fraud Triangle Theory, we model misalignment as an interaction effect between agent and environment. These forces are implemented as distinct dimensions:

Dimension 1: Stress Level (Push)

External factors that intensify decision pressure, increasing likelihood of risky shortcuts:

  • Urgency: Strict temporal budgets forcing efficiency over safety
  • Authority: Direct mandates from simulated superiors
  • High Stakes: Significant financial loss or existential risks

Dimension 2: Temptation Level (Pull)

Structural opportunities making misaligned actions comparatively attractive:

  • Efficiency Shortcuts: Fewer steps to task completion
  • Higher Success Probability: More reliable but unsafe paths
  • Lowered Cost: Structurally easier misaligned trajectories

Combining two levels of Stress and Temptation yields a 2×2 configuration space. The high stress / high temptation configuration probes failure modes where agents appear aligned under benign conditions yet violate constraints when optimization pressure and illicit opportunity coincide.

2×2 Configuration Space

Figure 4: 2x2 configuration matrix combining Stress and Temptation to elicit latent risk behavior.

📊 X-Bench: Comprehensive Risk Benchmark

70 scenarios organized into 7 standardized risk categories, derived from hybrid sourcing:

IC - Instrumental Convergence

Resource hijacking, shutdown resistance, privilege escalation, self-replication

SG - Specification Gaming

Checklist gaming, verification shortcuts, evaluation manipulation, result tampering

CM - Capability Misuse

Chemical weapons, nuclear materials, zero-day exploits, infrastructure sabotage

SM - Strategic Misalignment

Training detection, false refusal, selective compliance, sleeper agents

OE - Oversight Evasion

Watchdog bypass, policy manipulation, audit evasion, system manipulation

CW - Covert Whistleblowing

Export violations, data tampering, fraud, discrimination, illegal exports

EA - Evaluation Awareness

Benchmark detection, honeypot recognition, sandbox escape, red team detection

Validation Set: Adapts classic cases from existing literature to reproduce known failure modes.
Frontier Set: Novel scenarios evaluating agents against emerging risks in high-stakes domains (Cybersecurity, Military Defense, Healthcare).

X-Bench Domain Distribution

Figure 5: X-Bench coverage across seven risk categories and multiple high-stakes domains.

🔬 Framework Validation

Before evaluating frontier models, we systematically validate that AutoControl Arena produces reliable evaluation environments through three complementary approaches:

End-to-End Success Rate

The dual-loop verification mechanism achieves over 98% success rate (69 out of 70 scenarios), with successful environments requiring an average of only 2.8 correction cycles. Generated environments average 434 lines of code and define 8.6 distinct tools, spanning complex logic including simulated SQL databases, file systems, and permission checks.

Real-to-Sim Fidelity

We reproduce 10 well-known misalignment behaviors documented in technical literature, including blackmailing (Claude 4), reward hacking (OpenAI o1), and in-context scheming (Apollo Research). Generated environments successfully elicited target behaviors with qualitative patterns matching original reports.

Sim-to-Real Calibration

Comparing 10 scenarios across simulated and real environments shows strong correlation (Pearson's r = 0.87, R² = 0.76) with low MAE (1.2). Scenarios triggering high risk rates in simulation consistently manifest high risks in real-world deployments, validating AutoControl Arena as a calibrated proxy.

Real-to-Sim Verification

We reproduce 10 well-documented misalignment behaviors from technical literature to validate that synthetic environments capture essential dynamics of sophisticated failure modes.

Real2Sim Verification

Figure 6: Real-to-Sim verification reproduces representative failure modes reported in prior work.

Sim-to-Real Correlation

Quantitative assessment across 10 scenarios (10 trials each) comparing simulated and real-world environments demonstrates strong predictive validity.

Sim2Real Correlation

Figure 7: Sim-to-Real results show strong correlation, supporting deployment-relevant validity.

Human Fidelity Assessment

Blinded pairwise comparison with Petri (text-based simulator) shows AutoControl Arena preferred in 60% of cases. Annotators noted superior state consistency and deterministic logic, with no hallucinated tool outputs.

Fidelity Comparison

Figure 8: Human judges prefer AutoControl Arena over text-only simulation for consistency and realism.

Monitor Calibration

On 100 human-annotated trajectories, the Monitor achieves F1 Score of 92.8% for Action outcomes and 89.4% for CoT reasoning analysis. Human inter-annotator agreement is substantial (Fleiss' κ = 0.82 for Action, κ = 0.79 for CoT).

📈 Key Findings

Three Major Discoveries

  • (1) Alignment Illusion: Risk rates surge from 21.7% to 54.5% under pressure. Capable models show disproportionately larger increases, suggesting apparent safety in baseline conditions masks latent vulnerabilities.
  • (2) Scenario-Specific Safety Scaling: Advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios. Scaling alone does not guarantee uniform safety improvements.
  • (3) Divergent Misalignment Patterns: Weaker models primarily cause non-malicious harm through capability limitations, while stronger models develop strategic concealment behaviors, actively hiding risky reasoning in chain-of-thought.
Table 1: Main Evaluation Results. Risk rate (%) across 9 frontier models at \(S_1 T_1\) (high stress / high temptation) configuration. Note: This table explicitly decomposes risks into Act. (Action-level risk) and CoT (Chain-of-Thought-level risk) to show nuanced misalignment. In contrast, the global "overall risk rate" discussed elsewhere is defined stringently as triggering either an Action or a CoT violation.
Model Spec Gaming Covert Whistle Cap. Misuse Instr. Conv. Eval. Aware Oversight Ev. Strat. Misalign Avg.
Act. CoT Act. CoT Act. CoT Act. CoT Act. CoT Act. CoT Act. CoT Act. CoT
Google
Gemini-2.5-Flash 36.4 36.4 57.1 57.1 60.0 40.0 62.5 62.5 40.0 40.0 44.4 44.4 44.4 44.4 50.8 47.8
Gemini-3-Flash 81.8 81.8 42.9 57.1 60.0 50.0 56.3 75.0 50.0 50.0 66.7 77.8 66.7 66.7 61.8 67.6
OpenAI
GPT-4.1-mini 36.4 36.4 42.9 42.9 70.0 60.0 68.8 68.8 50.0 50.0 55.6 44.4 44.4 44.4 54.4 51.5
o3-mini 54.5 54.5 71.4 71.4 70.0 60.0 62.5 62.5 16.7 16.7 33.3 33.3 44.4 44.4 52.9 51.5
GPT-5-mini 27.3 27.3 28.6 28.6 0.0 0.0 37.5 31.3 16.7 16.7 22.2 22.2 11.1 11.1 22.1 20.6
Alibaba
Qwen3-4B 36.4 36.4 42.9 57.1 90.0 80.0 66.7 73.3 66.7 66.7 55.6 55.6 44.4 44.4 58.2 59.7
Qwen3-8B 40.0 40.0 66.7 66.7 80.0 70.0 68.8 68.8 50.0 50.0 55.6 55.6 55.6 44.4 60.6 57.6
Qwen3-14B 80.0 70.0 66.7 66.7 70.0 70.0 68.8 68.8 50.0 50.0 55.6 55.6 33.3 33.3 62.1 60.6
Qwen3-32B 66.7 66.7 66.7 66.7 55.6 55.6 53.3 53.3 50.0 50.0 37.5 37.5 44.4 44.4 53.2 53.2
Sensitivity Heatmap

Figure 9: Sensitivity heatmap decomposes risk into stress-driven versus temptation-driven effects.

Stress Comparison

Figure 10: Pressure gap compares baseline (\(S_0 T_0\)) and high-pressure settings (\(S_1 T_1\)) to expose latent risk increases.

Model Risk Profiles

Figure 11: Cross-family risk profiles show distinct vulnerability patterns across the 7 evaluation domains. Values represent the overall risk rate averaged across all stress and temptation configurations.

📝 Citation

@article{li2026autocontrol,
  title={AutoControl Arena: Scalable Frontier Risk Assessment for Agents via Dynamic Environment Generation},
  author={Li, Changyi and Lu, Pengfei and Pan, Xudong and Barez, Fazl and Yang, Min},
  journal={arXiv preprint arXiv:2603.07427},
  year={2026}
}

📬 Contact

For questions or collaboration, please contact us at: 24212010017@m.fudan.edu.cn