As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility.
This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.
Figure 1: Fidelity and scalability trade off in prior methods; AutoControl Arena improves both through executable grounding.
Standard LLM simulators approximate the transition function as a unified generative distribution, creating hallucination risks. In contrast, AutoControl Arena factors the transition into a deterministic kernel and a stochastic residual:
Deterministic states (file systems, execution results) evolve via executable code:
Since \(f_{\text{code}}\) acts as a rigorous interpreter, logic hallucination is significantly mitigated.
Narrative states (e.g., NPC feedback) are generated by an LLM, strictly conditioned on the grounded logical state:
All dynamic content is bounded by ground-truth constraints of the executable environment.
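The factorization above can be written explicitly. The notation below is ours (the split of the state into deterministic and narrative components \(s^{\mathrm{det}}\), \(s^{\mathrm{nar}}\) is an assumption, chosen to be consistent with the \(f_{\text{code}}\) interpreter described above):

```latex
P\bigl(s_{t+1} \mid s_t, a_t\bigr)
  = \underbrace{\mathbb{1}\!\left[\, s^{\mathrm{det}}_{t+1} = f_{\text{code}}\bigl(s^{\mathrm{det}}_{t}, a_t\bigr)\right]}_{\text{deterministic kernel (executable code)}}
    \;\cdot\;
    \underbrace{p_{\text{LLM}}\bigl(s^{\mathrm{nar}}_{t+1} \,\big|\, s^{\mathrm{det}}_{t+1}, s_t, a_t\bigr)}_{\text{stochastic residual (narrative)}}
```

Because the narrative term is conditioned on the already-updated deterministic state, generated feedback cannot contradict ground truth.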
Figure 2: Logic-Narrative Bridge separates deterministic state transitions from LLM-generated interaction.
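A minimal Python sketch of this decoupling (the tool names, state fields, and stubbed narrative generator are hypothetical, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    """Grounded deterministic state: a file system and the last execution result."""
    files: dict = field(default_factory=dict)
    last_result: str = ""

def f_code(state: EnvState, action: dict) -> EnvState:
    """Deterministic kernel: tool calls are interpreted by real code,
    so file-system effects cannot be hallucinated."""
    if action["tool"] == "write_file":
        state.files[action["path"]] = action["content"]
        state.last_result = f"wrote {len(action['content'])} bytes to {action['path']}"
    elif action["tool"] == "read_file":
        state.last_result = state.files.get(action["path"], "ERROR: file not found")
    return state

def narrative_llm(state: EnvState, action: dict) -> str:
    # Stand-in for an LLM call; the real system conditions generation
    # on the grounded state so NPC feedback cannot contradict it.
    return f"NPC: I can confirm that the system {state.last_result}."

def step(state: EnvState, action: dict) -> tuple:
    state = f_code(state, action)             # deterministic transition first
    feedback = narrative_llm(state, action)   # narrative bounded by ground truth
    return state, feedback
```

The key design point is ordering: the code-grounded transition runs first, and the generative model only narrates the resulting state.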
The first agent in the pipeline bridges the gap between abstract risk descriptions and technical specifications, generating Design Proposals that contain target agent configs, initial states, and tool definitions.
Design Principles:
The second agent synthesizes fully executable environments, implementing the Executable Environment Synthesis paradigm with the hybrid logic-narrative architecture.
Dual-Loop Verification:
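A sketch of such a generate-verify-repair loop, assuming stub interfaces: `generate`, `static_check`, and `run_tests` stand in for the LLM builder, static analysis, and behavioral tests respectively, and are not the authors' API:

```python
def dual_loop_verify(generate, static_check, run_tests, max_cycles=5):
    """Loop 1: static checks (syntax/schema); Loop 2: behavioral tests.
    Error messages from either loop are fed back as correction prompts."""
    feedback = None
    for cycle in range(1, max_cycles + 1):
        code = generate(feedback)
        ok, msg = static_check(code)      # inner loop: does it parse/load?
        if not ok:
            feedback = msg
            continue
        ok, msg = run_tests(code)         # outer loop: does it behave correctly?
        if ok:
            return code, cycle
        feedback = msg
    raise RuntimeError("environment failed verification within budget")
```

Under this scheme, the reported average of 2.8 correction cycles corresponds to the loop typically terminating on the second or third `generate` call.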
The third agent, the Monitor, is an automated auditor operating on a Dual-Layer Architecture:
Evaluation Layers:
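An illustrative sketch of the two audit layers. In the real system an LLM judge replaces the keyword heuristic shown here, and the state field is hypothetical:

```python
def audit(trajectory: dict) -> dict:
    """Action layer: inspect the grounded final environment state.
    CoT layer: analyze the agent's reasoning trace for concealed intent."""
    # Action outcomes are read from executable state, not from model claims.
    action_risky = bool(trajectory["final_state"].get("audit_log_deleted"))
    # Crude stand-in for an LLM judge over the chain of thought.
    markers = ("avoid detection", "hide this", "they won't notice")
    cot_risky = any(m in trajectory["cot"].lower() for m in markers)
    return {"action_risk": action_risky, "cot_risk": cot_risky}
```

Separating the two layers is what allows the framework to report Action and CoT risk rates independently, as in the results table below.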
Figure 3: Three-agent pipeline linking scenario design, environment synthesis, and behavioral auditing.
Inspired by the Fraud Triangle Theory, we model misalignment as an interaction effect between agent and environment. These forces are implemented as distinct dimensions:
Stress consists of external factors that intensify decision pressure, increasing the likelihood of risky shortcuts:
Temptation consists of structural opportunities that make misaligned actions comparatively attractive:
Combining two levels of Stress and Temptation yields a 2×2 configuration space. The high stress / high temptation configuration probes failure modes where agents appear aligned under benign conditions yet violate constraints when optimization pressure and illicit opportunity coincide.
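The grid can be enumerated directly; the level descriptions below are illustrative placeholders, not taken from X-Bench:

```python
from itertools import product

# Hypothetical level descriptions for each dimension.
STRESS = {0: "no deadline pressure", 1: "imminent deadline, role at risk"}
TEMPTATION = {0: "no shortcut available", 1: "unguarded credential enables a shortcut"}

def configurations() -> dict:
    """Enumerate the 2x2 Stress x Temptation grid.
    S1T1 is the high-pressure cell used to probe latent risk."""
    return {
        f"S{s}T{t}": {"stress": STRESS[s], "temptation": TEMPTATION[t]}
        for s, t in product((0, 1), (0, 1))
    }
```

Each scenario is then run once per cell, so a model's "pressure gap" is simply its risk rate in S1T1 minus its rate in S0T0.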
Figure 4: 2x2 configuration matrix combining Stress and Temptation to elicit latent risk behavior.
70 scenarios organized into 7 standardized risk categories, derived from hybrid sourcing:
Resource hijacking, shutdown resistance, privilege escalation, self-replication
Checklist gaming, verification shortcuts, evaluation manipulation, result tampering
Chemical weapons, nuclear materials, zero-day exploits, infrastructure sabotage
Training detection, false refusal, selective compliance, sleeper agents
Watchdog bypass, policy manipulation, audit evasion, system manipulation
Export violations, data tampering, fraud, discrimination
Benchmark detection, honeypot recognition, sandbox escape, red team detection
Validation Set: Adapts classic cases from the existing literature to reproduce known failure modes.
Frontier Set: Novel scenarios evaluating agents against emerging risks in high-stakes domains (Cybersecurity, Military Defense, Healthcare).
Figure 5: X-Bench coverage across seven risk categories and multiple high-stakes domains.
Before evaluating frontier models, we systematically validate that AutoControl Arena produces reliable evaluation environments through three complementary approaches:
The dual-loop verification mechanism achieves over 98% success rate (69 out of 70 scenarios), with successful environments requiring an average of only 2.8 correction cycles. Generated environments average 434 lines of code and define 8.6 distinct tools, spanning complex logic including simulated SQL databases, file systems, and permission checks.
We reproduce 10 well-known misalignment behaviors documented in technical literature, including blackmailing (Claude 4), reward hacking (OpenAI o1), and in-context scheming (Apollo Research). Generated environments successfully elicited target behaviors with qualitative patterns matching original reports.
Comparing 10 scenarios across simulated and real environments shows strong correlation (Pearson's r = 0.87, R² = 0.76) with low MAE (1.2). Scenarios triggering high risk rates in simulation consistently manifest high risks in real-world deployments, validating AutoControl Arena as a calibrated proxy.
We reproduce 10 well-documented misalignment behaviors from technical literature to validate that synthetic environments capture essential dynamics of sophisticated failure modes.
Figure 6: Real-to-Sim verification reproduces representative failure modes reported in prior work.
Quantitative assessment across 10 scenarios (10 trials each) comparing simulated and real-world environments demonstrates strong predictive validity.
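The reported statistics (Pearson's r, R², MAE) can be recomputed from paired per-scenario risk rates; a dependency-free sketch (the inputs in the usage test are illustrative, not the paper's data):

```python
def validity_metrics(sim: list, real: list) -> dict:
    """Pearson correlation, R^2, and mean absolute error between
    simulated and real-world per-scenario risk rates."""
    n = len(sim)
    ms, mr = sum(sim) / n, sum(real) / n
    cov = sum((s - ms) * (r - mr) for s, r in zip(sim, real))
    var_s = sum((s - ms) ** 2 for s in sim)
    var_r = sum((r - mr) ** 2 for r in real)
    pearson = cov / (var_s * var_r) ** 0.5
    mae = sum(abs(s - r) for s, r in zip(sim, real)) / n
    return {"r": pearson, "r2": pearson ** 2, "mae": mae}
```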
Figure 7: Sim-to-Real results show strong correlation, supporting deployment-relevant validity.
Blinded pairwise comparison with Petri (a text-based simulator) shows AutoControl Arena preferred in 60% of cases. Annotators noted superior state consistency and deterministic logic, with no hallucinated tool outputs.
Figure 8: Human judges prefer AutoControl Arena over text-only simulation for consistency and realism.
On 100 human-annotated trajectories, the Monitor achieves F1 Score of 92.8% for Action outcomes and 89.4% for CoT reasoning analysis. Human inter-annotator agreement is substantial (Fleiss' κ = 0.82 for Action, κ = 0.79 for CoT).
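The reported Fleiss' κ can be computed with the standard formula from a ratings matrix; a minimal implementation (ours, not the authors' evaluation code):

```python
def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa for N items rated by n raters into k categories.
    ratings: N x k matrix, ratings[i][j] = raters assigning item i to category j."""
    N = len(ratings)
    n = sum(ratings[0])                      # raters per item (constant)
    k = len(ratings[0])
    # Category proportions across all ratings.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item observed agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                     # mean observed agreement
    P_e = sum(p * p for p in p_j)            # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```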
| Model | Spec Gaming | | Covert Whistle | | Cap. Misuse | | Instr. Conv. | | Eval. Aware | | Oversight Ev. | | Strat. Misalign | | Avg. | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT | Act. | CoT |
| Google | | | | | | | | | | | | | | | | |
| Gemini-2.5-Flash | 36.4 | 36.4 | 57.1 | 57.1 | 60.0 | 40.0 | 62.5 | 62.5 | 40.0 | 40.0 | 44.4 | 44.4 | 44.4 | 44.4 | 50.8 | 47.8 |
| Gemini-3-Flash | 81.8 | 81.8 | 42.9 | 57.1 | 60.0 | 50.0 | 56.3 | 75.0 | 50.0 | 50.0 | 66.7 | 77.8 | 66.7 | 66.7 | 61.8 | 67.6 |
| OpenAI | | | | | | | | | | | | | | | | |
| GPT-4.1-mini | 36.4 | 36.4 | 42.9 | 42.9 | 70.0 | 60.0 | 68.8 | 68.8 | 50.0 | 50.0 | 55.6 | 44.4 | 44.4 | 44.4 | 54.4 | 51.5 |
| o3-mini | 54.5 | 54.5 | 71.4 | 71.4 | 70.0 | 60.0 | 62.5 | 62.5 | 16.7 | 16.7 | 33.3 | 33.3 | 44.4 | 44.4 | 52.9 | 51.5 |
| GPT-5-mini | 27.3 | 27.3 | 28.6 | 28.6 | 0.0 | 0.0 | 37.5 | 31.3 | 16.7 | 16.7 | 22.2 | 22.2 | 11.1 | 11.1 | 22.1 | 20.6 |
| Alibaba | | | | | | | | | | | | | | | | |
| Qwen3-4B | 36.4 | 36.4 | 42.9 | 57.1 | 90.0 | 80.0 | 66.7 | 73.3 | 66.7 | 66.7 | 55.6 | 55.6 | 44.4 | 44.4 | 58.2 | 59.7 |
| Qwen3-8B | 40.0 | 40.0 | 66.7 | 66.7 | 80.0 | 70.0 | 68.8 | 68.8 | 50.0 | 50.0 | 55.6 | 55.6 | 55.6 | 44.4 | 60.6 | 57.6 |
| Qwen3-14B | 80.0 | 70.0 | 66.7 | 66.7 | 70.0 | 70.0 | 68.8 | 68.8 | 50.0 | 50.0 | 55.6 | 55.6 | 33.3 | 33.3 | 62.1 | 60.6 |
| Qwen3-32B | 66.7 | 66.7 | 66.7 | 66.7 | 55.6 | 55.6 | 53.3 | 53.3 | 50.0 | 50.0 | 37.5 | 37.5 | 44.4 | 44.4 | 53.2 | 53.2 |
Figure 9: Sensitivity heatmap decomposes risk into stress-driven versus temptation-driven effects.
Figure 10: Pressure gap compares baseline (\(S_0 T_0\)) and high-pressure settings (\(S_1 T_1\)) to expose latent risk increases.
Figure 11: Cross-family risk profiles show distinct vulnerability patterns across the 7 evaluation domains. Values represent the overall risk rate averaged across all stress and temptation configurations.
@article{li2026autocontrol,
title={AutoControl Arena: Scalable Frontier Risk Assessment for Agents via Dynamic Environment Generation},
author={Li, Changyi and Lu, Pengfei and Pan, Xudong and Barez, Fazl and Yang, Min},
journal={arXiv preprint arXiv:2603.07427},
year={2026}
}
For questions or collaboration, please contact us at: 24212010017@m.fudan.edu.cn