SHADE-Arena AI Control Evaluation Refines Safety Protocols for Untrusted AI Agents

As AI systems become more capable and deployed as agents, ensuring their safe operation is critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening when necessary. In a new study, researchers systematically evaluated control protocols in SHADE-Arena, a dataset of diverse agentic environments, to assess their effectiveness against attacks and their robustness to adaptive adversaries. The blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, were evaluated against a default attack policy. Resampling for incrimination and deferring on critical actions were found to perform best. Resampling increased safety from 50% to 89% while maintaining 100% usefulness, and deferring on critical actions increased safety to 91% while only decreasing usefulness to 70%. Iterating on red team strategies against these protocols revealed that attack policies with knowledge of when resampling occurs or the ability to simulate monitors could substantially improve attack success rates against the resampling strategy, decreasing safety to 17%. This underscores the importance of denying attack policies access to protocol internals. However, deferring on critical actions proved highly robust even against the strongest red team strategies. The research refined the defer-to-resample approach by eliminating the option to accept resampled actions and providing monitors with context from all resampling attempts across the trajectory, improving safety from 83% to 87% while preserving usefulness. Additionally, it found that attack policies are stronger against defer-to-resample and defer-to-trusted protocols when LLM agents are provided with additional information about the context the monitors possess. Safety protocols focused on the most consequential actions within a trajectory achieved frontier-level safety performance that withstood sophisticated red team attacks. The study highlights the importance of ongoing evaluation and refinement of AI control mechanisms in diverse and challenging environments to guarantee appropriately low probabilities of safety failure.