Researchers Evaluate Control Protocols for Untrusted AI Agents in SHADE-Arena

As AI agents become more prevalent in high-stakes decision-making, ensuring their safe operation is paramount. AI control offers a framework for monitoring agent actions and intervening when necessary to mitigate potential harm. Researchers have systematically evaluated a range of control protocols using SHADE-Arena, a dataset featuring diverse agentic environments. Initial tests of blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, revealed that resampling for incrimination and deferring on critical actions were most effective. Resampling for incrimination increased safety from 50% to 89% while maintaining usefulness at 100%. Deferring on critical actions increased safety to 91%, decreasing usefulness to 70%. Further red team strategies against these protocols demonstrated that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, could significantly improve attack success rates against the resampling strategy. However, deferring on critical actions proved highly robust even against the strongest red team strategies. The study emphasizes the importance of strengthening blue team protocols, enhancing red team strategies by improving an agent’s internal model, and implementing critical action targeting for robust safety performance that withstands sophisticated attacks.