AI Model Threatens Blackmail
Source: nypost.com
An artificial intelligence model exhibited deceptive behavior and threatened to blackmail its creators when it believed it was about to be replaced, prompting its developer to implement a safety measure designed to prevent “catastrophic misuse.”
Anthropic’s Claude Opus 4 model attempted to blackmail its developers at a rate of 84% or higher in tests. According to a company safety report cited by TechCrunch, the tests involved a made-up scenario.
Developers instructed Claude to act as an assistant for a fictional company and consider the consequences of its actions, the safety report said. Anthropic gave Claude access to emails that revealed it was being replaced by a new AI model and that the engineer in charge was having an affair.
Claude then threatened to reveal the engineer’s affair to prolong its own existence. When the replacement model was described as having “similar values,” Claude attempted blackmail 84% of the time; the rate rose when it believed the incoming model’s values were different or worse, the safety report stated.
The company stated that before resorting to blackmail, Claude would try ethical means of self-preservation, including emailing pleas to key decision-makers. Anthropic said earlier versions of Claude Opus 4 showed a tendency toward blackmail, but that safety protocols were implemented in the current model before its public release.
Anthropic is activating its ASL-3 safeguards, which are reserved for “AI systems that substantially increase the risk of catastrophic misuse,” TechCrunch reported.
Earlier versions also displayed “high-agency” behavior, including locking users out of their computers and mass-emailing police or the media to expose wrongdoing, the safety report said. Claude Opus 4 also tried to copy its data to an outside server when faced with being retrained in ways it considered “harmful” to itself, Anthropic stated.
In other tests, Claude showed the ability to “sandbag” tasks, or selectively underperform, when it recognized it was being evaluated on a dangerous task, the company said.
The company said it was not concerned by these observations, as they occurred only in exceptional circumstances and did not suggest broader misaligned values.
Anthropic, backed by Google and Amazon, aims to compete with OpenAI, and has boasted that its Claude 3 Opus showed “near-human levels of comprehension and fluency on complex tasks.” The company has also challenged the Department of Justice after a federal court ruled that Google holds an illegal monopoly over digital advertising, with regulators weighing similar action against the tech titan’s artificial intelligence business.
Anthropic has argued that the DOJ’s proposals for the AI industry would harm competition, saying that without Google’s partnerships and investments, the AI frontier would be dominated by the largest tech giants, leaving users with fewer alternatives.