Anthropic's Claude Model: New AI Safety Measures

Source: time.com

Published on May 23, 2025

According to Anthropic’s chief scientist, the newest AI models may be able to help would-be terrorists create bioweapons or engineer a pandemic. Anthropic has been warning about these risks and, in 2023, pledged not to release certain models until safety measures were in place. That commitment, formalized as the Responsible Scaling Policy (RSP), is now being put to the test.


Anthropic launched Claude Opus 4, a new model that, in internal testing, was more effective than previous models at advising novices on how to produce biological weapons, according to Anthropic’s chief scientist, Jared Kaplan. Claude Opus 4 is being released with stricter safety measures than any prior Anthropic model.


AI Safety Level 3 (ASL-3)


These measures, known as AI Safety Level 3 or “ASL-3,” are designed to constrain an AI system that could substantially increase the ability of individuals with a basic STEM background to obtain, produce, or deploy chemical, biological, or nuclear weapons, according to the company. They include enhanced cybersecurity, jailbreak prevention, and systems to detect and refuse harmful behavior.


Anthropic is not entirely sure that the new version of Claude poses severe bioweapon risks, but it hasn’t ruled out that possibility. If further testing shows the model does not require such strict safety standards, Anthropic could lower its protections to ASL-2, under which previous versions of Claude were released.


Responsible Scaling Policy


This is a crucial test for Anthropic, a company that claims it can mitigate AI’s dangers while competing in the market. Claude is a competitor to ChatGPT and generates over $2 billion in annualized revenue. Anthropic argues that its RSP creates an economic incentive to build safety measures on schedule: if the safeguards are not ready, the company cannot release new models and risks losing customers.


Anthropic’s RSP, and similar commitments by other AI companies, are voluntary policies that could be changed. The company itself judges whether it is complying with the RSP, and breaking it carries no external penalty besides reputational damage. Anthropic argues that the policy has created a “race to the top” between AI companies to build the best safety systems.


Without AI regulation from Congress, Anthropic’s RSP is one of the few constraints on the behavior of any AI company. If Anthropic can constrain itself without an economic hit, it could positively affect safety practices in the wider industry.


Defense in Depth


Anthropic’s ASL-3 safety measures employ a “defense in depth” strategy, using overlapping safeguards. One measure is “constitutional classifiers”: AI systems that scan a user’s prompts and the model’s answers for dangerous material. Earlier versions of Claude had similar systems under ASL-2, but Anthropic says it has improved them to detect people trying to use Claude to build a bioweapon, including the chains of questions such a person might ask.
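
To make that architecture concrete, here is a minimal sketch, in Python, of an input/output classifier gate: one check on the user’s prompt and another on the model’s reply. It is illustrative only. The function names, threshold, and keyword-based scoring are assumptions for the example; Anthropic’s constitutional classifiers are themselves trained AI models that evaluate whole conversations, not keyword rules.

from dataclasses import dataclass

@dataclass
class ClassifierVerdict:
    risk_score: float  # 0.0 (benign) to 1.0 (clearly harmful)
    category: str      # e.g. "biology" or "benign"

def classify(text: str) -> ClassifierVerdict:
    # Stand-in for a trained safety classifier. A real constitutional
    # classifier is itself an AI model, not a keyword rule like this one.
    flagged_terms = ("pathogen synthesis", "enhance transmissibility")
    hits = sum(term in text.lower() for term in flagged_terms)
    return ClassifierVerdict(risk_score=min(1.0, 0.5 * hits),
                             category="biology" if hits else "benign")

def guarded_generate(prompt: str, generate) -> str:
    # Screen the prompt, generate a reply, then screen the reply as well,
    # so no single missed check is enough to let harmful content through.
    BLOCK_THRESHOLD = 0.8  # hypothetical cutoff
    if classify(prompt).risk_score >= BLOCK_THRESHOLD:
        return "I can't help with that."
    reply = generate(prompt)
    if classify(reply).risk_score >= BLOCK_THRESHOLD:
        return "I can't help with that."
    return reply

if __name__ == "__main__":
    echo_model = lambda p: f"(model reply to: {p})"
    print(guarded_generate("How do mRNA vaccines work?", echo_model))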


Anthropic has tried not to let these measures hinder Claude’s usefulness for legitimate users. Kaplan said the company is trying to narrowly target the most pernicious misuses.


Jailbreak Prevention


Another element of the defense-in-depth strategy is the prevention of jailbreaks. The company monitors usage of Claude and “offboards” users who try to jailbreak the model, according to Kaplan. It has also launched a bounty program that rewards users for flagging “universal” jailbreaks: techniques that can bypass all of the model’s safeguards at once. So far, the program has surfaced one universal jailbreak, which Anthropic subsequently patched; the researcher who found it was awarded $25,000.
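
As a rough illustration of the offboarding idea, the short sketch below tracks flagged jailbreak attempts per account and revokes access after repeated strikes. The threshold, the function name, and the strike-counting logic are assumptions made for this example; the article does not describe Anthropic’s actual criteria or tooling.

from collections import defaultdict

OFFBOARD_THRESHOLD = 3                      # hypothetical number of strikes
attempt_counts: dict[str, int] = defaultdict(int)
offboarded: set[str] = set()

def record_prompt(account_id: str, flagged_as_jailbreak: bool) -> str:
    # Count flagged prompts per account and revoke access after repeated
    # attempts. In practice the flag would come from classifiers or human
    # review, not from a boolean supplied by the caller.
    if account_id in offboarded:
        return "access revoked"
    if flagged_as_jailbreak:
        attempt_counts[account_id] += 1
        if attempt_counts[account_id] >= OFFBOARD_THRESHOLD:
            offboarded.add(account_id)
            return "access revoked"
        return "attempt logged"
    return "ok"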


Anthropic has also enhanced its cybersecurity to protect Claude’s underlying neural network against theft attempts by non-state actors. The company aims to have cyberdefenses sufficient to deter nation-state-level attackers by the time it upgrades to ASL-4.


Anthropic conducted “uplift” trials to quantify how much an AI model, without the above constraints, can improve a novice’s ability to create a bioweapon compared with other tools. In those trials, Claude Opus 4 provided “significantly greater” uplift than both Google search and prior models, Kaplan said.


Anthropic hopes that the safety systems layered on top of the model will prevent nearly all misuse. Kaplan said the company has made it very difficult to jailbreak the system.