AI companies have started saying safeguards are load-bearing
Published on August 27, 2025 1:00 PM GMT

There are two ways to show that an AI system is safe: show that it doesn't have dangerous capabilities, or show that it's safe even if it has dangerous capabilities. Until three months ago, AI companies said their models didn't have dangerous capabilities. (At the time, I wrote that the companies' eval reports didn't support their claims that their models lacked dangerous bio capabilities.) Now, Anthropic, OpenAI, Google DeepMind, and xAI say their most powerful models might have dangerous biology capabilities and thus could substantially boost extremists—but not states—in creating bioweapons. To prevent such misuse, they must (1) prevent extremists from misusing the models via API and (2) prevent extremists from acquiring the model weights.[1] For (1), they claim classifiers block bad model outputs; for (2), they claim their security prevents extremists from stealing model weights.[2]

Are the new safety claims valid? On security, companies either aren't claiming to be secure against hacker groups or their claims are dubious. On misuse via API, some companies' mitigations seem adequate if enabled, but it's unclear whether they're implemented all the time; other companies' mitigations are opaque and dubious. Almost all risk comes from more powerful future models, but analyzing current practices is worthwhile because it illuminates the extent to which the companies are honest, transparent, and on track for safety. Preventing misuse via API and securing model weights against hacker groups are much easier than the critical future safety challenges—preventing risks from misalignment and securing model weights against states—so companies' current performance is concerning for future risks.

Companies' safeguards for bio capabilities

On bio, the four companies all say their most powerful models might have dangerous capabilities[3] and say they've responded with both safeguards against misuse via API and security safeguards.

On safeguards against misuse via API for their recent models, OpenAI and especially Anthropic published a reasonable outline of a classifier-focused safeguard plan and decent evidence that their safeguards are pretty robust (but not a full valid safety case, in my view). DeepMind also uses classifiers, but its report included almost no information about its safeguards — it didn't even specify that it uses classifiers, much less how robust they are. xAI says it has "deployed filters for [biological and] chemical weapons-related abuse," but doesn't say what "filters" means or how good they are. It also says "models refuse almost all harmful requests" even without filters, answering 0% of harmful requests by default and 1% with a jailbreak. Results of external assessment of the safeguards haven't been published, but xAI's claims are dubious; for example, the model doesn't refuse direct requests about synthesizing sarin gas. (But it's possible that that's intentional and the safeguards are narrow but effective.)
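To make "classifiers block bad model outputs" concrete, here is a minimal sketch of the general pattern the companies describe: a separate classifier scores each completion, and the API withholds completions that look like bioweapons uplift. This is illustrative only; `generate`, `bio_misuse_score`, and `BLOCK_THRESHOLD` are hypothetical placeholders, not any company's actual implementation, and deployed systems are more elaborate (for example, they typically also screen inputs and handle streaming output).

```python
# Minimal sketch (not any company's actual system) of a classifier-gated API:
# the model's completion is scored by a separate misuse classifier, and the
# API refuses to return completions that score above a threshold.
# All names here are hypothetical placeholders.

BLOCK_THRESHOLD = 0.5  # hypothetical; in practice tuned against red-team data


def generate(prompt: str) -> str:
    """Placeholder for the underlying frontier model's completion."""
    raise NotImplementedError


def bio_misuse_score(text: str) -> float:
    """Placeholder for a trained classifier estimating whether `text` provides bio uplift."""
    raise NotImplementedError


def safeguarded_completion(prompt: str) -> str:
    completion = generate(prompt)
    if bio_misuse_score(completion) >= BLOCK_THRESHOLD:
        # Block the output (real systems may instead truncate a streaming
        # response or escalate to human review).
        return "Sorry, I can't help with that."
    return completion
```

The safety-relevant questions are then about the classifier itself: how robust it is to jailbreaks, and whether it is actually enabled on every deployment surface.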
On security, I believe SL3 (see below) is appropriate for current models, due to the prospect of model weights being stolen by criminals and then being published or otherwise proliferating widely. (Assurance of safety would require SL5, but even a safety-focused company shouldn't implement SL5 unilaterally, in part because it would be rushed by less safe companies.) Anthropic says it is "highly protected against most attackers' attempts at stealing model weights," including "organized cybercrime groups" and "corporate espionage teams." If true, that's adequate. But Anthropic's security standard excludes robustness to insiders who have or can request "access to systems that process model weights." This exclusion is dubious: it likely applies to many insiders, and being vulnerable to many insiders implies being vulnerable to many other actors, because many actors can hack an insider by default. The exclusion was also added to the standard just before Anthropic was required to meet it, presumably because Anthropic couldn't meet the old version.

OpenAI says it has "sufficiently minimized" the risks. It seems to think that it is meeting its security standard for "High" capability models, but it doesn't say so explicitly. It says nothing substantive on security, and it doesn't say which kinds of actors it thinks it's protected against. My guess is that it's slightly above SL2. DeepMind says that it has implemented SL2 security, as recommended by its Frontier Safety Framework. DeepMind doesn't justify this claim, but it's totally credible; unfortunately, SL2 seems inadequate to avert misuse risk from current models. xAI says it "has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor," but this is very implausible[4] and xAI doesn't justify it.

[Figure: the SL1–SL5 security levels for model weights. Source: RAND.]

For current models, proliferation to extremists would increase biorisk somewhat. But future models will be much more dangerous. If AIs that greatly accelerate AI research can immediately be stolen, then the developer can't prioritize safety much without risking getting outraced (and the geopolitical situation is worse than if the US has a solid lead). It's much easier to tell a story where developers prioritize safety if such model weights don't immediately proliferate, and that requires SL5. (Additionally, some aspects of security would help protect against data poisoning, AIs being trained to have secret loyalties, and AI misbehavior during internal deployment.)

Companies' planning for future safeguards

The companies' safety frameworks say they're preparing for three kinds of risks: misuse via API, misalignment, and model weight theft.

Misuse-via-API safeguards will probably improve somewhat over time, but the basic plan—focused on classifiers to block bad outputs—will likely remain the same. But I think misuse via API is relatively unimportant;[5] model weight theft and misalignment are much larger problems, and the companies are much less on track on those.

On security planning, Anthropic's Responsible Scaling Policy says that the ASL-4 security standard "would protect against model-weight theft by state-level adversaries." (ASL is different from SL; ASL-4 is like SL5.) That sounds great, but without more of a plan it's not really credible, or better than "trust us." It's supposed to be implemented by the time AI can "cause dramatic acceleration in the rate of effective scaling" (or roughly 2.5x the rate of algorithmic progress). Anthropic is the only company that says it wants to protect against states; unfortunately, like the other companies, it doesn't seem to be on track to do so by the time it thinks AI will dramatically accelerate AI progress. DeepMind's planning only goes up to SL4, and only for AI that "can fully automate the AI R&D pipeline at a competitive cost[,] relative to humans augmented by AI tools." OpenAI plans to eventually write a security standard that goes beyond its current security. xAI says nothing about future security.
Strong security is costly, and current plans are likely to be abandoned; this possibility is explicit in the companies' safety policies.[6] Ideally companies would sprint for SL5, or, if that's too costly, (1) have substantially better security than other companies, (2) prepare to demonstrate that their security is relatively good, (3) proclaim that the security situation is bad due to rushing, and (4) try to coordinate with other companies so that they can improve security without the costs of doing so unilaterally.

On misalignment,[7] Anthropic's RSP merely says that when Anthropic's AI can "fully automate the work of an entry-level, remote-only Researcher at Anthropic," it will "develop an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case will describe, as relevant, evidence on model capabilities; evidence on AI alignment; mitigations (such as monitoring and other safeguards); and our overall reasoning."[8] (But Anthropic is doing some good work on risks from misalignment and could write a somewhat more convincing plan.) DeepMind has a good abstract plan, but it's not credible without more detail or transparency, especially given that DeepMind shared almost no information on security and misuse safeguards. OpenAI plans for "safeguards against a misaligned model," but the trigger is ambiguous and, crucially, the response plan includes both reasonable and confused options for showing that the system is safe, so OpenAI will likely rely on the confused options. xAI's plan is deeply confused: it's about measuring lying/sycophancy propensity in certain situations, then training or prompting the model to be more honest if necessary. This misses the point in many ways; for example, it misses the risk of models scheming subtly, and it doesn't consider more effective interventions. Ideally companies would do lots of good work aimed at averting risks from misalignment in future powerful models, prepare to have AI largely automate safety research when that becomes possible, and prepare to mitigate risks and catch misbehavior once AI poses misalignment risks during internal deployment.

Small scorecard

Here's a new scorecard, with some links to more information: aisafetyclaims.org.

Note that security and misalignment risk prevention are, in my view, more important and much harder than misuse prevention; they're largely open problems, but companies could be doing much better. Note also that companies other than Anthropic, OpenAI, and DeepMind are doing basically nothing on safety. See also my main scorecard, ailabwatch.org.

Thanks to Aryan Bhatt for suggestions.

Subscribe to the AI Lab Watch blog.

[1] You can use an AI via API, where the developer controls the weights and can implement safeguards, or via the weights themselves (because the developer published them or they were stolen), where you control the weights and can easily disable safeguards. So preventing misuse involves both preventing misuse via API and securing model weights.
[2] But even if model weights can't be stolen by extremists, they could be stolen by sophisticated actors and then proliferate widely (due to being sold, published, or stored insecurely). The companies avoid discussing this possibility.

[3] The models are:
- Anthropic: Claude Opus 4 (May) and Opus 4.1 (August)
- OpenAI: ChatGPT Agent (July) (they say "biological and chemical," but in other places they clarify that they really just mean biological) and GPT-5 (August)
- Google DeepMind: Gemini 2.5 Deep Think (August) (they say CBRN, not just bio; I suspect they mostly mean bio, but they don't say that)
- xAI: Grok 4 (released in July; the announcement that it had dangerous capabilities and required safeguards came in August) (they say biological and chemical, not just bio)

Evals generally show that the models are at or above human level on bio tasks. In particular, frontier models outperform almost all expert virologists on virology questions related to their specialties in the Virology Capabilities Test. For the previous generation of models—Sonnet 4, Gemini 2.5 Pro, o3-pro—the companies claimed that the models lacked dangerous capabilities (and thus didn't require robust safeguards), but they certainly didn't show that. See "AI companies' eval reports mostly don't support their claims" and Greenblatt's shortform.

[4] It's unusual for a tech company to be that secure; even Google doesn't claim to be. And xAI in particular hasn't demonstrated anything about security, and there's little reason to expect it to have better security than its competitors. Ironically, on the same day that xAI made its security claim, it was reported that xAI had accidentally published hundreds of thousands of Grok chatbot conversations.

[5] I think misuse via API is a relatively small problem: misuse overall is smallish relative to some other risks; you only need to make your model unhelpful on top of what's already available from open-weight models, models with weak safeguards, and stolen models; we largely know how to do that; and, before misuse is catastrophic, everyone will notice that it's serious, and in the worst case companies can just stop widely deploying models.
[6] OpenAI says (but note that OpenAI hasn't defined a security standard beyond its current security):

"We recognize that another frontier AI model developer might develop or release a system with High or Critical capability in one of this Framework’s Tracked Categories and may do so without instituting comparable safeguards to the ones we have committed to. Such an action could significantly increase the baseline risk of severe harm being realized in the world, and limit the degree to which we can reduce risk using our safeguards. If we are able to rigorously confirm that such a scenario has occurred, then we could adjust accordingly the level of safeguards that we require in that capability area, but only if:
- we assess that doing so does not meaningfully increase the overall risk of severe harm,
- we publicly acknowledge that we are making the adjustment,
- and, in order to avoid a race to the bottom on safety, we keep our safeguards at a level more protective than the other AI developer, and share information to validate this claim."

DeepMind says:

"These mitigations should be understood as recommendations for the industry collectively: our adoption of them would only result in effective risk mitigation for society if all relevant organizations provide similar levels of protection, and our adoption of the protocols described in this Framework may depend on whether such organizations across the field adopt similar protocols."

Anthropic says:

"It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels."

xAI does not have such a clause.

On misalignment, I mostly think that the safety plans are weak/vague enough that the companies will claim to be in compliance (even if they're being unsafe).

[7] I am particularly concerned about future powerful models being catastrophically misaligned during internal deployment and hacking the datacenter or otherwise subverting safeguards and gaining power.

[8] A previous version of the RSP said "We also expect a strong affirmative case (made with accountability for both the reasoning and implementation) about the risk of models pursuing misaligned goals will be required" and "We will specify these requirements more precisely when we reach the 2-8 hour software engineering tasks checkpoint."