Archives AI News

Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)

Published on August 27, 2025 5:04 PM GMT

Here's a relatively important question regarding transparency requirements for AI companies: at which points in time should AI companies be required to disclose information? (While I focus on transparency, this question also applies to other safety-relevant requirements, and to norms around voluntary actions rather than requirements.)

A natural option would be to attach transparency requirements to the existing processes of pre-deployment testing and releasing a model card when a new model is released. That is, companies would be required to include the relevant information whenever they release a new model (likely in the model card). This is convenient because pre-deployment testing and model cards are already established norms in the AI industry, which makes it easier to attach something new to these existing processes than to create a new process. However, I think attaching requirements to model releases (such that the requirement must be met before release proceeds) has large downsides:

- Right now, AI companies are often in a massive rush to deploy models, and deployment is already a very complex process. If something is a blocker for model release, there will be massive pressure to get that thing done on time, regardless of the cost in quality. Thus, adding something to the requirements for deployment is likely to make that new thing get done poorly and in a heavily rushed way (and also makes it more likely that safety researchers' time is used inefficiently to get it done in a short period).
- Probably most of the risk from AI comes from internal rather than external deployment[1], so requiring transparency prior to external deployment isn't particularly important anyway (the model was very likely already internally deployed and, if nothing else, could already have been stolen).
- Correspondingly, another cost of attaching requirements to external releases of models is that it implies that external deployment is where the relevant risks live and that other activities are safe so long as a model isn't externally deployed. This could be a pretty costly misconception, and even if the relevant people agree that internal deployment risks are important (and understand that this is a misconception), I still think that tying the relevant processes to external deployment might make these people end up acting as though this is where the risk lives.
- Building on the prior bullet, transparency about models which aren't externally deployed is very important in some cases, and attaching transparency to external model releases makes it awkward to require transparency about models which aren't released externally (in the cases where this matters). In particular, this is important if internal models are far more capable than externally deployed models (much more capable on key metrics, much more capable qualitatively, or the equivalent of more than a year ahead at the 2025 rate of AI progress), especially if the absolute level of capability is high (e.g., human engineers in AI companies are now mostly managing teams of AI agents). Internal models being much more capable than externally deployed models could be caused by increased secrecy or by rapid progress from AI automation of AI R&D.

Rather than associating requirements with model release, you could instead require quarterly reporting or require information to be released within some period (e.g., a month) of a model release.
I'd prefer something like quarterly, but either of these options seems substantially better to me. (The cadence would need to be increased if AI progress becomes substantially more rapid due to AI automation of AI R&D.) Ideally, you'd also include a mechanism for requiring transparency about models which aren't externally deployed in some situations. E.g., you could require transparency about models which are non-trivially more capable than externally deployed models if they've been internally deployed for longer than some period (this could be required in the next quarterly report).[2] Whatever this requirement is, you could make it trigger only above some absolute capability threshold to reduce costs (if you sufficiently trust the process for evaluating this threshold), and you could allow companies to delay for some period before having to release the information so that it's less costly.[3]

By default, I predict that more requirements will be pushed into the release cycle of models and that AI companies will pressure (safety-relevant) employees to inefficiently rush to meet these requirements in a low-quality way. It's not obvious this is a mistake: perhaps it's too hard to establish some other way of integrating requirements. But, at the very least, we should be aware of the costs.

Why I don't like making (transparency) requirements trigger at the point of internal deployment (prior to rapid acceleration in AI progress)

All of these approaches (including attaching requirements to external model release) have the downside that requirements don't necessarily trigger for a given model before that specific model poses risks. In principle, you could resolve this by having requirements trigger prior to internal deployment (or, for security-relevant requirements, prior to training the model). However, this would make the rush issue even worse, the details get increasingly complex as processes for training and internally deploying models get more complex, and companies would resist requirements for (unilateral) transparency about models which they haven't yet released.[4]

Having things operate at a regular cadence or within some period of model releases (rather than having requirements trigger before the situation is risky) is equivalent to using the results from an earlier model as evidence about the safety of a later model. As long as AI development is sufficiently slow and continuous, and there is some safety buffer in the mitigations, this is fine. This is one reason why I think it would be good if the expectation were that safety measures should have some buffer: they should suffice for handling AIs which are as capable as the company has a reasonable chance of training within the next several months (ideally within the next 6 months).[5] If the rate of progress greatly increased (perhaps due to AI automating AI R&D or a new paradigm), this buffer wouldn't be feasible, so more regular reporting would be important.
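To make the shape of this kind of scheme concrete, here is a minimal sketch of the disclosure trigger described above (a regular reporting cadence, a trigger for internal-only models that outpace released ones, an absolute capability floor, and an allowed delay). Everything in it, including the metric, the thresholds, and the function names, is a hypothetical illustration rather than anything specified in this post.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical knobs of the scheme; none of these numbers come from the post.
REPORTING_CADENCE = timedelta(days=90)      # quarterly reports
DISCLOSURE_DELAY = timedelta(days=60)       # allowed lag before the information must be public
INTERNAL_GRACE_PERIOD = timedelta(days=90)  # how long an internal-only model can go unreported
ABSOLUTE_CAPABILITY_FLOOR = 0.5             # below this, no requirement triggers at all
CAPABILITY_MARGIN = 0.05                    # "non-trivially more capable" margin

@dataclass
class ModelStatus:
    capability_score: float                 # some agreed-upon capability metric (assumed scalar)
    internally_deployed_since: date | None  # None if not internally deployed
    externally_released: bool

def disclosure_due(model: ModelStatus, best_external_capability: float, today: date) -> bool:
    """Should this model appear in the next periodic report?"""
    if model.capability_score < ABSOLUTE_CAPABILITY_FLOOR:
        return False  # cost reduction: only trigger above an absolute capability threshold
    if model.externally_released:
        return True   # released models are already covered by the regular cadence
    if model.internally_deployed_since is None:
        return False
    # Internal-only models: report if they are non-trivially more capable than anything
    # externally deployed and have been internal for longer than the grace period.
    more_capable = model.capability_score > best_external_capability + CAPABILITY_MARGIN
    overdue = (today - model.internally_deployed_since) > INTERNAL_GRACE_PERIOD
    return more_capable and overdue
```

Note that REPORTING_CADENCE and DISCLOSURE_DELAY govern when the report is filed and published, not whether a model appears in it; they are listed only to show the scheme's other parameters.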
[1] At least this is true prior to large economic effects or obvious AI-enabled military buildup; at that point we can possibly adjust our approach. Also, transparency requirements (especially transparency requirements done in advance) are less important in worlds with massive obvious external effects of AI prior to potential catastrophe.

[2] Other options include: You could also require transparency within some period of the point when risk-relevant thresholds have been crossed (regardless of whether the model is externally deployed), but currently AI companies don't have granular enough thresholds for this to be sufficient, and it's hard to figure out what these thresholds should be in advance. You could also try to operationalize "way more capable than externally deployed models", but this might be tricky to do and it might be easier for companies to stretch the relevant threshold without this being brazen.

[3] Allowing for a (substantial) delay reduces costs in multiple ways. It gives the company more time to figure out what to disclose, and it also means the company is less likely to be forced to leak information about a future release. A delay of a few months is probably acceptable unless AI progress has sped up massively.

[4] As discussed above, transparency about models which aren't deployed externally is important in some cases, but companies would probably be less resistant if this were scoped more narrowly to these particular cases rather than always requiring immediate transparency when more capable models are deployed internally.

[5] Interestingly, using the results from an earlier model as evidence about a later model is effectively how the Anthropic RSP v2 works. Security requirements are only triggered by evaluations which are only required prior to model release, even though security is in principle important from the time the model was trained. Precisely, security is only required once a model is determined to reach some ASL level, and this determination is done as part of the release process. This seems fine so long as there is effectively some buffer (such that we'd expect security requirements to trigger early enough that not having secured the model for some period still keeps risk sufficiently low).

AI companies have started saying safeguards are load-bearing

Published on August 27, 2025 1:00 PM GMT

There are two ways to show that an AI system is safe: show that it doesn't have dangerous capabilities, or show that it's safe even if it has dangerous capabilities. Until three months ago, AI companies said their models didn't have dangerous capabilities. (At the time, I wrote that the companies' eval reports didn't support their claims that their models lacked dangerous bio capabilities.) Now, Anthropic, OpenAI, Google DeepMind, and xAI say their most powerful models might have dangerous biology capabilities and thus could substantially boost extremists—but not states—in creating bioweapons. To prevent such misuse, they must (1) prevent extremists from misusing the models via API and (2) prevent extremists from acquiring the model weights.[1] For (1), they claim classifiers block bad model outputs; for (2), they claim their security prevents extremists from stealing model weights.[2]

Are the new safety claims valid? On security, companies either aren't claiming to be secure against hacker groups or their claims are dubious. On misuse via API, for some companies the mitigations seem adequate if enabled, but it's unclear whether they're implemented all the time; for some companies the mitigations are opaque and dubious. Almost all risk comes from more powerful future models; analyzing current practices is worthwhile because it can illuminate to what extent the companies are honest, transparent, and on track for safety. Preventing misuse via API and securing model weights against hacker groups are much easier than the critical future safety challenges—preventing risks from misalignment and securing model weights against states—so companies' current performance is concerning for future risks.

Companies' safeguards for bio capabilities

On bio, the four companies all say their most powerful models might have dangerous capabilities[3] and say they've responded with safeguards against misuse via API and security safeguards.

On safeguards against misuse via API for their recent models, OpenAI and especially Anthropic published a reasonable outline of a classifier-focused safeguard plan and decent evidence that their safeguards are pretty robust (but not a full valid safety case, in my view). DeepMind also uses classifiers, but its report included almost no information about its safeguards — it didn't even specify that it uses classifiers, much less how robust they are. xAI says it has "deployed filters for [biological and] chemical weapons-related abuse," but not what "filters" means or how good they are. It also says "models refuse almost all harmful requests" even without filters, answering 0% by default or 1% with a jailbreak. Results of external assessment of the safeguards haven't been published, but xAI's claims are dubious; for example, the model doesn't refuse direct requests on synthesizing sarin gas. (But it's possible that that's intentional and the safeguards are narrow but effective.)
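As a rough illustration of what a classifier-focused safeguard looks like mechanically (a separate model scores completions and flagged ones are withheld), here is a minimal sketch. The function names, the scoring model, and the threshold are hypothetical stand-ins, not a description of any company's actual pipeline, which would also screen prompts, combine multiple classifiers, and escalate to human review.

```python
from typing import Callable

BLOCK_THRESHOLD = 0.8  # hypothetical confidence cutoff for withholding an output

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     bio_misuse_score: Callable[[str, str], float]) -> str:
    """Generate a completion, but withhold it if a misuse classifier flags it."""
    completion = generate(prompt)
    # Score the (prompt, completion) pair for bioweapons-relevant uplift.
    if bio_misuse_score(prompt, completion) >= BLOCK_THRESHOLD:
        return "I can't help with that."
    return completion
```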
On security, I believe SL3 (see below) is appropriate for current models due to the prospect of model weights being stolen by criminals and then being published or otherwise proliferating widely. (Assurance of safety would require SL5, but even a safety-focused company shouldn't implement SL5 unilaterally, in part due to being rushed by less safe companies.)

Anthropic says it is "highly protected against most attackers' attempts at stealing model weights," including "organized cybercrime groups" and "corporate espionage teams." If true, that's adequate. However, Anthropic's security standard excludes robustness to insiders who have or can request "access to systems that process model weights." This exclusion is dubious: it likely applies to many insiders, and being vulnerable to many insiders implies being vulnerable to many actors, because many actors can hack an insider by default. The exclusion was added to the standard just before Anthropic was required to meet it, presumably because Anthropic couldn't meet the old standard.

OpenAI says it has "sufficiently minimized" the risks. It seems to think that it is meeting its security standard for "High" capability models, but it doesn't say that explicitly. It doesn't say anything substantive on security, and it doesn't say which kinds of actors it thinks it's protected against. My guess is that it's slightly above SL2.

DeepMind says that it has implemented SL2 security, as recommended by its Frontier Safety Framework. DeepMind doesn't justify this claim, but it's totally credible. Unfortunately, it seems inadequate to avert misuse risk from current models.

xAI says it "has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor," but this is very implausible[4] and xAI doesn't justify it.

[Figure: RAND security levels. Source: RAND.]

For current models, proliferating to extremists would increase biorisk somewhat. But future models will be much more dangerous. If AIs that greatly accelerate AI research can immediately be stolen, then the developer can't prioritize safety much without risking getting outraced (and the geopolitical situation is worse than if the US has a solid lead). It's much easier to tell a story where developers prioritize safety if such model weights don't immediately proliferate, and that requires SL5. (Additionally, some aspects of security would help protect against data poisoning, AIs being trained to have secret loyalties, or AI misbehavior during internal deployment.)

Companies' planning for future safeguards

The companies' safety frameworks say they're preparing for three kinds of risks: misuse via API, misalignment, and model weight theft. Misuse-via-API safeguards will probably improve somewhat over time, but the basic plan—focused on classifiers to block bad outputs—will likely remain the same. But I think misuse via API is relatively unimportant;[5] I think model weight theft and misalignment are much larger problems, and the companies are much less on track.

On security planning, Anthropic's Responsible Scaling Policy says that the ASL-4 security standard "would protect against model-weight theft by state-level adversaries." (ASL is different from SL; ASL-4 is like SL5.) That sounds great, but without more of a plan it's not really credible or better than "trust us." It's supposed to be implemented by the time AI can "cause dramatic acceleration in the rate of effective scaling" (or roughly 2.5x the rate of algorithmic progress). Anthropic is the only company that says it wants to protect against states. Unfortunately, like the other companies, it doesn't seem to be on track to do so by the time it thinks AI will dramatically accelerate AI progress. DeepMind's planning only goes up to SL4, and that only for AI that "can fully automate the AI R&D pipeline at a competitive cost[,] relative to humans augmented by AI tools." OpenAI plans to write, in the future, a security standard higher than its current security. xAI says nothing about future security.
Strong security is costly, and current plans are likely to be abandoned; this possibility is explicit in companies' safety policies.[6] Ideally companies would sprint for SL5, or if that's too costly, (1) have substantially better security than other companies, (2) prepare to demonstrate that their security is relatively good, (3) proclaim that the security situation is bad due to rushing, and (4) try to coordinate with other companies so that they can improve security without the costs of doing so unilaterally.

On misalignment,[7] Anthropic's RSP merely says that when Anthropic's AI can "fully automate the work of an entry-level, remote-only Researcher at Anthropic," it will "develop an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case will describe, as relevant, evidence on model capabilities; evidence on AI alignment; mitigations (such as monitoring and other safeguards); and our overall reasoning."[8] (But Anthropic is doing some good work on risks from misalignment and could write a somewhat more convincing plan.) DeepMind has a good abstract plan, but it's not credible without more detail or transparency, especially given how little information DeepMind shared on security and misuse safeguards. OpenAI plans for "safeguards against a misaligned model," but the trigger is ambiguous and, crucially, the response plan includes both reasonable and confused options for showing that the system is safe, so OpenAI will likely rely on confused options. xAI's plan is deeply confused: it's about measuring lying/sycophancy propensity in certain situations, then training or prompting the model to be more honest if necessary. This misses the point in many ways; for example, it misses risk from models scheming subtly, and it doesn't consider more effective interventions. Ideally companies would do lots of good work aimed at averting risks from misalignment in future powerful models, prepare to have AI largely automate safety research when that becomes possible, and prepare to mitigate risks and catch misbehavior when AI poses risks from misalignment during internal deployment.

Small scorecard

Here's a new scorecard with some links to more information: aisafetyclaims.org. Note that security and misalignment risk prevention are more important and much harder than misuse prevention, in my view, and that they're largely open problems, but companies could be doing much better. Note that companies outside of Anthropic, OpenAI, and DeepMind are doing basically nothing on safety. See also my main scorecard, ailabwatch.org.

Thanks to Aryan Bhatt for suggestions. Subscribe to the AI Lab Watch blog.

[1] You can use an AI via API, where the developer controls the weights and can implement safeguards, or via weights (because the developer published them or they were stolen), where you control the weights and can easily disable safeguards. So preventing misuse involves preventing misuse via API and securing model weights.

[2] But even if model weights can't be stolen by extremists, they could be stolen by sophisticated actors, then proliferate widely (due to being sold, published, or stored insecurely).
The companies avoid discussing this possibility.

[3] The models are:
- Anthropic: Claude Opus 4 (May) and Opus 4.1 (August)
- OpenAI: ChatGPT Agent (July) (they say "biological and chemical," but in other places they clarify that they really just mean biological) and GPT-5 (August)
- Google DeepMind: Gemini 2.5 Deep Think (August) (they say CBRN, not just bio; I suspect they mostly mean bio, but they don't say that)
- xAI: Grok 4 (released July; the announcement that it had dangerous capabilities and required safeguards came in August) (they say biological and chemical, not just bio)
Evals generally show that the models are at or above human level on bio tasks. In particular, frontier models outperform almost all expert virologists on virology questions related to their specialties in the Virology Capabilities Test. For the previous generation of models—Sonnet 4, Gemini 2.5 Pro, o3-pro—the companies claimed that the models lacked dangerous capabilities (and thus didn't require robust safeguards), but they certainly didn't show that. See "AI companies' eval reports mostly don't support their claims" and Greenblatt's shortform.

[4] It's unusual for a tech company to be that secure; even Google doesn't claim to be. And xAI in particular hasn't demonstrated anything about security, and there's little reason to expect it to have better security than competitors. Ironically, on the same day that xAI made its security claim, it was reported that xAI had accidentally published hundreds of thousands of Grok chatbot conversations.

[5] I think misuse via API is a relatively small problem: misuse overall is smallish relative to some other risks; you only need to make your model not helpful on top of models with open weights, models with weak safeguards, and stolen models; we largely know how to do it; and especially before misuse is catastrophic, everyone will notice it's serious, and in the worst case companies can just stop widely deploying models.

[6] OpenAI says (but note that OpenAI hasn't defined a security standard beyond its current security):

We recognize that another frontier AI model developer might develop or release a system with High or Critical capability in one of this Framework's Tracked Categories and may do so without instituting comparable safeguards to the ones we have committed to. Such an action could significantly increase the baseline risk of severe harm being realized in the world, and limit the degree to which we can reduce risk using our safeguards.
If we are able to rigorously confirm that such a scenario has occurred, then we could adjust accordingly the level of safeguards that we require in that capability area, but only if:
- we assess that doing so does not meaningfully increase the overall risk of severe harm,
- we publicly acknowledge that we are making the adjustment,
- and, in order to avoid a race to the bottom on safety, we keep our safeguards at a level more protective than the other AI developer, and share information to validate this claim.

DeepMind says:

These mitigations should be understood as recommendations for the industry collectively: our adoption of them would only result in effective risk mitigation for society if all relevant organizations provide similar levels of protection, and our adoption of the protocols described in this Framework may depend on whether such organizations across the field adopt similar protocols.

Anthropic says:

It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.

xAI does not have such a clause.

On misalignment, mostly I think that the safety plans are weak/vague enough that companies will claim to be in compliance (even if they're being unsafe).

[7] I am particularly concerned about future powerful models being catastrophically misaligned during internal deployment and hacking the datacenter or otherwise subverting safeguards and gaining power.

[8] A previous version of the RSP said "We also expect a strong affirmative case (made with accountability for both the reasoning and implementation) about the risk of models pursuing misaligned goals will be required" and "We will specify these requirements more precisely when we reach the 2-8 hour software engineering tasks checkpoint."

Unlocking enterprise agility in the API economy

Across industries, enterprises are increasingly adopting an on-demand approach to compute, storage, and applications. They are favoring digital services that are faster to deploy, easier to scale, and better integrated with partner ecosystems. Yet, one critical pillar has lagged: the network. While software-defined networking has made inroads, many organizations still operate rigid, pre-provisioned networks. As…

The AI Hype Index: AI-designed antibiotics show promise

Separating AI reality from hyped-up fiction isn’t always easy. That’s why we’ve created the AI Hype Index—a simple, at-a-glance summary of everything you need to know about the state of the industry. Using AI to improve our health and well-being is one of the areas scientists and researchers are most excited about. The last month…