Anthropic Finds LLMs Can Be Poisoned Using Small Number of Documents

November 11, 2025

Anthropic’s Alignment Science team released a study on poisoning attacks on LLM training. The experiments covered a range of model sizes and datasets, and found that only 250 malicious examples in pre-training data were needed to create a “backdoor” vulnerability. Anthropic concludes that these attacks actually become easier as models scale up.

By Anthony Alford