Anthropic Finds LLMs Can Be Poisoned Using Small Number of Documents
Anthropic’s Alignment Science team released a study of data-poisoning attacks on LLM training. The experiments covered a range of model sizes and datasets, and found that as few as 250 malicious documents in the pre-training data were enough to create a “backdoor” vulnerability.…
