Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments
arXiv:2511.13788v1
Abstract: Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones –…
