Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
arXiv:2604.09189v1 Announce Type: cross
Abstract: LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their…
