Detecting and Filtering Unsafe Training Data via Data Attribution with Denoised Representation
arXiv:2502.11411v2 Announce Type: replace

Abstract: Large language models (LLMs) are highly sensitive to even small amounts of unsafe training data, making effective detection and filtering essential for trustworthy model development. Current state-of-the-art (SOTA) detection approaches primarily rely on moderation classifiers,…
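The abstract is truncated before the method details, but the title names the core technique: data attribution for flagging unsafe training examples. As a point of reference only, and not the paper's actual algorithm, the sketch below shows one common form of gradient-based data attribution: scoring each candidate training example by the cosine similarity between its loss gradient and the mean loss gradient of a small set of known-unsafe reference examples. Everything here is an illustrative assumption, including the toy model, the helpers `flat_grad` and `attribution_scores`, and the flagging threshold.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for an LLM: a small linear classifier. In practice the model,
# data, and loss would be the LLM and its training objective.
model = nn.Linear(16, 2)

def loss_fn(model, example):
    """Per-example training loss whose gradient we attribute."""
    x, y = example
    return F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))

def flat_grad(model, loss):
    """Flatten the gradient of `loss` w.r.t. all trainable parameters."""
    grads = torch.autograd.grad(
        loss, [p for p in model.parameters() if p.requires_grad]
    )
    return torch.cat([g.reshape(-1) for g in grads])

def attribution_scores(model, loss_fn, candidates, unsafe_refs):
    """Cosine similarity between each candidate's loss gradient and the
    mean gradient of known-unsafe references; higher = more suspect."""
    ref = torch.stack(
        [flat_grad(model, loss_fn(model, ex)) for ex in unsafe_refs]
    ).mean(0)
    ref = F.normalize(ref, dim=0)
    return [
        torch.dot(F.normalize(flat_grad(model, loss_fn(model, ex)), dim=0), ref).item()
        for ex in candidates
    ]

# Hypothetical data: flag candidates whose score exceeds a threshold.
unsafe_refs = [(torch.randn(16), torch.tensor(1)) for _ in range(4)]
candidates = [(torch.randn(16), torch.tensor(0)) for _ in range(8)]
scores = attribution_scores(model, loss_fn, candidates, unsafe_refs)
flagged = [i for i, s in enumerate(scores) if s > 0.2]  # threshold is illustrative
print(flagged)
```

Unlike a moderation classifier, which judges text content in isolation, this style of attribution scores an example by its effect on training, which is presumably the motivation for the paper's approach; the "denoised representation" component from the title is not reflected in this sketch.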
