Archives AI News

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

arXiv:2604.26360v1 Announce Type: new Abstract: Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives, especially those derived from human preferences, are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead…

MoRFI: Monotonic Sparse Autoencoder Feature Identification

arXiv:2604.26866v1 Announce Type: cross Abstract: Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next-token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While…

The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety

arXiv:2603.02259v2 Announce Type: replace-cross Abstract: Multi-agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system…

FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection

arXiv:2604.24012v2 Announce Type: replace Abstract: Federated learning enables a population of clients to collaboratively train machine learning models without exchanging their raw data, but standard algorithms such as FedAvg suffer from slow convergence and high communication and memory costs in…

Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning

arXiv:2603.09145v3 Announce Type: replace Abstract: Current expansion-based methods for Class Incremental Learning (CIL) effectively mitigate catastrophic forgetting by freezing old features. However, such task-specific features learned from the new task may collide with the old features. From a causal perspective,…