Archives AI News

Beyond Rewards in Reinforcement Learning for Cyber Defence

arXiv:2602.04809v3 Announce Type: replace-cross Abstract: Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered…

OPRD: On-Policy Representation Distillation

arXiv:2606.06021v1 Announce Type: cross Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen’s ~150k tokens)…

Query-efficient model evaluation using cached responses

arXiv:2605.07096v2 Announce Type: replace-cross Abstract: Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice,…

Beyond Rewards in Reinforcement Learning for Cyber Defence

arXiv:2602.04809v3 Announce Type: replace-cross Abstract: Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered…

Learning Adaptive Parallel Execution for Efficient Code Localization

arXiv:2601.19568v2 Announce Type: replace Abstract: Code localization constitutes a key bottleneck in automated software development pipelines. While concurrent tool execution can enhance discovery speed, current agents demonstrate a 34.9% redundant invocation rate, which negates parallelism benefits. We propose FuseSearch, reformulating…

OPRD: On-Policy Representation Distillation

arXiv:2606.06021v1 Announce Type: cross Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen’s ~150k tokens)…

Query-efficient model evaluation using cached responses

arXiv:2605.07096v2 Announce Type: replace-cross Abstract: Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice,…