LongFlow: Efficient KV Cache Compression for Reasoning Models

arXiv:2603.11504v2 Announce Type: replace Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased…

Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

arXiv:2505.01595v2 Announce Type: replace-cross Abstract: We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However,…

AutoCompress: Critical Layer Isolation for Efficient Transformer Compression

arXiv:2604.22786v1 Announce Type: new Abstract: We present AutoCompress, a transformer compression method motivated by an empirical finding: in small transformers, Layer 0 carries disproportionately high task-critical information, with an NTK-based importance score of 3.6 compared to a maximum of 0.054…

Extreme bandits

arXiv:2604.24545v1 Announce Type: cross Abstract: In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values. In this paper, we study an efficient way to allocate these…

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

arXiv:2604.22782v1 Announce Type: new Abstract: Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to…