Archives AI News

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

arXiv:2605.05225v1 Announce Type: new Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-count-based load…

Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving

arXiv:2601.21351v2 Announce Type: replace Abstract: Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources,…