MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
arXiv:2510.18830v1 Announce Type: cross Abstract: Long context windows have become a standard feature of Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse…
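The abstract is truncated before it describes MTraining's method, so as a generic illustration only, here is a minimal NumPy sketch of the *dynamic sparse attention* idea the title refers to: each query block dynamically selects its `top_k` most relevant key blocks (estimated via mean-pooled block similarity) and computes exact attention only over those, skipping the rest. All names and the block-selection heuristic are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # Standard full attention, used here as a reference baseline.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def dynamic_sparse_attention(q, k, v, block=4, top_k=2):
    """Toy dynamic sparse attention (illustrative, not MTraining's method):
    for each query block, attend only to the top_k key blocks ranked by a
    cheap block-level relevance estimate."""
    T, d = q.shape
    nb = T // block
    # Mean-pool queries and keys per block to estimate block relevance.
    qb = q.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    kb = k.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    block_scores = qb @ kb.T                    # (nb, nb)
    out = np.zeros_like(q)
    for i in range(nb):
        # Dynamically pick the top_k key blocks for this query block.
        keep = np.argsort(block_scores[i])[-top_k:]
        idx = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in keep]
        )
        # Exact attention restricted to the selected keys/values.
        scores = q[i * block:(i + 1) * block] @ k[idx].T / np.sqrt(d)
        out[i * block:(i + 1) * block] = softmax(scores) @ v[idx]
    return out
```

Because the selection is per-query-block and data-dependent, the sparsity pattern changes with the input, which is what makes such attention "dynamic"; when `top_k` equals the number of key blocks, the result reduces to dense attention.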
