Rate-Distortion Optimization for Transformer Inference
arXiv:2601.22002v3 Announce Type: replace Abstract: Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process across multiple devices, which, in turn, requires compressing…
