MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
arXiv:2508.02343v2
Announce Type: replace
Abstract: Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format.…
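
As a rough illustration of what replacing a high-precision matrix with a low-precision counterpart can look like, here is a minimal sketch of symmetric per-tensor INT4 quantization in plain Python. It is a generic textbook scheme, not the MicroMix method described in the paper; the function names `quantize_int4` and `dequantize` and the sample weights are hypothetical and chosen only for this example.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7].

    Generic illustration only; this is an assumption-laden sketch,
    not the scheme proposed in the paper.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to code 7; use 1.0 if the input is all zeros.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the INT4 range.
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical sample weights.
weights = [0.70, -0.30, 0.12, 0.0]
codes, scale = quantize_int4(weights)
approx = dequantize(codes, scale)
print(codes)  # [7, -3, 1, 0]
```

Each value is stored as a 4-bit code plus one shared scale, and the reconstruction error per element is bounded by about half the scale. Here 0.12 comes back as roughly 0.10.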
