Accelerating 2K scale pre-training up to 1.28x with TorchAO, MXFP8 and TorchTitan on Crusoe B200 Cluster
tldr: 1.22x – 1.28x training acceleration with MXFP8, with convergence equivalent to BF16. We recently worked with a Crusoe B200 cluster of 1,856 GPUs, giving us a first look at...
