Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn’t Matter (Much)
arXiv:2502.04499v2
Announce Type: replace
Abstract: Knowledge distillation (KD) is a popular method of transferring knowledge from a large “teacher” model to a small “student” model. Previous work has explored various layer-selection strategies (e.g., forward matching and in-order random matching) for…
