UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model

2026-01-13 20:00 GMT · 5 months ago aimagpro.com

arXiv:2503.08120v5 Announce Type: replace-cross
Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily faces two challenges: $textbf{(1)}$ $textbf{fragmentation development}$, with existing methods failing to unify understanding and generation into a single one, hindering the way to artificial general intelligence. $textbf{(2) lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To handle those issues, we propose $textbf{UniF$^2$ace}$, $textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, unifying masked generative models with discrete score matching diffusion and leading to a more precise approximation of the negative log-likelihood. Moreover, this D3Diff significantly enhances the model’s ability to synthesize high-fidelity facial details aligned with text input. $textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture, adaptively incorporating the semantic and identity facial embeddings to complement the attribute forgotten phenomenon in representation evolvement. $textbf{Finally}$, to this end, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models with a similar scale in both understanding and generation tasks, with 7.1% higher Desc-GPT and 6.6% higher VQA-score, respectively.