VIRTUE: Visual-Interactive Text-Image Universal Embedder
arXiv:2510.00523v1 (announce type: new)
Abstract: Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify…
