arXiv:2605.03346v1 Announce Type: cross
Abstract: Embedding-based representations in Euclidean space $mathbb{R}^d$ are a cornerstone of modern machine learning, where a major goal is to use the emph{smallest dimension} that faithfully captures data relations. In this work, we prove sharp dimension–accuracy tradeoffs and identify a fundamental information-theoretic limitation: unless the embedding dimension $d$ is chosen close to the ground-truth dimension $D$, accuracy undergoes a sudden collapse. Our main result shows that this phenomenon arises even in standard contrastive learning settings, where supervision is limited to a set of $m$ anchor–positive–negative triplets $(i,j,k)$ encoding distance comparisons $mathrm{dist}(i,j) < mathrm{dist}(i,k)$. Specifically, given triplets realizable by an unknown ground-truth embedding in $D$ dimensions, we prove that there exists constant $c < 1$, such that emph{every embedding of dimension at most $cD$ violates half of the triplets}, yielding accuracy as low as a trivial one-dimensional solution that ignores the input. We complement our information-theoretic bounds with strong computational hardness results: under the Unique Games Conjecture, even if the given triplets are nearly realizable in $D=1$ dimension, no polynomial-time algorithm — textit{regardless of its dimension} — can achieve accuracy above the trivial $50%$ baseline.
