From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
arXiv:2603.26839v1 Announce Type: new Abstract: How do multimodal models solve visual spatial tasks — through genuine planning, or through brute-force search in token space? We introduce textsc{MazeBench}, a benchmark of 110 procedurally generated maze images across nine controlled groups, and…
