The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs
arXiv:2509.14704v1 (Announce Type: new)

Abstract: Benchmark saturation and contamination undermine confidence in LLM evaluation. We present Nazonazo, a cost-effective and extensible benchmark built from Japanese children’s riddles to test insight-based reasoning. Items are short (mostly one sentence), require no specialized…
