Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDFs

September 15, 2025

2025-09-14 23:55 GMT · aimagpro.com

Hugging Face has unveiled FinePDFs, the largest publicly available corpus built entirely from PDFs. The dataset spans 475 million documents in 1,733 languages, totaling roughly 3 trillion tokens. At 3.65 terabytes in size, FinePDFs introduces a new dimension to open training datasets by tapping into a resource long considered too complex and expensive to process.

By Robert Krzaczyński