LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
arXiv:2411.04257v3
Abstract: Contemporary large language model (LLM) training pipelines require assembling internet-scale databases of text data from a variety of sources (e.g., web, academic, and publisher sources). Preprocessing these datasets via deduplication, detecting and…
