A Partition Cover Approach to Tokenization
arXiv:2501.06246v3 Announce Type: replace-cross Abstract: Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the…
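
Since the abstract names Byte-Pair Encoding as the leading tokenization algorithm, a minimal sketch of the standard BPE training loop may help situate it: repeatedly merge the most frequent adjacent token pair until the vocabulary reaches a target size. This is the textbook BPE procedure, not the paper's partition cover method; the function name, toy corpus, and `vocab_size` are illustrative assumptions.

```python
# Minimal sketch of the standard BPE merge loop (illustrative, not the
# paper's partition-cover approach). Corpus and vocab_size are toy values.
from collections import Counter

def bpe_train(corpus: list[str], vocab_size: int) -> list[tuple[str, str]]:
    """Learn BPE merges by repeatedly fusing the most frequent adjacent pair."""
    # Start from words as sequences of single-character tokens.
    words = [list(w) for w in corpus]
    vocab = {c for w in words for c in w}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merged = best[0] + best[1]
        merges.append(best)
        vocab.add(merged)
        # Apply the chosen merge everywhere in the corpus.
        for i, w in enumerate(words):
            j, out = 0, []
            while j < len(w):
                if j + 1 < len(w) and (w[j], w[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(w[j])
                    j += 1
            words[i] = out
    return merges

if __name__ == "__main__":
    print(bpe_train(["low", "lower", "lowest", "newest", "widest"], vocab_size=15))
```

Each learned merge becomes a new vocabulary entry, so the greedy, sequential nature of the merge loop is what ties BPE's vocabulary construction to the compression-style formulation the abstract refers to.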
