## One-Liner
"Patches (groups of bytes, replacing tokenization) scale better than tokens."

## Motivation / Novelty
- Typical byte-level LMs are very expensive: every byte is its own position, so sequences are far longer than with tokenization.
- Tokenizers struggle to go beyond ~4-6 bytes per token on average, a consequence of Zipf's Law (the long tail of rare strings resists merging).
- So instead: model the byte stream as dynamically sized patches.

## Notable Methods
- Patching: "how do we segment the byte sequence into patches?" Insight: group predictable bytes after every hard choice. Once you train a small byte-level model, most next bytes are "obvious" (low entropy), so start a new patch only where entropy spikes. See the first sketch under Notes below.
- Patcher and unpatcher cross-attend: lightweight local modules pool byte representations into patch representations (and back out to bytes), so the large latent transformer runs over patches only. A rough sketch of this is also under Notes.

## Key Figs

## New Concepts

## Notes
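A minimal sketch of the entropy-based patching rule, in plain Python. The `next_byte_probs` callable (a stand-in for the small byte LM) and the `threshold` value are my own hypothetical names and numbers, not the paper's:

```python
import math
from typing import Callable

def entropy(probs: list[float]) -> float:
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def segment_into_patches(
    data: bytes,
    next_byte_probs: Callable[[bytes, int], list[float]],
    threshold: float = 2.0,  # hypothetical cutoff in bits, not the paper's
) -> list[bytes]:
    """Cut a new patch wherever the byte LM is uncertain about the
    next byte; extend the current patch while bytes stay predictable."""
    patches: list[bytes] = []
    start = 0
    for i in range(1, len(data)):
        # Uncertainty of the model's prediction for byte i given data[:i].
        h = entropy(next_byte_probs(data, i))
        if h > threshold:  # a "hard choice": byte i opens a new patch
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Toy stand-in model: uniform over all 256 byte values, i.e. maximum
# entropy (8 bits) everywhere, so every byte becomes its own patch.
uniform = lambda data, i: [1 / 256] * 256
print(segment_into_patches(b"hello world", uniform, threshold=7.0))
```

With a real byte LM the cuts land at genuinely unpredictable positions (word starts, punctuation), giving long patches through predictable spans and short ones at hard choices.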

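And a rough PyTorch sketch of the byte-to-patch cross-attention on the encoder side, assuming one shared learned query per patch; the names and shapes here are guesses for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class PatchPooler(nn.Module):
    """Local encoder step: a learned query cross-attends over the byte
    hidden states inside each patch to produce one patch embedding."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # shared patch query
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, byte_states, patch_bounds):
        # byte_states: (1, seq_len, d_model); patch_bounds: [(start, end), ...]
        patch_embs = []
        for start, end in patch_bounds:
            kv = byte_states[:, start:end, :]        # bytes of this patch
            out, _ = self.xattn(self.query, kv, kv)  # query attends to its bytes
            patch_embs.append(out)
        return torch.cat(patch_embs, dim=1)          # (1, n_patches, d_model)

# Usage: pool an 11-byte sequence into two patches, e.g. "hello" / " world".
pooler = PatchPooler()
byte_h = torch.randn(1, 11, 256)
patches = pooler(byte_h, [(0, 5), (5, 11)])
print(patches.shape)  # torch.Size([1, 2, 256])
```

The unpatcher runs the same idea in reverse: byte positions query the patch embeddings to recover byte-level predictions.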