## One-Liner
"Patches (groups of bytes, replacing tokenization) scale better than tokens."

## Motivation / Novelty
- Typical byte-level LMs are very expensive: every byte is its own position, so sequences are far longer than with tokenization.
- Tokenizers struggle to go beyond ~4-6 bytes per token on average, a consequence of Zipf's Law (the long tail of rare strings resists merging).
- So instead: model the byte stream as dynamically sized patches.

## Notable Methods
- Patching: "how do we segment the byte sequence into patches?" Insight: group predictable bytes after every hard choice. Once you train a small byte-level model, most next bytes are "obvious" (low entropy), so start a new patch only where entropy spikes. See the first sketch under Notes below.
- Patcher and unpatcher cross-attend: lightweight local modules pool byte representations into patch representations (and back out to bytes), so the large latent transformer runs over patches only. A rough sketch of this is also under Notes.

## Key Figs

## New Concepts

## Notes
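A minimal sketch of the entropy-based patching rule, in plain Python. The `next_byte_probs` callable (a stand-in for the small byte LM) and the `threshold` value are my own hypothetical names and numbers, not the paper's:

```python
import math
from typing import Callable

def entropy(probs: list[float]) -> float:
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def segment_into_patches(
    data: bytes,
    next_byte_probs: Callable[[bytes, int], list[float]],
    threshold: float = 2.0,  # hypothetical cutoff in bits, not the paper's
) -> list[bytes]:
    """Cut a new patch wherever the byte LM is uncertain about the
    next byte; extend the current patch while bytes stay predictable."""
    patches: list[bytes] = []
    start = 0
    for i in range(1, len(data)):
        # Uncertainty of the model's prediction for byte i given data[:i].
        h = entropy(next_byte_probs(data, i))
        if h > threshold:  # a "hard choice": byte i opens a new patch
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Toy stand-in model: uniform over all 256 byte values, i.e. maximum
# entropy (8 bits) everywhere, so every byte becomes its own patch.
uniform = lambda data, i: [1 / 256] * 256
print(segment_into_patches(b"hello world", uniform, threshold=7.0))
```

With a real byte LM the cuts land at genuinely unpredictable positions (word starts, punctuation), giving long patches through predictable spans and short ones at hard choices.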

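And a rough PyTorch sketch of the byte-to-patch cross-attention on the encoder side, assuming one shared learned query per patch; the names and shapes here are guesses for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class PatchPooler(nn.Module):
    """Local encoder step: a learned query cross-attends over the byte
    hidden states inside each patch to produce one patch embedding."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # shared patch query
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, byte_states, patch_bounds):
        # byte_states: (1, seq_len, d_model); patch_bounds: [(start, end), ...]
        patch_embs = []
        for start, end in patch_bounds:
            kv = byte_states[:, start:end, :]        # bytes of this patch
            out, _ = self.xattn(self.query, kv, kv)  # query attends to its bytes
            patch_embs.append(out)
        return torch.cat(patch_embs, dim=1)          # (1, n_patches, d_model)

# Usage: pool an 11-byte sequence into two patches, e.g. "hello" / " world".
pooler = PatchPooler()
byte_h = torch.randn(1, 11, 256)
patches = pooler(byte_h, [(0, 5), (5, 11)])
print(patches.shape)  # torch.Size([1, 2, 256])
```

The unpatcher runs the same idea in reverse: byte positions query the patch embeddings to recover byte-level predictions.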