usually we use N to denote the number of tokens, and V the “vocab” or set of word types. Corpora is usually considered in context of: specific writers at specific time for specific varieties of specific languages for a specific function Particularly hard: code switching, gender, demographics, variety, etc. Herdan’s Law \begin{equation} |V| = kN^{\beta} \end{equation} with \beta being a constant between 0.67 < \beta < 0.75. The vocab size is roughly proportional to the number of tokens.

[[curator]]
I'm the Curator. I can help you navigate, organize, and curate this wiki. What would you like to do?