Generative AI models don’t process text the same way humans do. Understanding their “tokenization”-based internal environments may help explain some of their strange behaviors and stubborn limitations.
Transformer Architecture and Tokenization
Most models, from small on-device ones like Gemma to OpenAI’s industry-leading GPT-4, are built on an architecture known as the transformer. Transformers conjure up associations between text and other types of data, but they can’t take in or output raw text, at least not without a massive amount of compute. Instead, these models work with text that has been broken down into smaller, bite-sized pieces called tokens, a process known as tokenization.
How Tokenization Works
Tokens can be words like “fantastic.” Or they can be syllables, like “fan,” “tas,” and “tic.” Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).
Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.
Challenges with Tokenization
Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” and “time,” for example, while encoding “once upon a ” (which has trailing whitespace) as “once,” “upon,” “a,” and a standalone space token. Depending on how a model is prompted — with “once upon a” or “once upon a ” (the latter with a trailing space) — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.
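To see the effect firsthand, here is a minimal sketch using the open-source tiktoken library (just one tokenizer among many; the exact splits depend on which encoding you load):

```python
# Sketch: inspecting how a BPE tokenizer splits a prompt with and without
# trailing whitespace. Requires the `tiktoken` package.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

for prompt in ["once upon a time", "once upon a", "once upon a "]:
    token_ids = enc.encode(prompt)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(repr(prompt), "->", pieces)

# Prompts that differ only by a trailing space typically map to different
# token sequences, which is why a model can respond differently to them
# even though a human reads them as the same text.
```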
Expert Insight on Tokenization Issues
“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” said Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University. “I would guess that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”
Language-Specific Tokenization Problems
This “fuzziness” creates even more problems in languages other than English.
Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. However, not all languages use spaces to separate words. Chinese and Japanese don’t — nor do Thai or Khmer.
A 2023 Oxford study found that, because of differences in how non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language as the same task phrased in English. The same study—and another—found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.
Tokenization in Different Writing Systems
Tokenizers often treat each character in logographic systems of writing — systems in which printed symbols represent words without relating to pronunciation, like Chinese — as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages — languages where words are made up of small meaningful word elements called morphemes, such as Turkish — tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)
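The disparity is easy to check with the same kind of sketch; the counts below assume the tiktoken encoding used above and will differ for other tokenizers:

```python
# Sketch: counting tokens for the same greeting in different languages.
# Uses the `tiktoken` package; counts depend on the encoding chosen.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

greetings = {
    "English": "hello",
    "Thai": "สวัสดี",
    "Chinese": "你好",
}

for language, word in greetings.items():
    print(f"{language}: {len(enc.encode(word))} tokens")
```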
Comparative Analysis of Tokenization
In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens to capture the same meaning as in English.
Tokenization and Mathematical Understanding
Beyond language inequities, tokenization might explain why today’s models are bad at math.
Digits are rarely tokenized consistently. Because tokenizers don’t really know what numbers are, they might treat “380” as one token but represent “381” as a pair (“38” and “1”), effectively destroying the relationships between digits and the results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926.)
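A quick way to observe the inconsistency is to tokenize neighboring numbers and compare their splits. A minimal sketch, again assuming the tiktoken encoding (other tokenizers will split numbers differently):

```python
# Sketch: checking how a BPE tokenizer splits nearby numbers.
# Whether a given number is one token or several depends entirely on
# the vocabulary the tokenizer was trained with.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for number in ["380", "381", "7735", "7926"]:
    pieces = [enc.decode([tid]) for tid in enc.encode(number)]
    print(number, "->", pieces)

# If one number is a single token while its neighbor splits into two pieces,
# the model never sees the two values as structurally similar.
```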
That’s also why models struggle with anagram problems and reversing words.
Potential Solutions to Tokenization Issues
So, tokenization presents challenges for generative AI. Can they be solved?
Maybe.
Byte-Level Models and Future Directions
Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty because they do away with tokenization entirely. MambaByte, which works directly with the raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” such as words with swapped characters, odd spacing, and capitalized characters.
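For a rough sense of the input such models consume, the sketch below simply prints the UTF-8 bytes of a short string; this is not MambaByte’s code, only the kind of raw-byte sequence a byte-level model reads in place of tokens:

```python
# Sketch: what "working directly with raw bytes" means in practice.
# A byte-level model consumes a sequence like this instead of tokens.
text = "Once upon a time สวัสดี"
byte_sequence = list(text.encode("utf-8"))

print(len(text), "characters ->", len(byte_sequence), "bytes")
print(byte_sequence[:12], "...")
```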
However, models like MambaByte are in the early research stages.
“It’s probably best to let models look at characters directly without imposing tokenization. But right now, that’s just computationally infeasible for transformers,” Feucht said. “For transformer models in particular, computation scales quadratically with sequence length, so we want to use short text representations.”
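Feucht’s point about quadratic scaling is easy to put rough numbers on. A back-of-the-envelope sketch, assuming a hypothetical 1,000-token prompt and roughly four characters per token:

```python
# Back-of-the-envelope sketch of why sequence length matters so much for
# transformers: self-attention compares every position with every other,
# so the number of pairwise comparisons grows with the square of the length.
def attention_pairs(sequence_length: int) -> int:
    return sequence_length ** 2

token_length = 1_000   # hypothetical prompt length in tokens
char_length = 4_000    # the same prompt, assuming ~4 characters per token

print("token-level pairs:", attention_pairs(token_length))    # 1,000,000
print("character-level pairs:", attention_pairs(char_length)) # 16,000,000

# Dropping tokenization in favor of characters makes the sequence roughly
# 4x longer here, which makes attention roughly 16x more expensive.
```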
Looking Ahead
Barring a breakthrough in tokenization itself, new model architectures will likely be the key to overcoming its limitations. As research progresses, that could lead to more efficient and accurate models capable of handling diverse languages and complex tasks without the constraints tokenization imposes today.