What Are LLM Tokens?
Instead of processing whole words the way humans read, Large Language Models break text down into tokens, which can be whole words, parts of words, or individual characters. For example, "I don't like eggs" might be split into tokens like I, don, 't, like, egg, s.
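You can inspect this splitting yourself. Below is a minimal sketch, assuming the tiktoken package (OpenAI's open-source tokenizer library) is installed; the exact pieces depend on the tokenizer, so they may differ from the example above.

```python
# Minimal sketch using tiktoken (pip install tiktoken).
# Exact splits vary by tokenizer, so output may differ from the article's example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

token_ids = enc.encode("I don't like eggs")
# Decode each token id individually to see the actual text pieces.
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
print(pieces, f"({len(token_ids)} tokens)")
```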
How Text Becomes Tokens
Your text is converted into tokens through a smart splitting process called tokenization:
- Smart Splitting: Text is divided into tokens based on common word parts, patterns, and punctuation.
- Example ("can't"): This common contraction often becomes two tokens, "can" and "'t", because the "'t" piece changes the meaning of "can".
- Recognizing Word Structure: Words like Brightness and Darkness are split into smaller parts ("Bright" + "ness", "Dark" + "ness"). Because both end in "ness", the model can recognize similarities in word formation and meaning.
- Note that these patterns are not universal: many models tokenize "can't" as a single token (see the sketch after this list).
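Here is a short sketch of how to check these splits yourself, again assuming tiktoken is installed; whether "can't" comes out as one token or two depends on the tokenizer.

```python
# Sketch: inspect how one particular tokenizer splits these words.
# Results vary by model and tokenizer version.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["can't", "Brightness", "Darkness"]:
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8")
              for t in enc.encode(word)]
    print(f"{word!r} -> {pieces}")
```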
Tokens in Different Languages
The number of tokens can change a lot between languages, even for the same concept:
- Language Efficiency: Languages that pack more meaning into fewer characters (like Chinese) might seem to need fewer tokens than English for the same idea. In practice, because state-of-the-art tokenizers are trained primarily on English text, the same sentence often costs more tokens in other languages (see the sketch below).
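The sketch below compares token counts for an English sentence and its Chinese translation, again assuming tiktoken; the specific counts are illustrative and will vary by tokenizer.

```python
# Sketch: compare token counts across languages for the same meaning.
# Counts are illustrative and depend on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "I would like a cup of tea."
chinese = "我想要一杯茶。"  # the same request in Chinese

for label, text in [("English", english), ("Chinese", chinese)]:
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} characters -> {n_tokens} tokens")
```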
Understanding tokens helps you get more out of LLMs, since tokens are what models count against pricing and context-window limits. You can see how any text is tokenized using tools like OpenAI's Tokenizer.
Source: https://learnprompting.org/docs/basics/chatbot_basics#tokens