What Are LLM Tokens?
Instead of processing whole words the way humans read, Large Language Models break text down into tokens, which can be whole words, parts of words, or individual characters. For example, "I don't like eggs" might be split into tokens like I, don, 't, like, egg, s.
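You can inspect this splitting yourself. Below is a minimal sketch, assuming the tiktoken package (OpenAI's open-source tokenizer library) is installed; the exact pieces depend on the tokenizer, so they may differ from the example above.

```python
# Minimal sketch using tiktoken (pip install tiktoken).
# Exact splits vary by tokenizer, so output may differ from the article's example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

token_ids = enc.encode("I don't like eggs")
# Decode each token id individually to see the actual text pieces.
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
print(pieces, f"({len(token_ids)} tokens)")
```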
How Text Becomes Tokens
Your text is converted into tokens through a smart splitting process called tokenization:
- Smart Splitting: Text is divided into tokens based on common word parts, patterns, and punctuation.
- Example ("can't"): This common contraction often becomes two tokens, "can" and "'t", because the "'t" piece changes the meaning of "can".
- Recognizing Word Structure: Words like Brightness and Darkness are split into smaller parts ("Bright" + "ness", "Dark" + "ness"). Because both end in "ness", the model can recognize similarities in word formation and meaning.
- Note that these patterns are not universal: many models tokenize "can't" as a single token (see the sketch after this list).
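Here is a short sketch of how to check these splits yourself, again assuming tiktoken is installed; whether "can't" comes out as one token or two depends on the tokenizer.

```python
# Sketch: inspect how one particular tokenizer splits these words.
# Results vary by model and tokenizer version.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["can't", "Brightness", "Darkness"]:
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8")
              for t in enc.encode(word)]
    print(f"{word!r} -> {pieces}")
```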
Tokens in Different Languages
The number of tokens can change a lot between languages, even for the same concept:
- Language Efficiency: Languages that pack more meaning into fewer characters (like Chinese) might seem to need fewer tokens than English for the same idea. In practice, because state-of-the-art tokenizers are trained primarily on English text, the same sentence often costs more tokens in other languages (see the sketch below).
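The sketch below compares token counts for an English sentence and its Chinese translation, again assuming tiktoken; the specific counts are illustrative and will vary by tokenizer.

```python
# Sketch: compare token counts across languages for the same meaning.
# Counts are illustrative and depend on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "I would like a cup of tea."
chinese = "我想要一杯茶。"  # the same request in Chinese

for label, text in [("English", english), ("Chinese", chinese)]:
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} characters -> {n_tokens} tokens")
```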
Understanding tokens helps you get more out of LLMs, since tokens are what models count against pricing and context-window limits. You can see how any text is tokenized using tools like OpenAI's Tokenizer.
Source: https://learnprompting.org/docs/basics/chatbot_basics#tokens