Tokenizer¶
The tokenizer feature flag adds two public functions for counting tokens before a request is sent: count_tokens for raw text and count_request_tokens for a full chat completion request.
Tokenizers are loaded from HuggingFace Hub on first use and cached in a process-global RwLock<HashMap> for the process lifetime. Subsequent calls to the same model family reuse the cached tokenizer with only a read-lock.
Enabling¶
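Token counting is gated behind the tokenizer Cargo feature. Enable it on the liter-llm dependency in Cargo.toml (for example, features = ["tokenizer"]); without the feature the functions described below are not compiled in.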
API¶
count_tokens¶
Count tokens in a plain string using the tokenizer for a given model name.
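A minimal usage sketch; the (model, text) argument order mirrors count_request_tokens below, and the exact return type is assumed rather than quoted from the crate:

```rust
use liter_llm::tokenizer;

// Assumed shape: count_tokens(model_name, text) returns a Result with the token count.
let n = tokenizer::count_tokens("gpt-4o", "The quick brown fox jumps over the lazy dog.")?;
println!("{n} tokens");
```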
count_request_tokens¶
Count tokens across all messages in a ChatCompletionRequest. Adds 4 tokens per message as overhead for role tag, separators, and formatting metadata. This matches the OpenAI tokenization overhead estimate. Multimodal content parts (images, audio, documents) are not counted; only text content contributes.
```rust
use liter_llm::{tokenizer, types::{ChatCompletionRequest, Message, SystemMessage, UserMessage, UserContent}};

let req = ChatCompletionRequest {
    model: "gpt-4o".to_owned(),
    messages: vec![
        Message::System(SystemMessage { content: "You are helpful.".into(), name: None }),
        Message::User(UserMessage {
            content: UserContent::Text("What is 2+2?".into()),
            name: None,
        }),
    ],
    ..Default::default()
};

let n = tokenizer::count_request_tokens("gpt-4o", &req)?;
// n = text_tokens + (2 messages × 4 overhead tokens)
```
Model-to-tokenizer mapping¶
liter-llm maps model name prefixes to HuggingFace tokenizer repository IDs. When no prefix matches, the GPT-4o tokenizer is used as a reasonable approximation for modern LLMs.
| Model prefix | HuggingFace tokenizer | Notes |
|---|---|---|
| gpt-4, gpt-3.5, chatgpt, o1, o3, o4 | Xenova/gpt-4o | Covers all OpenAI chat and reasoning models |
| claude, anthropic | Xenova/claude-tokenizer | Anthropic models |
| gemini, vertex_ai | google/gemma-2b | Google / Vertex AI models |
| mistral, codestral | mistralai/Mistral-7B-v0.1 | Mistral models |
| command, cohere | Cohere/command-r-plus-tokenizer | Cohere models |
| llama, meta-llama | meta-llama/Meta-Llama-3-8B | Meta Llama models |
| (all others) | Xenova/gpt-4o | Fallback; approximation only |
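The lookup is a prefix match with a fallback. A rough, abridged sketch under that assumption (the real liter-llm implementation may normalize or match model names differently):

```rust
// Abridged sketch of the prefix lookup; not the actual liter-llm code.
fn tokenizer_repo(model: &str) -> &'static str {
    const MAP: [(&str, &str); 6] = [
        ("gpt-4", "Xenova/gpt-4o"),
        ("claude", "Xenova/claude-tokenizer"),
        ("gemini", "google/gemma-2b"),
        ("mistral", "mistralai/Mistral-7B-v0.1"),
        ("command", "Cohere/command-r-plus-tokenizer"),
        ("llama", "meta-llama/Meta-Llama-3-8B"),
    ];
    let model = model.to_ascii_lowercase();
    for (prefix, repo) in MAP {
        if model.starts_with(prefix) {
            return repo;
        }
    }
    "Xenova/gpt-4o" // fallback for unrecognized model names
}
```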
These counts are estimates. Every provider tokenizes slightly differently; the numbers are useful for pre-flight checks (avoiding obvious context-window overflows) but not for exact billing.
Caching behavior¶
The first call for a model family downloads the tokenizer from HuggingFace Hub and stores it in a process-global cache. This requires network access and takes roughly 100-500 ms depending on model size and connection speed. All subsequent calls use the cached tokenizer.
The cache uses a two-phase locking strategy, sketched after the list below, to avoid redundant downloads under concurrent access:
- Read lock: check if the tokenizer is already cached.
- On miss, write lock: double-check (another task may have raced), then download and insert.
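A minimal sketch of that strategy, assuming std::sync primitives and the tokenizers crate with its http feature enabled; the function name, cache key, and error type here are illustrative, and liter-llm maps the failure cases to LiterLlmError::BadRequest as described below:

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

use tokenizers::Tokenizer;

// Illustrative process-global cache keyed by tokenizer repo ID; liter-llm's
// internal layout may differ, this only sketches the two-phase locking.
static CACHE: OnceLock<RwLock<HashMap<String, Arc<Tokenizer>>>> = OnceLock::new();

fn cached_tokenizer(repo_id: &str) -> Result<Arc<Tokenizer>, Box<dyn std::error::Error>> {
    let cache = CACHE.get_or_init(|| RwLock::new(HashMap::new()));

    // Phase 1: fast path under a read lock.
    if let Some(tok) = cache.read().map_err(|e| e.to_string())?.get(repo_id) {
        return Ok(Arc::clone(tok));
    }

    // Phase 2: take the write lock and double-check, since another task may
    // have inserted the tokenizer while we were waiting for the lock.
    let mut guard = cache.write().map_err(|e| e.to_string())?;
    if let Some(tok) = guard.get(repo_id) {
        return Ok(Arc::clone(tok));
    }

    // Download from HuggingFace Hub (requires the `http` feature of `tokenizers`).
    let tok = Arc::new(Tokenizer::from_pretrained(repo_id, None).map_err(|e| e.to_string())?);
    guard.insert(repo_id.to_owned(), Arc::clone(&tok));
    Ok(tok)
}
```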
Poisoned locks return LiterLlmError::BadRequest. Tokenizer download failures (network errors, missing repository) also return LiterLlmError::BadRequest with the HuggingFace error message.
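Callers that want to surface these failures can match on the error. Treat the following as a sketch: the import path for the error type and the assumption that BadRequest carries the message as a plain string are not confirmed by this page:

```rust
use liter_llm::{tokenizer, LiterLlmError};

// Hypothetical handling; the BadRequest payload shape and the error's import
// path are assumptions for illustration only.
match tokenizer::count_tokens("gpt-4o", "hello") {
    Ok(n) => println!("{n} tokens"),
    Err(LiterLlmError::BadRequest(msg)) => eprintln!("tokenizer unavailable: {msg}"),
    Err(e) => eprintln!("unexpected error: {e}"),
}
```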
Use cases¶
Token counting is useful in two scenarios:
- Pre-flight context-window checks. Count tokens before sending to avoid a ContextWindowExceeded error and trim the conversation history if needed; see the sketch after this list.
- Budget estimation. Combine with Cost Estimation to project cost before committing to a request.
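A sketch of the first scenario, reusing req from the count_request_tokens example above; the context window and reply budget below are illustrative numbers, not values exposed by liter-llm:

```rust
use liter_llm::tokenizer;

// Illustrative limits; pick values that match your target model and use case.
const CONTEXT_WINDOW: usize = 128_000;
const REPLY_BUDGET: usize = 1_024;

let prompt_tokens = tokenizer::count_request_tokens("gpt-4o", &req)?;
if prompt_tokens + REPLY_BUDGET > CONTEXT_WINDOW {
    // Trim or summarize the oldest non-system messages in req.messages
    // before sending, instead of letting the provider reject the request.
}
```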