Hello everyone. In this section we are going to look at tokenization for Large Language Models (LLMs): the theory and ideas behind it that we will need when it is time to implement it in OpenCV's DNN module as cv::dnn::Tokenizer.
One natural question that comes to mind is what is tokenization?
Tokenization is the crucial step that translates human-readable text into the tokens that Large Language Models (LLMs) consume. It acts as a bridge between raw strings and the numeric IDs that the model processes. In LLMs, tokenization often means breaking text into subword units rather than whole words or characters, striking a balance between efficiency and expressiveness. Over the years, various tokenization algorithms have been developed. This writeup looks at four major methods: Byte Pair Encoding (BPE), WordPiece, SentencePiece, and the Unigram model. It explains how each works and why one might be chosen over another. I also look into how these tokenizers integrate with vision-language pipelines, specifically using OpenCV's DNN module for tasks like OCR (Optical Character Recognition) and feeding text into transformers.
Modern LLMs almost universally rely on subword tokenization. Unlike character-level tokenizers (such as in Karpathy's makemore project, which treats each character as a token), subword methods break text into pieces that are often larger than a single character but smaller than a word. This yields shorter sequences than character models while still handling unknown or rare words gracefully, for example by breaking them into familiar pieces.
Byte pair encoding is one of the earliest and most influential subword tokenization methods for LLMs. BPE was originally a data compression algorithm that got adopted for NLP in a 2015 paper by Sennrich et al. The idea is simple: start with an initial vocabulary of all individual characters or bytes, then iteratively merge the most frequent pair of adjacent symbols into a new token. Each merge adds one new token to the vocabulary, which can represent a commonly occurring subword. This process repeats until a target vocabulary size is reached.
How it works: Suppose we have words like hug, hugs, and bun in the training corpus. We'd start with single characters: h, u, g, s, b, n. BPE finds the most frequent pair of adjacent symbols. In this example, "u" + "g" appears often (in hug and hugs), so it is merged into a new token "ug". Now "hug" would be tokenized as ["h", "ug"], "hugs" as ["h", "ug", "s"], and so on. The algorithm continues merging frequent pairs ("u" + "n" -> "un", ...) until the vocabulary has grown to the desired size.
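The merge loop described above can be sketched in a few lines of Python. This is only a toy illustration, not how any production tokenizer is implemented: the corpus (hug, pug, hugs, bun) and its frequencies are made up, and the function names get_pair_counts, merge_pair, and train_bpe are hypothetical.

```python
from collections import Counter

# Assumed toy corpus: word -> made-up frequency.
TOY_CORPUS = {"hug": 10, "pug": 5, "hugs": 5, "bun": 4}

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merges; return the ordered merge list and the segmented corpus."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe(TOY_CORPUS, 2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

With these counts, "u" + "g" is merged first and "h" + "ug" second, so "hugs" ends up as ["hug", "s"]. Real implementations also define how ties between equally frequent pairs are broken, which this sketch leaves to dictionary order.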
OpenAI's GPT-2 introduced a subtle but important tweak: applying BPE at the byte level instead of on Unicode characters. By treating every byte (256 possible values) as a base symbol, the tokenizer can encode any text, including emojis or unknown characters, without ever producing an out-of-vocabulary token. GPT-2's tokenizer starts with 256 byte values and then has ~50,000 merges learned from data. This ensures lossless, reversible encoding for arbitrary text, eliminating the need for an <unk> token for unseen characters.
Why use BPE: BPE was popularized in LLMs by GPT-2 and remains widely used in models like GPT-3, GPT-4, LLaMA, and more. It has several benefits: it is reversible and lossless, it works on arbitrary text, and it compresses text efficiently, yielding token sequences that are on average roughly 4 times shorter than the raw character sequence. Another benefit is that BPE tends to group common substrings, so the model sees meaningful chunks (e.g. splitting "encoding" into "encod" + "ing"), letting the model recognize the common suffix. This helps the model generalize by exposing frequent subword patterns. BPE's simplicity and effectiveness made it the de facto starting point for tokenization in LLMs.
Many libraries provide BPE implementations. OpenAI's tiktoken is a fast BPE tokenizer for GPT models. HuggingFace's GPT2TokenizerFast, built on the tokenizers library, also implements byte-level BPE. I am using Karpathy's minBPE project as a basic template and inspiration for planning the layout of cv::dnn::Tokenizer. In practice, if you use a model like GPT-2 via HuggingFace's AutoTokenizer, it will automatically apply BPE under the hood; I plan to review and study how this process happens.
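To make the byte-level idea concrete, here is a minimal sketch of just the base alphabet, before any merges are applied. The helper names bytes_to_ids and ids_to_text are hypothetical, and GPT-2's real tokenizer adds further steps (pre-tokenization and the learned merges) on top of this.

```python
def bytes_to_ids(text):
    """Encode text as UTF-8 and use each byte value (0..255) as a base token ID."""
    return list(text.encode("utf-8"))

def ids_to_text(ids):
    """Invert bytes_to_ids: rebuild the byte string and decode it as UTF-8."""
    return bytes(ids).decode("utf-8")

sample = "héllo 👋"
ids = bytes_to_ids(sample)

# Every ID fits in the 256-symbol base vocabulary, even for the emoji,
# and decoding restores the original string exactly.
assert all(0 <= i < 256 for i in ids)
assert ids_to_text(ids) == sample
```

Because every possible input is some sequence of bytes, this base vocabulary never needs an unknown token; the merges only shorten the sequence.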
WordPiece is another subword tokenization method, originally developed at Google for speech and text models (Schuster et al., 2012) and famously used in BERT. At first glance it is similar to BPE: it also builds a vocabulary of subwords by iteratively adding tokens, but it uses a different criterion for which pair to merge next. Instead of merging the most frequent pair as BPE does, WordPiece chooses the pair that maximizes the likelihood of the training corpus when that pair is added to the vocabulary. In practice, this often means WordPiece will avoid merging very high-frequency pairs if their components are also high-frequency in other contexts. It prefers merges that give a big jump in explaining the data. One way to think of it is that WordPiece evaluates what information would be lost by not merging a pair, and only merges if the gain outweighs the loss.
How it works: WordPiece also starts with all individual characters in the vocab. It then tries potential merges and calculates a score, often the frequency of the pair divided by the product of the frequencies of its parts. The pair with the best score, meaning it is disproportionately frequent together, gets merged first. This process continues until the vocab reaches the desired size or the likelihood gain falls off. The end result is a set of subword tokens. To tokenize a new word, WordPiece uses a greedy longest-match algorithm: e.g. for "hugging", it tries to match the longest subword in the vocabulary that is a prefix of the word, then continues with the remainder. BERT's WordPiece vocabulary, for instance, contains tokens like "hug" and "##ging", where "##" is a marker indicating that the token is a continuation of a word. So "hugging" becomes "hug" + "##ging".
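Both pieces described above, the merge score and the greedy longest-match tokenization, can be sketched as follows. The vocabulary is a made-up toy set, the function names are hypothetical, and the "[UNK]" fallback for a word that cannot be fully matched mirrors BERT-style behaviour as an assumption of this sketch.

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    """Score a candidate merge: pair frequency over the product of its parts' frequencies."""
    return pair_freq / (first_freq * second_freq)

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first tokenization of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocab.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-of-word marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens

TOY_VOCAB = {"h", "hug", "##u", "##g", "##s", "##ging"}
print(wordpiece_tokenize("hugging", TOY_VOCAB))  # ['hug', '##ging']
print(wordpiece_tokenize("hugs", TOY_VOCAB))     # ['hug', '##s']
```

Note that only the final vocabulary is needed at tokenization time; the order in which pieces were learned plays no role here.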
Why use WordPiece: WordPiece was chosen for BERT and its descendants such as DistilBERT, Electra, and more, and has been proven to work well in practice. Compared to BPE, its merge decisions might yield a slightly different vocabulary: often WordPiece avoids fusing very common affixes with stems if those affixes appear broadly. In other words, WordPiece is a bit more conservative in merging, trying not to overfit to particular training words. The differences between BPE and WordPiece, however, are subtle; in fact, both algorithms often produce quite similar segmentations for most words. One notable difference is that WordPiece does not need to store the merge list once training is done; all it needs is the final vocabulary. This can simplify runtime tokenization: a new word is broken up by finding the longest vocab entry that fits, then the next, and so on, which is straightforward to implement. BPE, on the other hand, typically applies its learned merges in a fixed order at tokenization time. A longest-match pass over the final vocabulary can approximate this, but it is not guaranteed to produce identical segmentations.
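For contrast with longest-match, here is a minimal sketch of how BPE applies its merge list in a fixed order at tokenization time: at each step, the adjacent pair that was learned earliest (lowest rank) is merged. The merge list and the function name bpe_encode are assumptions for illustration; real tokenizers typically add pre-tokenization and caching around this loop.

```python
def bpe_encode(word, merges):
    """Tokenize one word by repeatedly applying the earliest-learned applicable merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Assumed merge list, in the order it would have been learned.
TOY_MERGES = [("u", "g"), ("h", "ug"), ("u", "n")]
print(bpe_encode("hugs", TOY_MERGES))  # ['hug', 's']
print(bpe_encode("bun", TOY_MERGES))   # ['b', 'un']
```

This is why a BPE tokenizer ships a merges file alongside its vocabulary, while a WordPiece tokenizer only needs the vocabulary.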
SentencePiece is not exactly a new algorithm, but rather an open-source tokenizer library from Google that supports multiple approaches (BPE, Unigram, and more) with some useful innovations. It was introduced by Kudo and Richardson (2018) to provide language-independent tokenization. One key feature is that SentencePiece treats the input as a stream of characters without pre-segmentation on whitespace. It uses a special symbol, ▁ (U+2581, which looks like an underscore) by default, to represent spaces. This means you don't need to preprocess the text to split by spaces or punctuation; SentencePiece handles it internally. This is helpful for languages without clear word boundaries and simplifies training pipelines. By default, the algorithm behind SentencePiece is the Unigram language model, described next. However, SentencePiece can also train a BPE model if requested, learning merges much like other BPE implementations. When SentencePiece is referred to in the context of models such as T5, it often means the Unigram-based tokenizer trained via the SentencePiece library.
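The whitespace handling can be illustrated with a small sketch. This is a simplification and not the SentencePiece API: it only shows spaces being replaced by the meta symbol ▁ (U+2581) with one added at the start of the text, and how joining the pieces restores the original string. The function names and the example split into pieces are hypothetical, and the real library also applies Unicode normalization.

```python
META = "\u2581"  # the "▁" symbol SentencePiece uses to mark a space

def escape_whitespace(text):
    """Mark word boundaries: prefix the text with ▁ and replace each space with ▁."""
    return META + text.replace(" ", META)

def unescape_whitespace(pieces):
    """Join pieces, turn ▁ back into spaces, and drop the single leading space."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

escaped = escape_whitespace("Hello world")
print(escaped)  # ▁Hello▁world

# Any split of the escaped string into pieces decodes back to the input.
pieces = [META + "Hello", META + "wor", "ld"]
assert unescape_whitespace(pieces) == "Hello world"
```

Because spaces are ordinary symbols inside the pieces, detokenization is just string concatenation, with no language-specific rules about where spaces belong.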
Why use SentencePiece: The library became popular because it's easy to use and very flexible. It's used in many Google models such as ALBERT, T5, mT5, and Pegasus, as well as in models from other groups, most often with the Unigram method. Advantages include seamless handling of whitespace and Unicode, direct training on raw text with no need for prior tokenization, and support for the sampling strategies (subword regularization) that the Unigram model enables.
The Unigram tokenization algorithm, often used via SentencePiece, takes a fundamentally different approach from BPE or WordPiece. Instead of starting with minimal units and merging to grow the vocabulary, Unigram starts with a large vocabulary and prunes it down to the desired size. It was introduced by Kudo (2018) as part of the Subword Regularization research.
Let's go through how it works. We begin with a very large pool of candidate tokens: this could be all substrings up to a certain length, or all BPE tokens from an overly large vocab, or all whole words plus pieces. The exact initialization doesn't matter too much, as long as it's comprehensive. Then we assume a probabilistic unigram language model over tokens: each token in the vocab has a probability. Any string can be tokenized in many ways as a sequence of these tokens, and we can assign each tokenization a likelihood equal to the product of its tokens' probabilities. The training objective is to maximize the likelihood of the training corpus under this model. In practice, Unigram training proceeds by iterative pruning: evaluate how important each token is to the corpus likelihood, and remove the least useful tokens in batches, e.g. remove the 10% of tokens whose removal causes the smallest increase in loss. After each pruning step, the remaining tokens' probabilities are re-estimated with maximum likelihood given the corpus segmentation. This repeats until the vocab size reaches the target. Notably, the algorithm ensures that basic characters remain in the vocab so that any input text can still be represented, even if it means spelling out a word character by character.
At the end of training, we have a vocabulary and an associated probability for each token. To tokenize new text, we choose the segmentation that maximizes the product of token probabilities. In other words, we break the text into tokens in the way that the model finds most probable. For example, given a Unigram model with a vocab containing "hug", "ug", ..., the word "hugs" could be segmented as ["hug", "s"] or ["h", "ug", "s"] or ["h", "u", "g", "s"]. The model will calculate the probability of each segmentation and choose the highest, most likely ["hug", "s"] if "hug" is a learned token.
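Picking the most probable segmentation is usually done with dynamic programming (a Viterbi search) rather than by listing every option. Below is a minimal sketch under stated assumptions: the token probabilities are invented for illustration (the remaining probability mass is assumed to belong to tokens not shown), and viterbi_segment is a hypothetical helper, not the SentencePiece implementation.

```python
import math

# Assumed toy model: token -> made-up probability, stored as log-probabilities.
TOY_LOG_PROBS = {
    token: math.log(p)
    for token, p in {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.10,
                     "ug": 0.10, "hug": 0.15}.items()
}

def viterbi_segment(text, log_probs):
    """Return (tokens, log_prob) of the most probable segmentation of `text`."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of any segmentation of text[:i]
    best_start = [0] * (n + 1)          # where the last token of that segmentation starts
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best_start[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best_score[n]

tokens, score = viterbi_segment("hugs", TOY_LOG_PROBS)
print(tokens)  # ['hug', 's']
```

With these numbers, ["hug", "s"] has probability 0.15 * 0.10 = 0.015, well above ["h", "ug", "s"] at 0.0005, so it wins. Working in log-probabilities turns the product into a sum and avoids underflow on long inputs.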
Why use Unigram: The Unigram model often yields tokenizations similar to BPE/WordPiece in practice, but it has a couple of advantages. First, by evaluating token combinations with a probability model, it can theoretically find a better set of tokens than greedy merges might. Second, it naturally supports multiple valid tokenizations of the same text with different probabilities. This means that during training of a language model, one can sample different tokenizations, according to their probabilities, for data augmentation. This is the subword regularization idea. It can make the model more robust, at the cost of some added complexity.
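The sampling idea can be sketched by enumerating every valid segmentation of a short word and drawing one in proportion to its probability. This is an illustration only: the token probabilities are made up, the function names are hypothetical, and exhaustive enumeration is only feasible for short strings, so real implementations rely on more efficient sampling schemes.

```python
import random

# Assumed toy model: token -> made-up probability.
TOY_PROBS = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.10, "ug": 0.10, "hug": 0.15}

def all_segmentations(text, probs):
    """List every (tokens, probability) pair that spells out `text` using vocab tokens."""
    if not text:
        return [([], 1.0)]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in probs:
            for rest, rest_prob in all_segmentations(text[end:], probs):
                results.append(([piece] + rest, probs[piece] * rest_prob))
    return results

def sample_segmentation(text, probs, rng):
    """Draw one segmentation with probability proportional to its likelihood."""
    candidates = all_segmentations(text, probs)
    tokens = [toks for toks, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
for _ in range(3):
    print(sample_segmentation("hugs", TOY_PROBS, rng))
```

Most draws will be ["hug", "s"] because it carries most of the probability mass, but the rarer segmentations still appear occasionally, which is what exposes the downstream model to alternative splits of the same word.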
HuggingFace's Transformers library provides a unified interface: one can simply do tokenizer = AutoTokenizer.from_pretrained(model_name) and it will download the appropriate tokenizer (BPE, WordPiece, or SentencePiece) for that model. For example, AutoTokenizer.from_pretrained("bert-base-uncased") will give a WordPiece tokenizer, whereas "gpt2" gives a byte-level BPE tokenizer. The heavy lifting is done by the Tokenizers library, which is written in Rust and which we can use as a reference when developing the OpenCV C++ version. Since I aim to develop a tokenizer that is fast and makes use of parallelism, which the HuggingFace tokenizers accomplish, the Rust implementation is a great place to look for techniques; more on this as the project progresses.
OpenAI's tiktoken library is another specialized tool, tailored to OpenAI's models such as GPT-3 and GPT-4. It is a very fast BPE implementation with a Python interface over a core written in Rust. For instance, tiktoken's cl100k_base encoding is used for GPT-4 and is essentially BPE with some OpenAI-specific quirks.
To see how tokenizers are trained, we can look at how HuggingFace's Tokenizers is used to train BPE, WordPiece, or Unigram models on a given dataset. I will reference the documentation provided by HuggingFace for training with BPE, WordPiece, and Unigram, but will focus mainly on BPE as the first model. Google's SentencePiece tool, on the other hand, can be used standalone to train a tokenizer that a model then loads. Another good resource is llama.cpp.
Beyond text, we often need to handle vision-language inputs, e.g. reading text from images and feeding it into an LLM, or using an LLM to reason about visual data. OpenCV is best known for computer vision, but it also includes a DNN module that can run deep learning models, and it can be part of a pipeline that bridges images to text tokens.
OpenCV's DNN module allows loading and running pretrained neural networks within the OpenCV environment. It supports models from various frameworks, including Caffe, TensorFlow, ONNX, PyTorch, and more, and can run them on CPU or GPU. I discuss this topic further here: OpenCV DNN
I propose to integrate native text tokenization support into OpenCV’s DNN module so that Large Language Models (LLMs) can be run end-to-end directly in OpenCV. Currently, to feed text into models like GPT-2 or GPT-3, users must rely on external libraries (e.g. tiktoken) for tokenization. My project will create a built-in tokenizer utility—implemented in C++ with Python bindings—that can convert raw text into token IDs and decode model outputs back into text. The core deliverable is a new cv::dnn::Tokenizer class supporting Byte Pair Encoding (BPE), along with the ability to load typical vocabulary/merges files and handle special tokens. Once implemented, this will allow a seamless pipeline for text-based or multimodal models within OpenCV: raw text -> tokens -> DNN inference -> tokens -> decoded text. I will also write comprehensive documentation and test this functionality against known tokenizer outputs (e.g., Hugging Face), ensuring correctness and good performance. By adding native tokenization, OpenCV’s DNN module becomes more versatile for cutting-edge AI tasks involving both vision and language. This reduces external dependencies, makes it easier to deploy LLMs in pure C++ or Python environments, and helps unify image and text processing in a single library—further strengthening OpenCV as a comprehensive open-source toolkit.
I will outline useful research papers and other relevant material covering the theory and architecture we need to know in order to plan the project.