Trimming embedding vocabularies: half the parameters, ~99% of the score
A multilingual embedding model keeps most of its parameters in one place: the token-embedding matrix. EmbeddingGemma-300M, for example, devotes roughly half of its weights to representing a vocabulary that spans a hundred languages. If you only ever serve one language, the other languages’ embeddings are dead weight: memory and bandwidth you pay for on every load and never use.
embedding-vocab-trimmer removes that dead weight. It shrinks a model’s vocabulary to a single language while leaving the encoder bit-for-bit unchanged, with no training and no GPU.
The result, first
On Portuguese, trimming EmbeddingGemma-300M to a 64k vocabulary keeps 98.8% of the full model’s MTEB(por) score at about half the parameters:
| Vocabulary | Parameters | MTEB(por) score | % of full |
|---|---|---|---|
| 64k | ~157M | 0.7172 | 98.8% |
| 128k | ~207M | 0.7192 | 99.1% |
| Full model | ~308M | 0.7257 | 100% |
64k is the sweet spot: essentially full-model quality at half the parameters. The score it is measured against (0.7257 for the full EmbeddingGemma-300M) is the same 16-task mean that model earns on the MTEB-PT leaderboard, which is what makes a compression claim like “98.8% of quality” checkable rather than rhetorical.
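The "% of full" column can be recomputed from the scores alone. The snippet below is a minimal check that uses only the numbers reported in the table; the dictionary layout and the `retained` helper are illustrative names, not part of the tool.

```python
# Scores and approximate parameter counts (in millions) copied from the table above.
rows = {
    "64k": {"params_m": 157, "score": 0.7172},
    "128k": {"params_m": 207, "score": 0.7192},
    "full": {"params_m": 308, "score": 0.7257},
}

full = rows["full"]


def retained(name):
    """Return (fraction of the full score, fraction of the full parameter count)."""
    row = rows[name]
    return row["score"] / full["score"], row["params_m"] / full["params_m"]


for name in rows:
    score_frac, param_frac = retained(name)
    print(f"{name:>5}: {score_frac:.1%} of score, {param_frac:.0%} of parameters")
```

For the 64k row this gives 98.8% of the score at roughly 51% of the parameters, which is where the "half the parameters" figure comes from.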
How it works
Four steps, all on CPU:
- Token frequency mining. Run a sample of target-language text through the tokenizer and count how often each token appears.
- Vocabulary selection. Keep the top-K most frequent tokens plus the special tokens (pad / bos / eos / unk), and re-index them contiguously.
- BPE merge filtering. Drop any merge rule that references a removed token, so the tokenizer can never emit an id that no longer exists.
- Embedding-matrix slicing. Extract the kept rows of the embedding matrix, update config.vocab_size, and reattach the unchanged encoder and pooling layers.
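The first three steps can be sketched on a toy vocabulary in plain Python. This is not the code in trim_vocab.py: the function names, the seven-token vocabulary, and the choice to count K over non-special tokens only are all assumptions made for illustration.

```python
from collections import Counter

# Toy vocabulary: token string -> id. All names and values here are illustrative.
vocab = {"<pad>": 0, "<unk>": 1, "a": 2, "b": 3, "c": 4, "ab": 5, "bc": 6}
merges = [("a", "b"), ("b", "c")]  # BPE rules: "a"+"b" -> "ab", "b"+"c" -> "bc"
special_ids = [vocab["<pad>"], vocab["<unk>"]]

# Stand-in for tokenizer output on a sample of target-language text.
corpus_ids = [[5, 2, 5, 5], [2, 3, 5, 2], [3, 6]]


def mine_frequencies(sequences):
    """Step 1: count how often each token id appears in the sample."""
    counts = Counter()
    for ids in sequences:
        counts.update(ids)
    return counts


def select_vocab(counts, special_ids, k):
    """Step 2: keep special tokens plus the k most frequent others.

    Returns a mapping old_id -> new_id with contiguous new ids.
    Assumption: k counts non-special tokens only.
    """
    specials = set(special_ids)
    frequent = [tok for tok, _ in counts.most_common() if tok not in specials][:k]
    kept = sorted(specials) + sorted(frequent)
    return {old: new for new, old in enumerate(kept)}


def filter_merges(merges, vocab, old_to_new):
    """Step 3: drop merge rules whose inputs or output were removed."""
    kept_tokens = {tok for tok, idx in vocab.items() if idx in old_to_new}
    return [
        (left, right)
        for left, right in merges
        if left in kept_tokens and right in kept_tokens and left + right in kept_tokens
    ]


counts = mine_frequencies(corpus_ids)
old_to_new = select_vocab(counts, special_ids, k=3)
kept_merges = filter_merges(merges, vocab, old_to_new)

print(old_to_new)
print(kept_merges)
```

In this toy run "c" never appears and "bc" appears once, so both fall outside the top 3. The rule ("b", "c") is then dropped because its output no longer has an id, while ("a", "b") survives.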
Only the embedding matrix changes. The transformer blocks and the pooling head are untouched, which is exactly why quality holds: you are deleting rows the target language rarely or never indexes, not approximating the ones it does.
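Slicing, as opposed to approximating, means every surviving row is copied exactly. A minimal NumPy sketch of that property follows, assuming a hypothetical `old_to_new` id mapping and a random 7x4 matrix standing in for the real embedding table.

```python
import numpy as np


def slice_embeddings(embeddings, old_to_new):
    """Return a matrix whose row new_id is an exact copy of the original row old_id."""
    # Order the old ids by their new position so the rows land contiguously.
    old_ids = sorted(old_to_new, key=old_to_new.get)
    return embeddings[old_ids]


# Illustrative stand-ins: 7 tokens, hidden size 4, and a mapping that keeps 5 of them.
rng = np.random.default_rng(0)
full_matrix = rng.normal(size=(7, 4)).astype(np.float32)
old_to_new = {0: 0, 1: 1, 2: 2, 3: 3, 5: 4}

trimmed = slice_embeddings(full_matrix, old_to_new)

# Every kept row is bit-identical to its source row; only the row count shrinks.
for old_id, new_id in old_to_new.items():
    assert np.array_equal(trimmed[new_id], full_matrix[old_id])

removed_params = full_matrix.size - trimmed.size
print(trimmed.shape, removed_params)
```

The parameters saved are simply the number of removed rows times the hidden size, which is why the savings scale with how much vocabulary is dropped and not with anything in the encoder.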
Try it
pip install -r requirements.txt
python trim_vocab.py \
--model google/embeddinggemma-300m \
--corpus-config por \
--vocab-size 64000 \
--output ./embeddinggemma-pt-br
The script accepts language codes (por, fra, deu, spa) or a custom dataset via --corpus-dataset. It works on SentenceTransformers models that pair a transformers encoder with a byte-fallback BPE tokenizer. A ready-made Portuguese build is on Hugging Face at tardellirs/embeddinggemma-pt-br.
What it is not
This is a compression method, not an enhancement. Trimming the vocabulary preserves quality; it does not improve it. Attempts to go further with fine-tuning or pruning of the encoder reduced the score; the encoder is doing the work, and it should be left alone. The honest framing is: same encoder, near-identical quality, half the footprint, for the one language you actually deploy.
Vocabulary trimming is an ablation alongside MTEB-PT, not part of the benchmark ranking. The benchmark is what makes it measurable: without a native Portuguese score to compare against, “98.8% of quality” would just be a number.
Code and details: github.com/tardellirs/embedding-vocab-trimmer (Apache-2.0; trimmed models inherit the base model’s license).