Tutorial

Trimming embedding vocabularies: half the parameters, ~99% of the score

By the MTEB-PT team · Federal Institute of São Paulo (IFSP)

A multilingual embedding model keeps most of its parameters in one place: the token-embedding matrix. EmbeddingGemma-300M, for example, devotes roughly half of its weights to representing a vocabulary that spans a hundred languages. If you only ever serve one language, the other languages’ embeddings are dead weight, memory and bandwidth you pay for on every load and never use.

embedding-vocab-trimmer removes that dead weight. It shrinks a model’s vocabulary to a single language while leaving the encoder bit-for-bit unchanged, with no training and no GPU.

The result, first

On Portuguese, trimming EmbeddingGemma-300M to a 64k vocabulary keeps 98.8% of the full model’s MTEB(por) score at about half the parameters:

Vocabulary	Parameters	MTEB(por) score	% of full
64k	~157M	0.7172	98.8%
128k	~207M	0.7192	99.1%
Full model	~308M	0.7257	100%

64k is the sweet spot: essentially full-model quality at half the parameters. The score it is measured against (0.7257 for the full EmbeddingGemma-300M) is the same 16-task mean that model earns on the MTEB-PT leaderboard, which is what makes a compression claim like “98.8% of quality” checkable rather than rhetorical.

How it works

Four steps, all on CPU:

Token frequency mining. Run a sample of target-language text through the tokenizer and count how often each token appears.
Vocabulary selection. Keep the top-K most frequent tokens plus the special tokens (pad / bos / eos / unk), and re-index them contiguously.
BPE merge filtering. Drop any merge rule that references a removed token, so the tokenizer can never emit an id that no longer exists.
Embedding-matrix slicing. Extract the kept rows of the embedding matrix, update config.vocab_size, and reattach the unchanged encoder and pooling layers.

Only the embedding matrix changes. The transformer blocks and the pooling head are untouched, which is exactly why quality holds: you are deleting rows the language never indexes, not approximating the ones it does.

Try it

pip install -r requirements.txt

python trim_vocab.py \
    --model google/embeddinggemma-300m \
    --corpus-config por \
    --vocab-size 64000 \
    --output ./embeddinggemma-pt-br

It accepts language codes (por, fra, deu, spa) or a custom dataset via --corpus-dataset, and works on SentenceTransformers models with a transformers encoder and a BPE tokenizer with byte-fallback. A ready-made Portuguese build is on Hugging Face at tardellirs/embeddinggemma-pt-br.

What it is not

This is a compression method, not an enhancement. Trimming the vocabulary preserves quality; it does not improve it. Attempts to go further with fine-tuning or pruning of the encoder reduced the score; the encoder is doing the work, and it should be left alone. The honest framing is: same model, same quality, half the footprint, for the one language you actually deploy.

Vocabulary trimming is an ablation alongside MTEB-PT, not part of the benchmark ranking. The benchmark is what makes it measurable: without a native Portuguese score to compare against, “98.8% of quality” would just be a number.

Code and details: github.com/tardellirs/embedding-vocab-trimmer (Apache-2.0; trimmed models inherit the base model’s license).