
Trimming embedding vocabularies: half the parameters, ~99% of the score

A multilingual embedding model keeps most of its parameters in one place: the token-embedding matrix. EmbeddingGemma-300M, for example, devotes roughly two-thirds of its weights (about 200M of ~308M parameters) to representing a vocabulary that spans a hundred languages. If you only ever serve one language, the other languages’ embeddings are dead weight: memory and bandwidth you pay for on every load and never use.

embedding-vocab-trimmer removes that dead weight. It shrinks a model’s vocabulary to a single language while leaving the encoder bit-for-bit unchanged, with no training and no GPU.

The result, first

On Portuguese, trimming EmbeddingGemma-300M to a 64k vocabulary keeps 98.8% of the full model’s MTEB(por) score at about half the parameters:

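| Vocabulary | Parameters | MTEB(por) score | % of full |
|------------|------------|-----------------|-----------|
| 64k        | ~157M      | 0.7172          | 98.8%     |
| 128k       | ~207M      | 0.7192          | 99.1%     |
| Full model | ~308M      | 0.7257          | 100%      |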

64k is the sweet spot: essentially full-model quality at half the parameters. The score it is measured against (0.7257 for the full EmbeddingGemma-300M) is the same 16-task mean that model earns on the MTEB-PT leaderboard, which is what makes a compression claim like “98.8% of quality” checkable rather than rhetorical.
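The percentages in the table are simple to recompute, which is a useful habit for any compression claim. The sketch below uses only the numbers reported above; the hidden size of 768 and the reading of "128k" as 128,000 rows are assumptions made here for the parameter estimate, not figures stated in this post.

```python
# Scores and parameter counts (in millions) copied from the table above.
ROWS = {
    "64k": {"params_m": 157, "score": 0.7172},
    "128k": {"params_m": 207, "score": 0.7192},
    "full": {"params_m": 308, "score": 0.7257},
}
FULL = ROWS["full"]


def retention(score, full_score=FULL["score"]):
    """Percentage of the full model's score that a trimmed model keeps."""
    return 100.0 * score / full_score


def param_share(params_m, full_params_m=FULL["params_m"]):
    """Percentage of the full model's parameters that remain."""
    return 100.0 * params_m / full_params_m


# ASSUMPTION: each vocabulary row holds 768 values (hidden size not stated in the post).
ASSUMED_HIDDEN_SIZE = 768


def embedding_params_m(vocab_size, hidden=ASSUMED_HIDDEN_SIZE):
    """Rough size of the embedding matrix in millions of parameters."""
    return vocab_size * hidden / 1e6


for name, row in ROWS.items():
    print(f"{name:>4}: {retention(row['score']):.1f}% of score, "
          f"{param_share(row['params_m']):.0f}% of parameters")
```

Under that assumption, going from 64,000 to 128,000 rows costs about 49M parameters, in line with the ~50M gap between the 64k and 128k rows of the table.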

How it works

Four steps, all on CPU:

  1. Token frequency mining. Run a sample of target-language text through the tokenizer and count how often each token appears.
  2. Vocabulary selection. Keep the top-K most frequent tokens plus the special tokens (pad / bos / eos / unk), and re-index them contiguously.
  3. BPE merge filtering. Drop any merge rule that references a removed token, so the tokenizer can never emit an id that no longer exists.
  4. Embedding-matrix slicing. Extract the kept rows of the embedding matrix, update config.vocab_size, and reattach the unchanged encoder and pooling layers.
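Steps 1 to 3 can be sketched on a toy vocabulary in plain Python. This is an illustration of the idea, not the code in trim_vocab.py: the function names are hypothetical, K is read here as "K frequent tokens in addition to the special tokens", and a merge is assumed to produce the concatenation of its two parts.

```python
from collections import Counter


def mine_frequencies(token_id_sequences):
    """Step 1: count how often each token id occurs in the tokenized sample."""
    counts = Counter()
    for ids in token_id_sequences:
        counts.update(ids)
    return counts


def select_vocab(counts, special_ids, k):
    """Step 2: keep special tokens plus the k most frequent others.

    Returns a mapping old id -> new contiguous id.
    """
    frequent = [tid for tid, _ in counts.most_common() if tid not in special_ids][:k]
    kept = sorted(set(special_ids) | set(frequent))  # keep the original relative order
    return {old: new for new, old in enumerate(kept)}


def filter_merges(merges, id_of, old_to_new):
    """Step 3: keep a merge (a, b) only if a, b and the merged token a+b all survive."""
    def alive(token):
        return id_of.get(token) in old_to_new
    return [(a, b) for a, b in merges if alive(a) and alive(b) and alive(a + b)]


# Toy tokenizer: ids 2-4 belong to text that never shows up in the sample.
id_of = {"<pad>": 0, "<unk>": 1, "x": 2, "y": 3, "xy": 4,
         "o": 5, "l": 6, "a": 7, "ol": 8, "ola": 9}
merges = [("x", "y"), ("o", "l"), ("ol", "a")]
sample = [[9, 7, 5], [9, 8, 6], [9, 7]]  # already-tokenized target-language text

counts = mine_frequencies(sample)
old_to_new = select_vocab(counts, special_ids={0, 1}, k=5)
kept_merges = filter_merges(merges, id_of, old_to_new)
print(old_to_new)
print(kept_merges)
```

The ("x", "y") merge disappears because its tokens were never seen, so the trimmed tokenizer has no path to an id outside the new range.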

Only the embedding matrix changes. The transformer blocks and the pooling head are untouched, which is exactly why quality holds: you are deleting rows the language never indexes, not approximating the ones it does.
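A small sketch makes the "deleting, not approximating" point concrete. It uses nested lists in place of a real weight tensor, and the helper name and toy values are made up for illustration; with an actual model the same operation would be a row selection on the embedding weights.

```python
def slice_embeddings(matrix, old_to_new):
    """Step 4: copy each kept row of the full matrix into its new position."""
    trimmed = [None] * len(old_to_new)
    for old, new in old_to_new.items():
        trimmed[new] = matrix[old]
    return trimmed


# Toy full matrix: 6 tokens, 2 dimensions each.
full_matrix = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2],
               [0.3, 0.3], [0.4, 0.4], [0.5, 0.5]]
# Hypothetical result of vocabulary selection: ids 2 and 3 were dropped.
old_to_new = {0: 0, 1: 1, 4: 2, 5: 3}
trimmed_matrix = slice_embeddings(full_matrix, old_to_new)

# The same text, tokenized with the old ids and with the re-indexed ids.
old_ids = [4, 5, 1]
new_ids = [old_to_new[i] for i in old_ids]

vectors_before = [full_matrix[i] for i in old_ids]
vectors_after = [trimmed_matrix[i] for i in new_ids]
print(vectors_before == vectors_after)
```

For any input made only of kept tokens, the encoder receives exactly the vectors it would have received from the full model, just looked up under different row numbers.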

Try it

pip install -r requirements.txt

python trim_vocab.py \
    --model google/embeddinggemma-300m \
    --corpus-config por \
    --vocab-size 64000 \
    --output ./embeddinggemma-pt-br

It accepts language codes (por, fra, deu, spa) or a custom dataset via --corpus-dataset, and works on SentenceTransformers models with a transformers encoder and a BPE tokenizer with byte-fallback. A ready-made Portuguese build is on Hugging Face at tardellirs/embeddinggemma-pt-br.
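Once a trimmed model exists, using it should look like using any other SentenceTransformers model. The helper below is a minimal sketch under that assumption: the function name is hypothetical, and it assumes the published build loads directly by its Hugging Face id and that you have accepted any license terms the base model requires.

```python
def embed_portuguese(sentences, model_id="tardellirs/embeddinggemma-pt-br"):
    """Encode sentences with the trimmed build; returns one vector per sentence."""
    # Imported lazily so that defining this helper does not require the library.
    from sentence_transformers import SentenceTransformer

    # ASSUMPTION: the trimmed build loads like a regular SentenceTransformers model.
    model = SentenceTransformer(model_id)
    # Unit-length vectors make dot product and cosine similarity equivalent.
    return model.encode(sentences, normalize_embeddings=True)


# Example call (downloads the model on first use):
# vectors = embed_portuguese(["O tempo está ótimo hoje.", "Qual é a capital do Brasil?"])
```

Pointing model_id at a local output directory such as ./embeddinggemma-pt-br should work the same way for a model you trimmed yourself.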

What it is not

This is a compression method, not an enhancement. Trimming the vocabulary preserves quality; it does not improve it. Attempts to go further with fine-tuning or pruning of the encoder reduced the score; the encoder is doing the work, and it should be left alone. The honest framing is: same model, same quality, half the footprint, for the one language you actually deploy.

Vocabulary trimming is an ablation alongside MTEB-PT, not part of the benchmark ranking. The benchmark is what makes it measurable: without a native Portuguese score to compare against, “98.8% of quality” would just be a number.

Code and details: github.com/tardellirs/embedding-vocab-trimmer (Apache-2.0; trimmed models inherit the base model’s license).