A Text Embedding Benchmark for Brazilian Portuguese
Native, not translated. 54 embedding models ranked on 16 Brazilian-Portuguese tasks, built only from text written in Portuguese (no machine-translated benchmarks), with confidence intervals, significance tests, and an analysis of which tasks actually separate models.
54 models on native Brazilian-Portuguese tasks, ranked by the 16-task mean. The top 15 are shown below; the full interactive table, the IRT ranking, and per-category views live on Hugging Face.
Figure: open-weight models, quality versus size (parameter count on a log scale). The dashed line is the Pareto frontier: the best 16-task mean reachable at each parameter budget.
| # | Model | Access | Params | License | Mean (16 tasks) |
|---|---|---|---|---|---|
| 1 | gemini-embedding-001 | Closed | n/a | proprietary | 0.744 |
| 2 | text-embedding-3-large | Closed | n/a | proprietary | 0.733 |
| 3 | Qwen3-Embedding-8B | Open | 8B | Apache-2.0 | 0.733 |
| 4 | gemini-embedding-2 | Closed | n/a | proprietary | 0.731 |
| 5 | Octen-Embedding-8B | Open | 8B | Apache-2.0 | 0.728 |
| 6 | embeddinggemma-300m | Open | 300M | Gemma | 0.726 |
| 7 | voyage-4-large | Closed | n/a | proprietary | 0.724 |
| 8 | harrier-oss-v1-27b | Open | 27B | MIT | 0.722 |
| 9 | embed-v4 | Closed | n/a | proprietary | 0.722 |
| 10 | Qwen3-Embedding-4B | Open | 4B | Apache-2.0 | 0.718 |
| 11 | KaLM-Embedding-Gemma3-12B-2511 | Open | 11.8B | Tencent-KaLM | 0.711 |
| 12 | F2LLM-v2-8B | Open | 8B | Apache-2.0 | 0.711 |
| 13 | text-embedding-3-small | Closed | n/a | proprietary | 0.710 |
| 14 | harrier-oss-v1-0.6b | Open | 0.6B | MIT | 0.699 |
| 15 | F2LLM-v2-14B | Open | 14B | Apache-2.0 | 0.691 |
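The Pareto frontier described for the quality-versus-size chart can be illustrated with a minimal sketch. It uses only the open-weight rows of the top-15 table above, so the result is not the frontier of the full 54-model chart: smaller models outside the top 15 could extend it. The `Model` type, the `pareto_frontier` helper, and the tie-breaking rule (at equal size, keep only the higher score) are assumptions made for illustration, not the benchmark's published code.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_b: float  # parameter count in billions
    mean16: float    # 16-task mean from the table


# Open-weight rows copied from the top-15 table (closed models have no size).
OPEN_TOP15 = [
    Model("Qwen3-Embedding-8B", 8.0, 0.733),
    Model("Octen-Embedding-8B", 8.0, 0.728),
    Model("embeddinggemma-300m", 0.3, 0.726),
    Model("harrier-oss-v1-27b", 27.0, 0.722),
    Model("Qwen3-Embedding-4B", 4.0, 0.718),
    Model("KaLM-Embedding-Gemma3-12B-2511", 11.8, 0.711),
    Model("F2LLM-v2-8B", 8.0, 0.711),
    Model("harrier-oss-v1-0.6b", 0.6, 0.699),
    Model("F2LLM-v2-14B", 14.0, 0.691),
]


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep models that beat every model of equal or smaller size."""
    frontier: list[Model] = []
    best = float("-inf")
    # Walk from smallest to largest; at equal size, visit the best score first.
    for m in sorted(models, key=lambda m: (m.params_b, -m.mean16)):
        if m.mean16 > best:
            frontier.append(m)
            best = m.mean16
    return frontier


if __name__ == "__main__":
    for m in pareto_frontier(OPEN_TOP15):
        print(f"{m.name}: {m.params_b}B params, mean16={m.mean16:.3f}")
```

On these nine rows the sketch keeps two models, embeddinggemma-300m and Qwen3-Embedding-8B: every other open-weight entry in the top 15 is matched or beaten by a model of equal or smaller size.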
The leaderboard now spans 169 models, 131 tasks, a retrieval benchmark with private data, and image, audio and video, but still no native Portuguese. Where MTEB-PT fits.
The MTEB team extends the playbook to video and audio with a 23-task benchmark. What it found, and why the method matters for text embeddings too.
The leaderboard’s top model is a closed API, yet the cost–quality frontier is shallow and a free open-weight model ties the leader.
Translated benchmarks quietly flatten the differences between models. Here is what changes when you evaluate on Portuguese that was written in Portuguese.
A multilingual model spends most of its parameters on tokens you never use. We cut EmbeddingGemma-300M to 157M for Portuguese, with zero training.
A full write-up (benchmark design, the statistical layer, IRT task discrimination, and a cross-leaderboard validity analysis) is in preparation and will be posted as an arXiv preprint (cs.CL). A citation will appear here when it is live.
Want your embedding model on the leaderboard? We accept submissions through either channel; pick whichever fits. Every score is reproducible from public scripts, so each new row can be audited.
Open a discussion: share the model ID and any prompt or pooling details. Best for a quick request.
Open an issue: prefer the code side? File an issue or a pull request on the benchmark repository.