Method

Why a native Brazilian-Portuguese embedding benchmark?

By the MTEB-PT team · Federal Institute of São Paulo (IFSP)

If you need a sentence-embedding model for Brazilian Portuguese (for semantic search, classification, or retrieval-augmented generation), how do you choose one? Until now, the honest answer was: you guess, or you trust a multilingual leaderboard, or you read the most-reported Portuguese retrieval number. That number is almost always mMARCO-PT, a machine translation of English MS MARCO.

Translation is a useful first approximation. But it is not neutral. It introduces systematic artifacts (translation noise, domain drift, idiomatic flattening), and those artifacts tend to push every model toward the same score. When a benchmark cannot tell two models apart, it is not measuring the language; it is measuring the translator.

Native, by construction

MTEB-PT takes the opposite stance. It admits only data that was created or found in Portuguese, and excludes translated corpora as a design rule. The 16 tasks span six categories (classification, pair classification, semantic textual similarity, clustering, retrieval, and reranking) and are built from Brazilian sources: hate-speech and sentiment corpora, native legal and medical retrieval collections, Portuguese natural-language-inference and similarity datasets, and Wikipedia-derived clustering.

The point of the “native” filter is not purity for its own sake. It is discriminative power: native text restores the differences between models that translation washes out, so the leaderboard can actually rank them.

What native evaluation reveals

Three findings only become visible once you stop trusting the translated proxy.

The multilingual leaderboard is only a moderate predictor. Restricted to the models that appear on both boards, the Portuguese ranking correlates with the global Hugging Face MTEB multilingual leaderboard at Spearman ρ ≈ 0.72. That is real information, but since ρ² ≈ 0.52, roughly half the variance in rank is unexplained. One model that ranks 5th out of ~169 globally lands 49th of 54 here. A multilingual board is a useful first filter; it is not a substitute for evaluating on the language you actually deploy in.
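For readers who want to repeat this kind of comparison on their own pair of leaderboards, here is a minimal sketch of a rank correlation restricted to shared models. It is not the MTEB-PT analysis code; the function names and the dictionary layout (model name mapped to an aggregate score) are assumptions made for illustration, and the scores in the usage example are invented.

```python
def average_ranks(values):
    """Return 1-based ranks for `values`; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman_on_shared(board_a, board_b):
    """Spearman rho computed only over models present on both boards.

    board_a, board_b: dicts mapping model name -> score (hypothetical layout).
    Returns (rho, sorted list of shared model names).
    """
    shared = sorted(set(board_a) & set(board_b))
    if len(shared) < 3:
        raise ValueError("need at least three shared models")
    ranks_a = average_ranks([board_a[m] for m in shared])
    ranks_b = average_ranks([board_b[m] for m in shared])
    # Spearman rho is the Pearson correlation of the rank vectors.
    return pearson(ranks_a, ranks_b), shared


if __name__ == "__main__":
    # Invented scores, for illustration only.
    global_board = {"m1": 0.70, "m2": 0.65, "m3": 0.60, "m4": 0.55, "g_only": 0.80}
    pt_board = {"m1": 0.62, "m2": 0.66, "m3": 0.50, "m4": 0.52, "pt_only": 0.40}
    rho, shared = spearman_on_shared(global_board, pt_board)
    print(f"shared models: {len(shared)}, rho = {rho:.2f}, rho^2 = {rho * rho:.2f}")
```

Squaring ρ gives a rough sense of how much of the rank variance one board explains about the other. Models that appear on only one board are dropped before ranking, so the ranks being compared are ranks within the shared set.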

The top of the board is a statistical tie. We attach a confidence interval to every score (10,000 bootstrap resamples) and a paired-bootstrap significance test to every headline comparison. The result: the ten leading models are statistically indistinguishable. The choice among them is a question of cost, license, and latency, not accuracy.
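To make the two procedures concrete, here is a minimal sketch of a percentile bootstrap confidence interval and a paired-bootstrap comparison. It is written under assumptions the text above does not settle: that the resampling unit is a per-example score, that the statistic is the mean, and that the p-value is the two-sided percentile form. The function names are hypothetical and this is not the benchmark's released code.

```python
import random


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap on per-example score differences (A minus B).

    Both models must be scored on the same examples in the same order, so
    every resample draws the same examples for A and for B.
    Returns (observed mean difference, two-sided p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs scores on the same examples")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    at_or_below_zero = 0
    at_or_above_zero = 0
    for _ in range(n_resamples):
        mean_diff = sum(rng.choices(diffs, k=n)) / n
        if mean_diff <= 0:
            at_or_below_zero += 1
        if mean_diff >= 0:
            at_or_above_zero += 1
    # Two-sided: how often the resampled difference reaches or crosses zero.
    p_value = min(1.0, 2 * min(at_or_below_zero, at_or_above_zero) / n_resamples)
    return observed, p_value
```

Two models count as "statistically indistinguishable" in this sketch when the paired p-value stays above the chosen threshold, which usually coincides with heavily overlapping intervals. Pairing matters: resampling the differences removes per-example difficulty that both models share, so it is more sensitive than comparing two independently computed intervals.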

Not all tasks pull their weight. Borrowing Item Response Theory from psychometrics, we estimate how sharply each task separates models of similar ability. Retrieval tasks do most of the work; clustering tasks do the least. That tells a benchmark designer where to spend a fixed evaluation budget.
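The text does not say which IRT model is fitted, so the following sketch assumes the common two-parameter logistic (2PL) form purely to illustrate what "discrimination" means. Here `theta` stands for a model's latent ability, `b` for a task's difficulty, and `a` for its discrimination; all three names and the example values are assumptions, and no parameters are estimated from data.

```python
import math


def icc_2pl(theta, a, b):
    """2PL item characteristic curve: expected (0 to 1) outcome at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def separation(theta_1, theta_2, a, b):
    """Gap in expected outcome between two abilities on the same task."""
    return abs(icc_2pl(theta_1, a, b) - icc_2pl(theta_2, a, b))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


if __name__ == "__main__":
    # Two models of similar ability, evaluated on a sharp and a flat task.
    theta_1, theta_2 = -0.1, 0.1
    for label, a in (("high-discrimination task", 2.0),
                     ("low-discrimination task", 0.5)):
        gap = separation(theta_1, theta_2, a, b=0.0)
        info = item_information(0.0, a, b=0.0)
        print(f"{label}: gap = {gap:.3f}, information at theta=0 = {info:.3f}")
```

Under this assumed model, a task with a larger `a` produces a wider gap between two nearby abilities and carries more information around its difficulty level, which is the sense in which some tasks "do most of the work".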

Built to be reused

MTEB-PT is an install-on-top extension of the MTEB family, so existing tooling works unchanged. Every dataset is pinned to a commit; the benchmark is inference-only, with no training anywhere in the loop; and the full score matrix, per-task confidence intervals, and an interactive leaderboard are released openly.

If you are picking a Portuguese embedding model, start with the leaderboard, and read the confidence intervals before you read the ranks.