MTEB in 2026: an evaluation ecosystem, and the Portuguese gap

By the MTEB-PT team · Federal Institute of São Paulo (IFSP)

When the Massive Text Embedding Benchmark (MTEB) launched, it was one leaderboard for English text. In 2026 it is something larger: a whole evaluation ecosystem, with a refreshed home at mteb.org. The multilingual leaderboard alone now ranks 169 models across 131 tasks. If you build or pick embedding models, this is the map, so it is worth knowing what is on it and what is not.

Retrieval, with a twist: RTEB

The most interesting recent addition is RTEB, the Retrieval Embedding Benchmark, now live as a dedicated retrieval section. Its design tackles a real problem: leaderboard scores drift upward as models quietly train on the same public test sets, a “teaching to the test” effect that makes the numbers look better than the models are.

RTEB's answer is to mix open datasets (public and fully reproducible) with private datasets held only by the maintainers. A model that memorized the public sets but cannot generalize will score well on the open half and poorly on the private half, and that gap is the signal RTEB is designed to expose. It spans the domains where retrieval actually earns its keep (law, healthcare, finance, and code) across roughly fifteen languages and language groups, among them Chinese, German, French, Japanese, Korean, Dutch, Polish, Russian, Thai, Persian, Vietnamese, Spanish, and the Indic and Scandinavian families.
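To make the open-versus-private idea concrete, here is a minimal sketch of how such a gap could be computed from per-model averages. It is an illustration only, not RTEB's actual scoring code: the RetrievalScores type, the model names, and every number are hypothetical placeholders, and the choice of a single mean score per half is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RetrievalScores:
    """Mean retrieval score on each half of the benchmark (hypothetical layout)."""
    open_mean: float     # average over public, reproducible datasets
    private_mean: float  # average over private, maintainer-held datasets


def generalization_gap(scores: RetrievalScores) -> float:
    """Open minus private; a large positive value hints at overfitting to public sets."""
    return scores.open_mean - scores.private_mean


# Placeholder numbers for two hypothetical models, not real leaderboard results.
models = {
    "model_a": RetrievalScores(open_mean=0.70, private_mean=0.68),
    "model_b": RetrievalScores(open_mean=0.74, private_mean=0.59),
}

gaps = {name: generalization_gap(s) for name, s in models.items()}

# Rank by the private half, which a model cannot have seen during training.
by_private = sorted(models, key=lambda name: models[name].private_mean, reverse=True)

for name in by_private:
    print(f"{name}: private={models[name].private_mean:.2f} gap={gaps[name]:+.2f}")
```

In this toy case model_b leads on the open half but falls behind once the private half is considered, which is the pattern the mixed design is meant to surface.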

Beyond text

The ecosystem has also outgrown text. Under the same roof and the same evaluation discipline, MTEB now reaches across modalities: image embeddings, code, and, most recently, audio and video, the latter through the new Massive Video Embedding Benchmark. The promise is a single, consistent way to compare embedding models no matter what they encode.

The language-specific family

Alongside the global multilingual board, MTEB has grown a family of language-specific benchmarks (Scandinavian, Chinese, French, Polish, German, Russian, Vietnamese), each maintained by people who actually speak the language. These exist for a reason we have written about before: a global multilingual rank is only a moderate predictor of how a model behaves on any one language. On native Brazilian Portuguese, that correlation is about ρ = 0.72, which is useful but no substitute for a native benchmark.
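For readers who want to run this kind of check on their own model list, the sketch below computes a rank correlation between two sets of scores. It assumes ρ denotes Spearman's rank correlation, which this post does not state explicitly, and the scores are hypothetical placeholders rather than MTEB or MTEB-PT results.

```python
def average_ranks(values):
    """Rank values from 1 (largest) downward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for five hypothetical models, listed in the same order.
global_scores = [0.66, 0.64, 0.61, 0.58, 0.55]
portuguese_scores = [0.59, 0.63, 0.52, 0.57, 0.50]

rho = spearman(global_scores, portuguese_scores)
print(f"rho = {rho:.2f}")
```

A value well below 1 means that some models swap places between the two boards, so the global rank alone would mislead a practitioner choosing a model for one language.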

The Portuguese gap

Here is what is not on the map. Across the multilingual board, RTEB's language list, and the language-specific family, Brazilian Portuguese still has no dedicated, native benchmark, despite being spoken by over 200 million people. The most-reported Portuguese retrieval number remains a machine translation of English MS MARCO.

That gap is the entire reason MTEB-PT exists. It is built as an install-on-top extension of MTEB, using the same tooling and the same conventions, so it slots straight into this ecosystem and is designed to be contributed upstream. It offers native Brazilian-Portuguese tasks, the same statistical rigor the rest of the family is moving toward, and a leaderboard that ranks the models a Portuguese practitioner would actually deploy.

The value of an ecosystem is shared method. MTEB-PT is simply the Portuguese-speaking citizen of it.

Further reading: the MTEB leaderboard; coverage of RTEB on InfoQ.