MVEB: embeddings go to video

By the MTEB-PT team · Federal Institute of São Paulo (IFSP)

Embeddings started with text. They have since spread to images and code, and a new paper pushes them one modality further: MVEB, the Massive Video Embedding Benchmark. It comes from the core MTEB team (including Niklas Muennighoff and Kenneth Enevoldsen, the people behind the text benchmarks the rest of our work builds on), and it is worth a look even if you only ever touch text.

What it measures

MVEB is a 23-task benchmark for video embeddings, distilled from a much larger pool of 184 tasks (“MVEB+”) to keep evaluation tractable without losing coverage. The tasks span the now-familiar MTEB shape (classification, zero-shot classification, clustering, pair classification, and retrieval) plus video-centric question answering. Crucially, it is not video alone: the benchmark works across video, audio, and text, and the paper evaluates 33 models on it.

What it found

Two results stand out. First, no single architecture wins everything. Embeddings from multimodal large language models lead on classification, clustering, pair classification, and QA; multimodal-binding approaches lead on retrieval and zero-shot classification. The right model depends on the task, the same lesson the text leaderboards keep teaching.

Second, audio is not free signal. Adding the audio track helps when the labels genuinely depend on sound but hurts when they derive from the visuals alone, a swing of about six points in either direction. More modalities do not automatically mean more information; the outcome depends on what the task actually rewards.

Why the method matters for text

MVEB plugs into the MTEB ecosystem, so video joins text, image, audio, and code under a single evaluation framework. But the deeper point is methodological. The hard part of a video benchmark is not the videos; it is the discipline of selecting tasks that actually discriminate between models, pruning 184 candidates down to 23 without losing signal, and aggregating fairly across heterogeneous tasks.

That is exactly the discipline behind a good text benchmark, and it is the discipline we lean on for MTEB-PT: confidence intervals on every score, significance tests on every comparison, and an item-response analysis of which tasks separate models and which are redundant. A benchmark earns trust the same way whether it encodes a sentence in Portuguese or a ten-second clip: by being honest about what it can and cannot tell apart.
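To make "confidence intervals on every score" concrete, here is a minimal sketch of one common way to get them: a percentile bootstrap over per-item scores, reused on paired differences to compare two models. It is an illustration under stated assumptions, not the MTEB-PT or MVEB implementation. The function names (`bootstrap_ci`, `paired_diff_ci`), the defaults, and the toy data are all hypothetical.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; defaults are arbitrary choices.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample the items with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def paired_diff_ci(scores_a, scores_b, **kwargs):
    """Bootstrap CI for the mean per-item difference between two models.

    Both models must be scored on the same items, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(diffs, **kwargs)


# Toy data (made up): 1 = item answered correctly, 0 = incorrectly.
model_a = [1] * 80 + [0] * 20
model_b = [1] * 50 + [0] * 50

print("model A accuracy CI:", bootstrap_ci(model_a))
print("A minus B difference CI:", paired_diff_ci(model_a, model_b))
```

If the interval for the difference excludes zero, the two models can be told apart on this data at the chosen level; if it straddles zero, the leaderboard gap may be noise. Pairing matters because both models see the same items, so resampling the differences removes the item-difficulty variance that the models share.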

Reference: El Assadi et al., “MVEB: Massive Video Embedding Benchmark” (arXiv:2606.14958).