Bluesky Thread

mxbai-edge-colbert-v0: tiny long context multi-vector embedding models


This report is huge; it gives us:

- Apache 2.0 17M(!!) and 32M models
- tops the LongEmbed benchmark
- reproducible(!!) training pipelines
- extensive ablations to understand ColBERT models

www.mixedbread.com/blog/edge-v0
Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0
Introducing our new family of extremely efficient ColBERT models, to serve as backbones for modern late interaction research while outperforming ColBERTv2 with just 17 million parameters.
tech report has more details: www.mixedbread.com/papers/small...

it’s basically a how-to manual for training a SOTA late interaction model
steps are:

1. contrastive pre-training (a rough sketch follows below)
2. fine-tuning
3. knowledge distillation
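
For reference, step 1 typically looks like standard in-batch-negative contrastive training (InfoNCE). A minimal sketch; the report's exact loss, temperature, and negative-mining setup may differ, and the values below are placeholders:

```python
# Minimal sketch of contrastive pre-training with in-batch negatives
# (InfoNCE). The temperature and batch construction here are
# placeholders, not the report's actual settings.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """query_emb, doc_emb: (batch, dim) single-vector embeddings.
    doc_emb[i] is the positive for query_emb[i]; every other document
    in the batch serves as an in-batch negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)  # diagonal entries are the positive pairs
```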

they call out distillation as the key that lets their model outperform much larger ones
but all that is just on *single vector* training data. They start with traditional embeddings and then shift to multi-vector
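
A common shape for that single-vector distillation stage is to push the student's query-document similarity distribution toward a larger teacher's; whether this matches the report's "Stella-style" recipe in detail is an assumption on my part:

```python
# Hedged sketch of single-vector embedding distillation: KL divergence
# between the teacher's and student's query-document similarity
# distributions. Generic recipe; the report's "Stella-style" stage may
# differ in the exact targets and loss terms.
import torch.nn.functional as F

def similarity_distillation_loss(student_q, student_d, teacher_q, teacher_d, tau=0.02):
    """All inputs: (batch, dim) single-vector embeddings."""
    s_sim = F.normalize(student_q, dim=-1) @ F.normalize(student_d, dim=-1).T
    t_sim = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_d, dim=-1).T
    return F.kl_div(
        F.log_softmax(s_sim / tau, dim=-1),
        F.softmax(t_sim / tau, dim=-1),
        reduction="batchmean",
    )
```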
A flowchart labeled “Fig. 1. An overview of the full training process.”

It begins with a red box labeled Base Model, which feeds into a larger blue-outlined block titled Single-Vector Training. Inside this block are three stacked green and blue steps:
1. Contrastive Training
2. Retrieval Fine-tuning
3. Stella-Style Distillation

An arrow from this block leads to a blue box labeled Dense Embedding Model, which then points to a tan box labeled ColBERT KL-Div Training.

Finally, an arrow flows downward to an orange box labeled mxbai-edge-colbert-v0.

The diagram shows a sequential pipeline: a base model undergoes progressive single-vector training (contrastive, retrieval, distillation), producing a dense embedding model, which is further refined with ColBERT KL-divergence training to yield the final model.
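
The multi-vector step at the end is where late interaction comes in: score query-document pairs with ColBERT-style MaxSim over token embeddings, and train with a KL-divergence loss against a teacher's scores. A rough sketch; the teacher, negative sampling, and score normalization are assumptions, not details from the report:

```python
# Sketch of ColBERT-style MaxSim scoring plus KL-divergence distillation
# over (positive + negatives) scores, matching the "ColBERT KL-Div
# Training" box in the figure at a high level.
import torch
import torch.nn.functional as F

def maxsim_score(q_tokens, d_tokens):
    """q_tokens: (n_q, dim), d_tokens: (n_d, dim) token embeddings.
    For each query token, take its max similarity over document tokens,
    then sum over query tokens (late interaction)."""
    sim = F.normalize(q_tokens, dim=-1) @ F.normalize(d_tokens, dim=-1).T
    return sim.max(dim=-1).values.sum()

def colbert_kd_loss(student_scores, teacher_scores):
    """scores: (batch, n_docs) MaxSim scores for one positive plus
    several negatives per query; distill the teacher's distribution."""
    return F.kl_div(
        F.log_softmax(student_scores, dim=-1),
        F.softmax(teacher_scores, dim=-1),
        reduction="batchmean",
    )
```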
whoah, Muon works for tiny models?! i thought it was only for training huge models
Optimizers: We benchmarked both AdamW [18] and Muon [10] across a range of learning rates with a fixed batch size. We present the results of these ablations in Table 6. Our results indicate that even with limited experiments and the relatively small batch size that is commonly employed to train late-interaction models, Muon appears to be a strong optimizer for ColBERT model training.
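
For context on what Muon is actually doing: a simplified, unofficial sketch of the update for a single 2-D weight matrix (momentum-accumulated gradient, approximately orthogonalized with a few Newton-Schulz iterations). Real implementations add Nesterov momentum, per-shape scaling, and fall back to AdamW for embeddings and other non-matrix parameters; the hyperparameters below are placeholders, not the paper's:

```python
# Simplified, unofficial sketch of a Muon update for one 2-D weight
# matrix. Not the paper's implementation; scaling, Nesterov momentum,
# and the AdamW fallback for non-matrix params are omitted.
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration commonly used by Muon to
    # approximately orthogonalize the momentum-averaged gradient.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(weight, grad, momentum_buf, lr=0.02, momentum=0.95):
    # Accumulate momentum, orthogonalize, apply as the update direction.
    momentum_buf.mul_(momentum).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```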