Bluesky Thread

Limits of vector search


a new Google DeepMind (GDM) paper shows that single-vector embeddings can’t represent combinations of concepts well

e.g. Dave likes blue trucks AND Ford trucks

even k=2 sub-predicates (just two ANDed conditions) make SOTA embedding models fall apart

www.alphaxiv.org/pdf/2508.21038
On the Theoretical Limitations of Embedding-Based Retrieval
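a toy sketch of what that k=2 predicate actually demands (made-up docs and fields, purely illustrative — not the paper’s setup):

```python
# hypothetical docs with structured attributes
docs = [
    {"id": 1, "color": "blue", "make": "Ford"},
    {"id": 2, "color": "blue", "make": "Toyota"},
    {"id": 3, "color": "red", "make": "Ford"},
]

# the k=2 predicate: color == "blue" AND make == "Ford"
hits = [d["id"] for d in docs if d["color"] == "blue" and d["make"] == "Ford"]
# exact boolean logic keeps only doc 1; a single dense vector for
# "blue Ford trucks" tends to score docs 2 and 3 nearly as high,
# because the embedding blends both concepts into one point
```

the boolean filter is trivial over structured data — the failure mode is asking one vector to encode the intersection.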
btw even adding a reranker won’t help if you’ve already dropped the relevant results in the first-stage embedding retrieval

agentic search DOES work, but now you’re relying on an expensive LLM to resolve simple boolean logic
multi-vector (late interaction) search like ColBERT also works, because it handles the predicate logic in cheaper latent space, though storage costs are a lot higher because, well, it’s multi-vector

(fwiw Qdrant and a few other vector DBs support multi-vectors)

huggingface.co/jinaai/jina-...
jinaai/jina-colbert-v2 · Hugging Face
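a minimal sketch of the late-interaction (MaxSim) scoring idea, with tiny made-up 2-d token vectors (real ColBERT vectors are learned and much wider):

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    # late interaction: for each query token vector, take its max
    # similarity over all doc token vectors, then sum over query tokens
    sims = query_vecs @ doc_vecs.T          # (num_q_tokens, num_d_tokens)
    return sims.max(axis=1).sum()

# toy query with two "concept" token vectors, e.g. "blue" and "Ford"
q = np.array([[1.0, 0.0], [0.0, 1.0]])

doc_a = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # covers both concepts
doc_b = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])  # only covers one

maxsim(q, doc_a)  # 1.8 — each sub-predicate finds a matching doc token
maxsim(q, doc_b)  # 1.2 — the second sub-predicate has no good match
```

each query token independently “picks” its best doc token, which is why the AND over sub-predicates survives — at the cost of storing a vector per token.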
you really need to capture the query and decompose it into multiple sub-queries

e.g. maybe get a 1B-3B LLM to rewrite the query into a DSL (e.g. a JSON breakdown of the various components and concepts in the query)

and then push that logic into the database engine itself
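a sketch of what that could look like — the JSON DSL shape and field names here are invented, standing in for whatever a small rewriter LLM would emit:

```python
import json

# hypothetical output from a 1B-3B rewriter LLM: the free-text query
# "blue Ford trucks" decomposed into a JSON DSL of sub-predicates
rewritten = json.loads("""
{
  "op": "and",
  "clauses": [
    {"field": "color", "equals": "blue"},
    {"field": "make",  "equals": "Ford"}
  ]
}
""")

def to_sql_where(node):
    # push the boolean logic down into the database engine itself
    if "clauses" in node:
        joiner = f" {node['op'].upper()} "
        return "(" + joiner.join(to_sql_where(c) for c in node["clauses"]) + ")"
    return f"{node['field']} = '{node['equals']}'"

to_sql_where(rewritten)  # "(color = 'blue' AND make = 'Ford')"
```

the point is the small model only does translation; the AND/OR resolution happens in the engine, which is what it’s good at.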
alternatively, sparse approaches like SPLADE do the expansion in latent space but serve it with inverted indices (regular full-text search, exact matches)

arxiv.org/abs/2107.057...
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
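a sketch of the SPLADE-style serving path with made-up term weights (real SPLADE learns these, and expands docs with terms they never contained):

```python
from collections import defaultdict

# toy sparse vectors: term -> learned weight per doc
docs = {
    "d1": {"blue": 1.2, "ford": 0.9, "truck": 0.8, "azure": 0.3},  # "azure" = expansion term
    "d2": {"red": 1.1, "ford": 1.0, "truck": 0.7},
}

# build an inverted index, exactly like classic full-text search
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))

def search(query_vec):
    # score = sparse dot product, accumulated by walking posting lists
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index[term]:
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: -kv[1])

search({"blue": 1.0, "ford": 1.0})  # d1 ≈ 2.1 beats d2 = 1.0
```

so you get latent-space expansion at index time, but retrieval is still exact term matching over posting lists — cheap, and the boolean structure of the query survives.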
imo if search is done perfectly, you effectively drive your LLM context to infinity

but it’s very much not a solved problem

to illustrate how underdeveloped this space is: research from ~5 years ago still represents the best ideas (contrast that with LLMs)
81 likes 23 reposts
