Vector Stores Are Dumb

Mon August 12, 2024

“Is this magic?”

Yeah, that’s a real quote from me the first time I used Qdrant with OpenAI embeddings. But after building a few apps, the magic wore off and annoyance set in. Now, my mantra is,

“this is dumb, it shouldn’t be so dumb”

Over time, I’ve become convinced that, while they sometimes feel magical, the dumb-ness of vector stores only goes away when we decide to embrace something more structured, like a graph database or knowledge graph.

Chunking Is Dumb

The idea behind vector stores like Qdrant is to find documents that are similar to the query. The dumb part is that long documents can distract from the contents and confuse the vector store.

Let’s say you have an article about analyzing the liveness properties of Redis, but it also has a heavy dose of memes and jokes about furries as well as rants about programming languages. The problem is an embedding vector only represents a single point in space. So maybe 50% of the magnitude of the embedding vector is dedicated to distributed systems, the rest might be divided over furries and programming languages. So the “point in space” that represents the article isn’t as on-topic as you’d think it should be.

So what do you do? You chunk it. You break the text up into smaller pieces so that each embedding vector is more focused and matches similarity queries more acurately.

But how big should the chunks be? Obviously too big is a problem, but too small is also a problem if it’s so small that it all the context is missing. So how big do you make it? The internet typically says stuff like “250 word chunks is good”. But the truth is more complicated than that. Dense writing like science research or law can cover a lot of ideas in 250 words. Then again, other writing contains a lot of subtle references, and small chunks don’t give the embedding model enough information to work off of (example: replies to a tweet).

Chunks are just too primative, but they’re fundamental to vector stores.

Graphs of Ideas

The solution is obvious. Small chunks are better, so boil it down as small as it goes: ideas.

graph TD Redis-->has["has a"]-->rep["replication protocol"] Redis-->uses["uses"]-->lead["leader/follower replication"]-->is["is a"]-->rep

Identify ideas and things and then map their relationships. Maybe it’s a strict knowledge graph, maybe it’s looser, but either way it’s a hella lot more structured than a pile of text.

When you’re prompting the LLM, you use graph algorithms to carve off the most similart part and distill it down to basic statements:

Redis has a replication protocol
Redis uses leader/follower replication
leader/follower replication is a replication protocol

Walking the graph also jumps between disperate ideas that don’t initially seem connected when approached via a direct similarity search. As a result, the AI chat ends up feeling a whole lot more intelligent.

Provenance: How Did You Get So Dumb?

I generally call my software “dumb” when there’s a bug. LLM software is no different, and with RAG, the bad answer is almost always because it didn’t find the right document. And since I log literally everything (I hope you do too), I get the pleasure of reading through a list of text snippets that are chopped up so horrendously that I start to wonder how tf any of this even works at all.

Right, so aside from chunking being bad, the debugging process is really primative. When you finally find the issue, it’s typically in the ingestion code that seems very detatched from runtime querying. And fixing it is as simple as re-ingesting most (if not all) of your database because you can’t just query it like a normal database to find all the problems.

Again, graphs. The answer is graphs. They’re structured, you can pinpoint individual facts. You can mark each node & edge with the document(s) that corroborate it. But most important: you can just update a single fact, or delete it. Just one.

Collaboration is Critical

This is extremely important. Subject matter experts (SMEs) often don’t have programming skills, and certainly aren’t elbow deep in your particular ingestion code. So you often can’t utilize SMEs for QA & testing. Or at least not effectively, since you need a SME to come up with the questions and then also a programmer to answer them.

Graphs move that back into the realm of a simple CRUD app. And those sorts of CRUD apps exist, off-the-shelf. e.g. Neo4j has pre-built generic tools for visualizing & editing graph databases.

If you give your SME a simple UI for them to query the database, they can be a LOT MORE effective as an expert. I saw this on repeat when working on data systems in healthcare. The domain is so complex that most programmers don’t understand more than the basics. On the other hand, most business people don’t have that much trouble picking up a basic level of SQL knowledge, enough to answer 70% of their questions autonomously.

When the experts are empowered, the bug reports get dramatically better.

Validation Shouldn’t Be So Dumb

An oft-cited problem with LLMs is the security angle. Particularly how you can trivially perform prompt injection if you gain enough access to write an article that get ingested into the RAG vector store. And once it gets ingested, it’s nearly impossible to find, because chunking is dumb and graph databases can definitely solve this.

How do graphs solve this? Because you have to parse everything that goes in, and parsing can be better than validation. Is it perfect? Absolutely not, you can still inject false statements. But it’s a lot harder to exploit.

I’m not sure what a complete solution will be, but vector stores give you zero hooks for grappling with the problem whereas graphs give you some.

Graph It Up!

Alright, are you convinced graph databases are a good idea for LLM apps? Great, but you’ll quickly discover that building knowledge graphs from text isn’t entirely easy yet.

Tools like Triplex help you automatically construct knowledge graphs. Sounds promising, but there’s still quite a bit of configuration to get right.

I’m building a tool that makes this easier for you to have your own personal knowlege graph. I believe everyone should be able to have a personalized “AI” that can be “trained” just by shouting voice notes to yourself, or by pointing it at podcasts and videos you wish you had time to listen to.