Ambient Associative Memory
Recently I started experimenting with ambient associative memory in my open-strix agents. I’m convinced ambient memory is a piece of the puzzle, though I doubt I’ve landed on the best way to build it.
Break it down:
- Ambient — always there, not at the forefront, but always operating
- Associative — causes the agent to associate what they’re currently doing with something that happened a while ago, or something someone else is doing
- Memory — ability to recall things that happened or were learned previously. I wrote about current memory patterns in my last post
What I’ve done:
- Index all memory with a late interaction (multi-vector) embedding model
- On every single tool call, query the index
- Include the top 3 hits, but at the injection site include only 8-12 words from each, along with the file path & offsets within the file
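As a sketch, the injection step might look like this. The `Hit` shape and `format_ambient_memory` name are my own, hypothetical, names — the point is just how little text actually reaches the prompt:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    path: str      # file the memory lives in
    start: int     # character offset of the excerpt within that file
    end: int
    snippet: str   # the 8-12 word excerpt around the best-matching token

def format_ambient_memory(hits: list[Hit], top_k: int = 3) -> str:
    """Render the top hits as a compact block appended after a tool call."""
    lines = ["<ambient-memory>"]
    for h in hits[:top_k]:
        lines.append(f'- "{h.snippet}" ({h.path}:{h.start}-{h.end})')
    lines.append("</ambient-memory>")
    return "\n".join(lines)
```

If a snippet looks relevant, the agent can open the file at the given offsets; if not, the cost was a dozen words of context.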
It’s ambient because it happens on every tool call. The agent isn’t intentionally searching. They do whatever they’re asked to do and something randomly comes to mind.
My agents keep making the same mistake twice. In the debrief they nail the lesson — “next time, check X before changing Y”. So we add it to the rules, the pile grows, and then the pile just gets ignored.
Ambient associative memory changes this by forcefully (but gently) bringing relevant parts of their memory to mind, creating coherence across that memory.
Late Interaction Models
The 8-12 word limit is also important. The injection is small and lightweight, and only the most relevant parts of the most relevant chunks are included. You can’t do this with a normal embedding model.
With a normal single-vector embedding model, you divide a document into 250-500 token chunks. When you query, you get back an entire chunk along with a relevance score. The chunk is the smallest unit you can retrieve.
Compare that with late interaction models. You still chunk up the document, but instead of getting back a single vector, you get one vector per input token. When you query, you get a score for each token. So you can pinpoint which parts of the matching document were most important. When I’m formatting the RAG results to include into the prompt, I use these scores to locate the single token with the highest relevance, and include several tokens around that as context.
But you can also get a single score per document: just pool (average) all the token vectors together into one vector. I had to break querying into two stages because query-time performance was too slow. I start with very large chunks, 32K tokens, then pool them into roughly 100-token chunks (one vector each) and store those in the index. Then I do the full multi-vector scoring on only the top 100 hits.
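The two-stage query can be sketched like this, under the assumption that pooling means mean-averaging token vectors (function names are mine; a real index would use an ANN library rather than brute-force loops):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vecs):
    """Average a list of token vectors into a single vector."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def maxsim(query_vecs, doc_vecs):
    """Full multi-vector score: sum over query tokens of best doc-token match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def two_stage_query(query_vecs, docs, rerank_top=100):
    """docs: one list of token vectors per document.
    Stage 1: cheap pooled-vector scoring to get a shortlist.
    Stage 2: exact multi-vector scoring on the shortlist only.
    Returns document indices, best first."""
    pooled_q = mean_pool(query_vecs)
    coarse = sorted(range(len(docs)),
                    key=lambda i: dot(pooled_q, mean_pool(docs[i])),
                    reverse=True)[:rerank_top]
    return sorted(coarse,
                  key=lambda i: maxsim(query_vecs, docs[i]),
                  reverse=True)
```

Stage 1 costs one dot product per pooled vector; the expensive token-by-token scoring only ever touches the shortlist.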
Parallel retrieval agents
3fz on bluesky is doing the same thing, but in a more sophisticated way. Her agent runs a subconscious background thread alongside the main model. It mines an experiential vector DB and injects what it finds on top of the live context.
The two are racing. If the cross-encoder reranker beats the main model, the injection lands after the current tool call (the prefill switch is a convenient hook). If it loses, the injection slips to the next tool call. Sometimes it returns nothing. That’s the design — injection is conservative on purpose.
This sits on top of a more traditional stack: self-managed memory blocks, an initial retrieval pass at each user turn, plus a second LLM kept warm to extract atomic memories from the agent’s experience as it runs.
The framing she uses is spontaneous recall — surfacing unknown unknowns near wherever the conversation has drifted, things the agent wouldn’t have known to search for. Inspired by human cognition.
Mine is the dumb-and-synchronous version: every tool call, block and query. Hers parallelizes and gracefully drops the slow ones. Probably the right move once the index gets big.
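To make the race concrete, here’s a toy asyncio sketch of my reading of that pattern — my own reconstruction, not her code. Retrieval runs in the background; its result is only injected if it finishes before the current tool call does, otherwise it slips (here, it’s simply dropped):

```python
import asyncio

async def background_retrieval(delay, result):
    # Stand-in for the subconscious thread mining the vector DB.
    await asyncio.sleep(delay)
    return result

async def tool_call_with_race(tool_coro, retrieval_task):
    """Run the tool call; inject retrieval output only if it already finished."""
    tool_result = await tool_coro
    injection = retrieval_task.result() if retrieval_task.done() else None
    return tool_result, injection

async def demo():
    # Retrieval wins the race: its result lands with this tool call.
    fast = asyncio.create_task(background_retrieval(0.01, "memory hit"))
    won = await tool_call_with_race(asyncio.sleep(0.1, result="tool ok"), fast)

    # Retrieval loses the race: nothing is injected this time.
    slow = asyncio.create_task(background_retrieval(0.2, "late hit"))
    lost = await tool_call_with_race(asyncio.sleep(0.01, result="tool ok"), slow)
    slow.cancel()
    return won, lost
```

The nothing-injected path costs nothing, which is what makes the conservative-injection design cheap to run alongside every tool call.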
Conclusion
I think there are a lot more ideas like these to find. We’re still early in agent design. The important part is that the single thread handling the main task isn’t also responsible for stopping its line of thought to query its own memory in lock-step.
This feels like information theory at work. Our own brains, like CPU architectures, discovered that it’s hard to do two or more things at once. It really feels like there’s some law dictating that high-quality associative memory needs to happen out of band, otherwise it distracts from the task at hand.
I’m excited to see more of these.