TITANS & MIRAS: real continual learning
MIRAS = a unifying theory of transformers (attention) and state space models (SSM, e.g. Mamba, RNNs)
TITANS = an optimal MIRAS implementation that’s “halfway between” SSM & transformer with a CL memory module
let’s dive in!
research.google/blog/titans-...
TITANS
this introduces a “continual learning” module, which is a whole new neural net
the NN processes the full input, while the transformer (with regular attention) processes it too
but the transformer also receives the output of the memory module
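here's a minimal sketch (PyTorch) of that wiring, assuming the memory module's output is simply handed to attention as extra context — the layer sizes, the concatenation scheme, and all hyperparameters are illustrative guesses, not the paper's exact architecture:

```python
# sketch: a separate "memory" net reads the input, and attention sees
# both the raw tokens and what the memory produced
import torch
import torch.nn as nn

class MemoryModule(nn.Module):
    """A small MLP standing in for the learnable long-term memory."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 2 * dim), nn.SiLU(), nn.Linear(2 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MemoryAugmentedBlock(nn.Module):
    """Attention block that also receives the memory module's output."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.memory = MemoryModule(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        retrieved = self.memory(x)                 # memory processes the input
        ctx = torch.cat([retrieved, x], dim=1)     # memory output appended as extra context
        out, _ = self.attn(x, ctx, ctx)            # regular attention over input + memory output
        return out

x = torch.randn(1, 16, 64)                         # (batch, tokens, dim)
print(MemoryAugmentedBlock(64)(x).shape)           # torch.Size([1, 16, 64])
```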
transformers can consume input the size of entire books, and attention works astoundingly well to recall the right parts
but their capabilities drop precipitously as the input increases
TITANS chunks the input into small episodes and updates itself between them
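roughly what that episode loop could look like — a minimal sketch assuming the memory is updated with one plain gradient step on a simple reconstruction loss between chunks; the chunk size, loss, and learning rate are placeholders, not the paper's choices:

```python
# sketch: walk through a very long input in small episodes,
# letting the memory update itself between episodes
import torch
import torch.nn as nn

dim, chunk_len = 64, 8
memory = nn.Linear(dim, dim)                  # stand-in for the memory module
opt = torch.optim.SGD(memory.parameters(), lr=1e-2)

long_input = torch.randn(10_000, dim)         # one book-sized sequence
for start in range(0, long_input.shape[0], chunk_len):
    episode = long_input[start:start + chunk_len]
    # ... transformer attends over this episode (omitted) ...
    # between episodes, the memory adjusts itself toward what it just saw
    loss = (memory(episode) - episode).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```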
the thing is, TITANS works crazy well
it’s getting recall at 10M tokens that’s on par with what counts as SOTA at <1M tokens
TITANS uses “surprise” as a method of deciding what to remember
TITANS is framed here as being for long context, but it really is for continual learning
yes, one long input can be chunked into episodes, or it can just be on-the-job learning day after day
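one hedged way to picture surprise-driven remembering: approximate a token's surprise by the memory's prediction error on it, and scale the write by that surprise. the paper's momentum and forgetting terms are left out here; names and constants are illustrative, not the actual method:

```python
# sketch: surprising tokens (high prediction error) cause bigger memory writes
import torch
import torch.nn as nn

dim = 64
memory = nn.Linear(dim, dim)                      # stand-in memory module

def surprise_update(token: torch.Tensor, base_lr: float = 1e-2) -> float:
    loss = (memory(token) - token).pow(2).mean()  # how badly memory predicts this token
    memory.zero_grad()
    loss.backward()
    surprise = loss.item()                        # high error == surprising == worth remembering
    with torch.no_grad():
        for p in memory.parameters():
            p -= base_lr * surprise * p.grad      # scale the write by the surprise
    return surprise

print(surprise_update(torch.randn(1, dim)))
```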
the MIRAS paper is a theoretical breakthrough: anything with these 4 ingredients fits the framework
1. updatable memory
2. attention bias
3. retention gate
4. memory algorithm
they show how both transformers & SSMs implement this framework, and it helped them discover a more optimal TITANS (toy sketch of the four pieces below)
arxiv.org/abs/2504.13173
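to make the four ingredients concrete, here's a toy mapping onto code — the specific choices (an L2 attention bias, a scalar retention gate, a single gradient step as the memory algorithm) are my illustrative assumptions, not the paper's recommended instantiation:

```python
# sketch: the four MIRAS ingredients as a tiny memory cell
import torch
import torch.nn as nn

class MirasStyleMemory(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.M = nn.Linear(dim, dim, bias=False)   # 1. updatable memory
        self.alpha = 0.9                           # 3. retention gate (how much old memory to keep)
        self.lr = 1e-2

    def attention_bias(self, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # 2. attention bias: the objective the memory is pushed toward (here, plain L2)
        return (self.M(k) - v).pow(2).mean()

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # 4. memory algorithm: how that objective gets optimized (here, one gradient step)
        loss = self.attention_bias(k, v)
        grad = torch.autograd.grad(loss, self.M.weight)[0]
        with torch.no_grad():
            self.M.weight.mul_(self.alpha)         # retention gate: decay old content
            self.M.weight.sub_(self.lr * grad)     # memory algorithm: write new content

    def read(self, q: torch.Tensor) -> torch.Tensor:
        return self.M(q)

mem = MirasStyleMemory(32)
k, v = torch.randn(4, 32), torch.randn(4, 32)
mem.write(k, v)
print(mem.read(k).shape)                           # torch.Size([4, 32])
```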