new 3-token attention reduces pre-training data requirements
the pre-training scaling laws dictated that you have to scale up model size, data and compute in tandem. But this new method means you can double the model size without doubling the data
arxiv.org/abs/2507.02754
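for reference, a minimal sketch of the Chinchilla-style scaling-law form this is pointing at; the symbols below are the generic textbook form, not the paper's fitted constants or exponents:

```latex
% generic Chinchilla-style pre-training loss in terms of
% parameter count N and training tokens D (placeholder form,
% not the paper's fitted scaling law):
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% to keep both terms shrinking you normally grow N and D together;
% the claim here is that 2-simplicial attention makes the data term
% fall faster, so a bigger N no longer forces a matching bump in D
```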
in traditional attention, every token is compared to every other token
in this new 2-simplicial attention, it’s a 3-way comparison
the net effect is it does more work in one attention layer
“token j is a function name and token k is an opening parenthesis” can be discovered in a single layer
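roughly what that 3-way comparison looks like as a naive PyTorch reference, next to ordinary pairwise attention. this is not the paper's Triton kernel; the two key/value streams, the 1/d scaling, and the way the two value streams get combined are assumptions made for illustration:

```python
import torch

def bilinear_attention(q, k, v):
    # standard attention: one score per (i, j) pair, n^2 entries
    scores = torch.einsum("id,jd->ij", q, k) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def two_simplicial_attention(q, k1, k2, v1, v2):
    # 2-simplicial attention: one trilinear score per (i, j, k)
    # triple, so n^3 entries if computed naively (no causal mask
    # here, purely a sketch)
    scores = torch.einsum("id,jd,kd->ijk", q, k1, k2) / q.shape[-1]
    n = q.shape[0]
    # softmax jointly over all (j, k) pairs for each query i
    attn = torch.softmax(scores.reshape(n, -1), dim=-1).reshape(n, n, n)
    # combine the two value streams elementwise, then aggregate
    # (this particular value combination is an assumption, not
    #  necessarily the paper's exact form)
    return torch.einsum("ijk,jd,kd->id", attn, v1, v2)

# toy usage: 8 tokens, 16-dim heads
n, d = 8, 16
q, k1, k2, v1, v2 = (torch.randn(n, d) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)  # torch.Size([8, 16])
```

the score tensor is n×n×n here, which is exactly the cost question the next post gets at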
does this make attention require more GPU time?
it would, but no: the paper also ships a new Triton kernel with windowing tricks that bring it back to a status quo amount of compute (in terms of algorithmic complexity)
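a toy illustration of the windowing idea, definitely not the actual Triton kernel; the window sizes w1/w2 and this causal-window layout are assumed for the sketch:

```python
import torch

def windowed_two_simplicial_scores(q, k1, k2, w1=4, w2=4):
    # each query i only looks at the last w1 positions in k1 and the
    # last w2 positions in k2, so the score tensor has n*w1*w2 entries
    # instead of n^3; entries left at -inf would vanish under a softmax
    n, d = q.shape
    scores = torch.full((n, w1, w2), float("-inf"))
    for i in range(n):
        j_lo, k_lo = max(0, i - w1 + 1), max(0, i - w2 + 1)
        kj = k1[j_lo : i + 1]  # up to w1 recent keys from stream 1
        kk = k2[k_lo : i + 1]  # up to w2 recent keys from stream 2
        s = torch.einsum("d,jd,kd->jk", q[i], kj, kk) / d
        scores[i, : s.shape[0], : s.shape[1]] = s
    return scores

n, d = 16, 8
q, k1, k2 = (torch.randn(n, d) for _ in range(3))
print(windowed_two_simplicial_scores(q, k1, k2).shape)  # (16, 4, 4)
```

with fixed windows the score tensor grows linearly in sequence length instead of cubically, which is how the cubic blow-up gets avoided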