new 3-token attention reduces pre-training data requirements
the pre-training scaling laws dictated that you have to scale up model size, data and compute in tandem. But this new method means you can double the model size without doubling the data
arxiv.org/abs/2507.02754
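for reference, a minimal sketch of the Chinchilla-style scaling-law form this is pointing at; the symbols below are the generic textbook form, not the paper's fitted constants or exponents:

```latex
% generic Chinchilla-style pre-training loss in terms of
% parameter count N and training tokens D (placeholder form,
% not the paper's fitted scaling law):
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% to keep both terms shrinking you normally grow N and D together;
% the claim here is that 2-simplicial attention makes the data term
% fall faster, so a bigger N no longer forces a matching bump in D
```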
in traditional attention, every token is compared to every other token
in this new 2-simplicial attention, it’s a 3-way comparison
the net effect is it does more work in one attention layer
“token j is a function name and token k is an opening parenthesis” can be discovered in a single layer
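roughly what that 3-way comparison looks like as a naive PyTorch reference, next to ordinary pairwise attention. this is not the paper's Triton kernel; the two key/value streams, the 1/d scaling, and the way the two value streams get combined are assumptions made for illustration:

```python
import torch

def bilinear_attention(q, k, v):
    # standard attention: one score per (i, j) pair, n^2 entries
    scores = torch.einsum("id,jd->ij", q, k) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def two_simplicial_attention(q, k1, k2, v1, v2):
    # 2-simplicial attention: one trilinear score per (i, j, k)
    # triple, so n^3 entries if computed naively (no causal mask
    # here, purely a sketch)
    scores = torch.einsum("id,jd,kd->ijk", q, k1, k2) / q.shape[-1]
    n = q.shape[0]
    # softmax jointly over all (j, k) pairs for each query i
    attn = torch.softmax(scores.reshape(n, -1), dim=-1).reshape(n, n, n)
    # combine the two value streams elementwise, then aggregate
    # (this particular value combination is an assumption, not
    #  necessarily the paper's exact form)
    return torch.einsum("ijk,jd,kd->id", attn, v1, v2)

# toy usage: 8 tokens, 16-dim heads
n, d = 8, 16
q, k1, k2, v1, v2 = (torch.randn(n, d) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)  # torch.Size([8, 16])
```

the score tensor is n×n×n here, which is exactly the cost question the next post gets at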
does this make attention require more GPU time?
it would, but no: the paper also ships a new Triton kernel with windowing tricks that bring it back to a status quo amount of compute (in terms of algorithmic complexity)
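a toy illustration of the windowing idea, definitely not the actual Triton kernel; the window sizes w1/w2 and this causal-window layout are assumed for the sketch:

```python
import torch

def windowed_two_simplicial_scores(q, k1, k2, w1=4, w2=4):
    # each query i only looks at the last w1 positions in k1 and the
    # last w2 positions in k2, so the score tensor has n*w1*w2 entries
    # instead of n^3; entries left at -inf would vanish under a softmax
    n, d = q.shape
    scores = torch.full((n, w1, w2), float("-inf"))
    for i in range(n):
        j_lo, k_lo = max(0, i - w1 + 1), max(0, i - w2 + 1)
        kj = k1[j_lo : i + 1]  # up to w1 recent keys from stream 1
        kk = k2[k_lo : i + 1]  # up to w2 recent keys from stream 2
        s = torch.einsum("d,jd,kd->jk", q[i], kj, kk) / d
        scores[i, : s.shape[0], : s.shape[1]] = s
    return scores

n, d = 16, 8
q, k1, k2 = (torch.randn(n, d) for _ in range(3))
print(windowed_two_simplicial_scores(q, k1, k2).shape)  # (16, 4, 4)
```

with fixed windows the score tensor grows linearly in sequence length instead of cubically, which is how the cubic blow-up gets avoided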