⚠️ Readable Paper Alert ⚠️
BLT: what if we just got rid of tokenization?
Result:
* text looks a lot like audio, video, PDF, it’s all just bytes
* dynamically reduce compute based on difficulty
* new scaling axis (patch size)
ai.meta.com/research/pub...
This is a readable paper, not “very readable”. You’ll still have to skip large parts unless you’re familiar with transformers. But the abstract & sections 1-2 are easily digestible and contain lots of mind-blowing statements
BLT diverges from the standard transformer architecture: it actually has 3 transformers. It also carries hidden state over between iterations, whereas a standard transformer converts all state back into tokens. Even so, the hidden state still goes through attention
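to make that layout concrete, here’s a minimal structural sketch in PyTorch: a small local encoder over bytes, a big latent transformer over patches, and a small local decoder back to bytes. The names, sizes, mean-pooling, and patch_ids plumbing are my own simplifications, not the paper’s cross-attention mechanism

```python
# Minimal structural sketch of BLT's three-transformer layout -- an illustration,
# not the paper's implementation. The mean-pooling step stands in for the paper's
# cross-attention patching; module sizes are arbitrary.
import torch
import torch.nn as nn

class ByteLatentSketch(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)  # one embedding per possible byte value
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.local_encoder = nn.TransformerEncoder(layer, num_layers=2)       # small, runs over bytes
        self.latent_transformer = nn.TransformerEncoder(layer, num_layers=8)  # big, runs over patches
        self.local_decoder = nn.TransformerEncoder(layer, num_layers=2)       # small, runs over bytes
        self.to_logits = nn.Linear(d_model, 256)

    def forward(self, byte_ids: torch.Tensor, patch_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids:  (batch, n_bytes) raw byte values 0..255
        # patch_ids: (batch, n_bytes) which patch each byte belongs to
        #            (sketch assumes every sequence in the batch shares one patching)
        h = self.local_encoder(self.byte_embed(byte_ids))
        ids = patch_ids[0]
        # stand-in for the paper's cross-attention pooling: mean-pool the bytes of each patch
        patches = torch.stack(
            [h[:, ids == p, :].mean(dim=1) for p in range(int(ids.max()) + 1)], dim=1
        )
        latent = self.latent_transformer(patches)   # the big model only sees one position per patch
        expanded = latent[:, ids, :]                # scatter patch states back to byte positions
        return self.to_logits(self.local_decoder(h + expanded))  # next-byte logits

byte_ids = torch.randint(0, 256, (1, 32))
patch_ids = (torch.arange(32) // 4).unsqueeze(0)        # pretend every 4 bytes form a patch
print(ByteLatentSketch()(byte_ids, patch_ids).shape)    # torch.Size([1, 32, 256])
```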
This paper is important because BLT matched or exceeded the performance of Llamas trained on the same dataset
it works as advertised
one big area where it excels is the boring parts. Notice that on the CUTE benchmarks here, BLT excels at mundane text processing: on spelling, spelling inverse, and substitute word, BLT performs near-perfectly where the Llamas were barely functional
S-T-R-A-W-B-E-R-R-Y
All those dumb problems where people blamed LLM mistakes on tokenization? Those should generally be addressed by the BLT architecture
e.g. “which is greater, 9.11 or 9.9?”
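for illustration only, here’s the difference in what the model actually sees: a byte-level model gets every digit of “9.11” explicitly, while a BPE tokenizer hands it opaque token IDs whose split depends on the vocabulary (tiktoken and cl100k_base are just convenient examples, not anything from the paper)

```python
# Illustration only: how "9.11" looks to a byte-level model vs. a BPE tokenizer.
# tiktoken / cl100k_base are example choices; the exact token split depends on the vocab.
import tiktoken  # pip install tiktoken

s = "9.11"
print(list(s.encode("utf-8")))  # byte view: [57, 46, 49, 49] -- one byte per character

enc = tiktoken.get_encoding("cl100k_base")
print([(t, enc.decode([t])) for t in enc.encode(s)])  # subword view: opaque merged chunks
```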
BLT encodes bytes into dynamically sized patches
so 10,000 space characters still won’t fill up a single patch because they never exceed the entropy threshold, but other times a single byte will be its own patch
it seems that compressed files would be more intense to process, since they’re inherently high entropy 🤔
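here’s a tiny sketch of that patching rule, assuming you already have a small byte-level model that gives you next-byte entropy at each position (the paper trains a separate little model for this; `toy_entropy` below is a made-up stand-in)

```python
# Sketch of entropy-threshold patching. `next_byte_entropy` stands in for the small
# byte-level LM the paper uses; `toy_entropy` below is a made-up toy version.
from typing import Callable, List

def patch_starts(data: bytes,
                 next_byte_entropy: Callable[[bytes, int], float],
                 threshold: float) -> List[int]:
    """Start a new patch whenever the model is 'surprised', i.e. the predicted
    next-byte entropy exceeds the threshold. Returns patch start indices."""
    starts = [0]
    for i in range(1, len(data)):
        if next_byte_entropy(data, i) > threshold:
            starts.append(i)
    return starts

def toy_entropy(data: bytes, i: int) -> float:
    # toy rule: a space following a space is perfectly predictable; anything else is "surprising"
    return 0.0 if data[i] == data[i - 1] == ord(" ") else 3.0

text = b"hello" + b" " * 10_000 + b"world"
print(len(patch_starts(text, toy_entropy, threshold=1.0)))
# -> 11 patches: each letter gets its own, but all 10,000 spaces collapse into one patch
```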
that opens a new scaling axis — change the entropy threshold to increase patch size
they find that patch size can be scaled up along with model & data size. The cool part is that bigger patches mean fewer iterations through the model, so less compute for better performance
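back-of-envelope on why that matters (made-up numbers, only the proportionality is the point): the big latent transformer runs once per patch, so the average patch size directly divides how many steps it takes

```python
# Made-up numbers; the point is just that n_patches = n_bytes / avg_patch_size,
# so raising the entropy threshold (bigger patches) means fewer latent-transformer steps.
n_bytes = 1_000_000
for avg_patch_size in (4, 6, 8):
    n_patches = n_bytes / avg_patch_size
    print(f"avg patch {avg_patch_size} bytes -> {n_patches:,.0f} steps through the big latent model")
```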
“A critical difference between patches and tokens is that with tokens, the model has no access to underlying byte features”
in other words, tokens insert a layer of indirection, and by removing it we can improve performance
i’m curious whether this becomes a larger trend. If removing tokenization is a huge success, we’ll get models that natively process text, audio, image and video. Then we’ll realize the model is extremely sensitive to image & video formats, and the cycle continues, eliminating those too
all in all, BLT seems like a crucial paper and i think we may start seeing new models based on it. keep your eyes peeled