Bluesky Thread

⚠️ Readable Paper Alert ⚠️

BLT: what if we just got rid of tokenization?

Result:

* text looks a lot like audio, video, or PDF: it’s all just bytes
* dynamically reduce compute based on difficulty
* new scaling axis (patch size)

ai.meta.com/research/pub...
Byte Latent Transformer: Patches Scale Better Than Tokens | Research - AI at Meta
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at...
This is a readable paper, not a “very readable” one. You’ll still have to skip large parts unless you’re familiar with transformers, but the abstract & sections 1-2 are easily digestible and contain lots of mind-blowing statements
BLT diverges from the standard transformer architecture: it actually has three transformers (a small local byte encoder, a large latent transformer over patches, and a small local byte decoder). It also carries hidden state between iterations, whereas standard transformers convert all state into tokens. Even that hidden state still goes through attention
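To make “three transformers” concrete, here’s a rough PyTorch sketch of the data flow: a small local encoder over raw bytes, a big latent transformer over patch representations, and a small local decoder back to bytes. The module names, sizes, mean-pooling, and broadcast step are my own simplifications (the paper moves between byte and patch states with cross-attention, and uses causal masks that are omitted here), so treat this as a shape sketch, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Stack of vanilla encoder layers; causal masking omitted for brevity."""
    def __init__(self, dim, layers):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):
        return self.encoder(x)

class BLTSketch(nn.Module):
    def __init__(self, byte_dim=256, latent_dim=512):
        super().__init__()
        self.byte_embed = nn.Embedding(256, byte_dim)        # raw bytes, no tokenizer
        self.local_encoder = TinyTransformer(byte_dim, 2)    # small, runs per byte
        self.to_latent = nn.Linear(byte_dim, latent_dim)
        self.latent_transformer = TinyTransformer(latent_dim, 8)  # big, runs per patch
        self.to_byte = nn.Linear(latent_dim, byte_dim)
        self.local_decoder = TinyTransformer(byte_dim, 2)    # small, runs per byte
        self.head = nn.Linear(byte_dim, 256)                 # next-byte logits

    def forward(self, byte_ids, patch_bounds):
        # byte_ids: (batch, seq); patch_bounds: list of (start, end) covering the sequence
        h = self.local_encoder(self.byte_embed(byte_ids))
        # pool byte states into one vector per patch (the paper uses cross-attention)
        patches = torch.stack([h[:, s:e].mean(dim=1) for s, e in patch_bounds], dim=1)
        latent = self.latent_transformer(self.to_latent(patches))
        # broadcast each patch's latent state back over its bytes for decoding
        expanded = torch.cat(
            [self.to_byte(latent[:, i:i + 1]).expand(-1, e - s, -1)
             for i, (s, e) in enumerate(patch_bounds)], dim=1)
        return self.head(self.local_decoder(h + expanded))

x = torch.randint(0, 256, (1, 12))
out = BLTSketch()(x, [(0, 5), (5, 6), (6, 12)])  # three patches of different sizes
print(out.shape)  # torch.Size([1, 12, 256]) -> next-byte logits per byte position
```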
This paper is important because they matched or exceeded the performance of Llamas trained on the same dataset

it works as advertised
one big area where it excelled is the boring parts. Notice the CUTE benchmarks here have BLT excelling at mundane text processing: on spelling, spelling inverse, and substitute word, BLT performs near-perfectly where the Llamas were barely functional

S-T-R-A-W-B-E-R-R-Y
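A quick illustration of why spelling gets easy when the model sees bytes (the BPE split below is made up for illustration; real tokenizers split differently, but the point stands):

```python
# A byte-level model's input literally contains every letter.
word = "strawberry"
print(list(word.encode("utf-8")))
# [115, 116, 114, 97, 119, 98, 101, 114, 114, 121] -> one unit per letter,
# so "how many r's?" is counting things the model can actually see.

# A token-based model sees opaque IDs for chunks like these (illustrative split);
# the letters inside each chunk are never shown to it directly, so spelling
# depends on what it happened to memorize about those token IDs.
bpe_view = ["straw", "berry"]
```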
All those dumb problems where people blamed LLM mistakes on tokenization? Those should generally be addressed by the BLT architecture

e.g. “which is greater, 9.11 or 9.9?”
[GIF: a sandwich with lettuce, tomato and bacon floating in the air]
BLT encodes bytes into dynamically sized patches

so, 10,000 space characters still won’t overflow a single patch, because they never exceed the entropy threshold, but other times a single byte will be its own patch

it seems that compressed files would be more intense, since compressed data is inherently high entropy 🤔
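Here’s a toy sketch of that entropy-threshold patching rule: a small byte-level LM predicts the next byte, and whenever the prediction’s entropy crosses a threshold, a new patch starts. The `next_byte_probs` stand-in, its probabilities, and the threshold are all invented so the example runs; the paper trains a real entropy model (and also describes a variant based on the change in entropy rather than a global threshold).

```python
import math

def next_byte_probs(context: bytes) -> list[float]:
    # Stand-in for the paper's small entropy model: after a space, another
    # space is near-certain (low entropy); otherwise pretend max uncertainty.
    if context.endswith(b" "):
        return [0.99 if b == ord(" ") else 0.01 / 255 for b in range(256)]
    return [1 / 256] * 256

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def patch(data: bytes, threshold: float = 2.0):
    patches, start = [], 0
    for i in range(1, len(data)):
        # if the next byte is hard to predict, it starts a new patch
        if entropy(next_byte_probs(data[:i])) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

print(len(patch(b" " * 10000)))  # 1  -> a long low-entropy run stays one patch
print(len(patch(b"x7#qQ!")))     # 6  -> unpredictable bytes: ~one patch per byte
```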
that opens a new scaling axis: raise the entropy threshold to increase the average patch size

they find that patch size can be scaled up along with model & data size. The cool part is that bigger patches mean fewer steps through the big latent transformer, so you get less compute for better performance
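Some rough arithmetic on why that tradeoff works (the FLOP numbers below are made up purely to show the shape of it, not taken from the paper): the expensive latent transformer runs once per patch, so doubling the average patch size roughly halves its share of the compute, and that budget can go into a bigger latent model instead.

```python
# Made-up cost model: only the shape of the tradeoff matters here.
text_bytes = 1_000_000
flops_per_latent_step = 1e9   # big latent transformer, paid once per patch
flops_per_byte_local = 1e6    # small local encoder/decoder, paid once per byte

for avg_patch_size in (4, 6, 8):
    latent_steps = text_bytes / avg_patch_size
    total = latent_steps * flops_per_latent_step + text_bytes * flops_per_byte_local
    print(f"avg patch size {avg_patch_size}: {total:.2e} FLOPs")
# going from patch size 4 to 8 roughly halves the dominant latent-model cost
```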
“A critical difference between patches and tokens is that with tokens, the model has no access to underlying byte features”

in other words, tokens insert a layer of indirection, and by removing it we can improve performance
i’m curious whether this is going to be a larger trend. If removing tokenization is a huge success, we’ll have models that natively process text, audio, image, and video bytes; then we’ll realize the model is extremely sensitive to image & video encodings, and the cycle continues, eliminating those too
all in all, BLT seems like a crucial paper and i think we may start seeing new models based on it. keep your eyes peeled