Bluesky Thread

not enough is being said about DeepSeek’s multi token prediction (MTP)

They got Sonnet-level performance with less training data than Llama 3.3 70B

Does that mean scaling isn’t over (if we can just be more efficient)? And does that mean we could train an LLM entirely on properly licensed content?
the vibe i get is that MTP is mainly just useful during training, but it’s a huge signal booster that lets the model pick up on the real signal in the text a lot faster
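To make that "signal booster" idea concrete, here's a minimal toy sketch of the MTP training objective: instead of one next-token head, the model gets k heads, each predicting the token 1..k steps ahead, and their cross-entropies are averaged so every position yields k learning signals per step. This is a hypothetical simplification (random features, plain linear heads, toy sizes), not DeepSeek's actual architecture, which chains small transformer modules per depth.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, k = 50, 8, 2  # toy sizes, not DeepSeek's real config

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# one projection head per future offset (assumption: simple linear heads)
heads = [rng.normal(size=(hidden, vocab)) * 0.1 for _ in range(k)]

def mtp_loss(hidden_states, tokens):
    """Average cross-entropy over k future-token heads.

    hidden_states: (T, hidden) per-position features
    tokens:        (T,) token ids
    """
    total, count = 0.0, 0
    for offset, W in enumerate(heads, start=1):
        # only positions that still have a target `offset` steps ahead
        h = hidden_states[:-offset]
        targets = tokens[offset:]
        probs = softmax(h @ W)
        total += -np.log(probs[np.arange(len(targets)), targets]).sum()
        count += len(targets)
    return total / count

T = 10
loss = mtp_loss(rng.normal(size=(T, hidden)), rng.integers(0, vocab, size=T))
print(round(loss, 3))
```

At inference time the extra heads can simply be dropped (or reused for speculative decoding), which is why the benefit shows up mainly during training.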
52 likes 6 reposts
