not enough is being said about DeepSeek’s multi token prediction (MTP)
They were able to get Sonnet-level performance with less training data than Llama 3.3 70B
Does that mean scaling isn't over? (if we can just be more efficient) Also, does that mean we can train an LLM fully on properly licensed content?
the vibe i get is that MTP is mainly just useful during training, but it's a huge signal booster that lets the model pick up on the real signal in the text a lot faster
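for anyone curious what that looks like concretely, here's a rough sketch of the MTP idea in PyTorch. this is not DeepSeek's exact setup (V3 chains small sequential transformer modules rather than plain extra heads), and the head layout and the 0.3 auxiliary weight are just illustrative assumptions:

```python
# Rough sketch of a multi-token-prediction (MTP) auxiliary loss.
# Not DeepSeek's exact formulation; this only shows the core idea:
# the same hidden states are also asked to predict tokens further
# ahead, which gives a denser training signal per position.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, depth: int = 2):
        super().__init__()
        # one linear head per future offset: +1 (ordinary next token), +2, ...
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(depth)
        )

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor, mtp_weight: float = 0.3):
        # hidden: [batch, seq, d_model], tokens: [batch, seq] (long)
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])   # predict the token k steps ahead
            target = tokens[:, k:]          # targets shifted by k
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
            # full weight on the normal next-token loss, down-weighted extras
            total = total + (loss if k == 1 else mtp_weight * loss)
        return total
```

the extra heads mean every position contributes several loss terms instead of one, which is where the "signal booster" effect comes from; at inference you can drop them (or, as the DeepSeek-V3 report notes, reuse them for speculative decoding)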