not enough is being said about DeepSeek’s multi token prediction (MTP)
They were able to get Sonnet-level performance with less training data than Llama 3.3 70B
Does that mean scaling isn't over? (if we can just be more efficient) Also, does that mean we can train an LLM fully on properly licensed content?
the vibe i get is that MTP is mainly just useful during training, but it's a huge signal booster that lets the model pick up on the real signal in the text a lot faster
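for anyone curious what that looks like concretely, here's a rough sketch of the MTP idea in PyTorch. this is not DeepSeek's exact setup (V3 chains small sequential transformer modules rather than plain extra heads), and the head layout and the 0.3 auxiliary weight are just illustrative assumptions:

```python
# Rough sketch of a multi-token-prediction (MTP) auxiliary loss.
# Not DeepSeek's exact formulation; this only shows the core idea:
# the same hidden states are also asked to predict tokens further
# ahead, which gives a denser training signal per position.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, depth: int = 2):
        super().__init__()
        # one linear head per future offset: +1 (ordinary next token), +2, ...
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(depth)
        )

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor, mtp_weight: float = 0.3):
        # hidden: [batch, seq, d_model], tokens: [batch, seq] (long)
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])   # predict the token k steps ahead
            target = tokens[:, k:]          # targets shifted by k
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
            # full weight on the normal next-token loss, down-weighted extras
            total = total + (loss if k == 1 else mtp_weight * loss)
        return total
```

the extra heads mean every position contributes several loss terms instead of one, which is where the "signal booster" effect comes from; at inference you can drop them (or, as the DeepSeek-V3 report notes, reuse them for speculative decoding)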