Longcat-Flash-Chat (560B)
uh, holy shit, this one is intriguing. at bare minimum they compare themselves to all the (actual) top models and do okay
but inside... damn, this one has some cool ideas
huggingface.co/meituan-long...
most interesting — dynamic computation
not only is it a fairly sparse MoE, each token can receive dynamically more compute via a PID controller for bias adjustment 🤯
so when it hits a token that needs extra thought, the router just allocates it more experts, i.e. more compute
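rough sketch of how I read the bias-adjustment part (illustrative names, not their code): a PID loop watches the average compute activated per token and nudges a bias on the router logits so the average stays near a budget, while individual tokens stay free to use more or less

```python
# rough sketch of the idea, not their code: a PID loop nudges a bias on the
# router logits so the *average* compute activated per token stays near a
# target budget, while individual tokens can still get more or fewer experts.

class RouterBiasPID:
    def __init__(self, target_avg_experts: float, kp: float = 0.01,
                 ki: float = 0.001, kd: float = 0.0):
        self.target = target_avg_experts   # compute budget, in experts per token
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.bias = 0.0                    # added to the router logits

    def update(self, observed_avg_experts: float) -> float:
        # tokens using more compute than budgeted -> error goes negative and the
        # bias drifts down (toward cheaper routing); using less -> it drifts up
        error = self.target - observed_avg_experts
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.bias += self.kp * error + self.ki * self.integral + self.kd * derivative
        return self.bias

# each training step: measure the mean experts activated per token, update the bias
pid = RouterBiasPID(target_avg_experts=8.0)
new_bias = pid.update(observed_avg_experts=8.6)
```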
how are Chinese labs cutting their dependence on NVIDIA? like this:
run experiments on tiny models, then transfer the hyperparameters (the result of those experiments) to a far larger model for the yolo run (rough sketch of the idea below)
bsky.app/profile/timk...
DeepSeek is reducing their dependence on NVIDIA
they do small scale training runs & experiments on Huawei Ascend, but yolo runs on NVIDIA
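to make the tiny-model to big-model transfer above concrete, here's a minimal sketch of width-based hyperparameter transfer (muP-style scaling); I'm assuming this flavour of transfer rather than quoting LongCat's exact rules, and the numbers are made up

```python
# minimal sketch of "tune small, run big" via width-based scaling (muP-style);
# an assumed flavour of transfer, not LongCat's exact recipe. Numbers are made up.

def transfer_lr(lr_small: float, width_small: int, width_large: int) -> float:
    """Scale a learning rate tuned on a narrow proxy model by the width ratio."""
    return lr_small * (width_small / width_large)

def transfer_init_std(std_small: float, width_small: int, width_large: int) -> float:
    """Shrink the init standard deviation as width grows (variance ~ 1/width)."""
    return std_small * (width_small / width_large) ** 0.5

# hyperparameters found by sweeping a tiny proxy model...
proxy_lr, proxy_std, proxy_width = 3e-3, 0.02, 512
target_width = 8192  # ...carried over to the full-size yolo run
print(transfer_lr(proxy_lr, proxy_width, target_width))        # 0.0001875
print(transfer_init_std(proxy_std, proxy_width, target_width)) # 0.005
```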
oh this took me too long to figure out — the "zero computation experts"
they have a (mostly) regular MoE router, but some of the experts are actually nothing at all. So the MoE router sometimes entirely skips experts
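a minimal sketch of how I read that (my reading of the paper, not their code), assuming the zero-computation experts just hand the token back unchanged:

```python
# minimal sketch, assuming zero-computation experts are identity slots that
# return the token unchanged; names and sizes are illustrative, not theirs.
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    def __init__(self, dim: int, n_real: int, n_zero: int, top_k: int = 2):
        super().__init__()
        self.top_k, self.n_real = top_k, n_real
        # real experts are FFNs; zero-computation experts have no parameters at all
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_real)
        )
        self.router = nn.Linear(dim, n_real + n_zero)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            # router indices >= n_real are zero-computation experts: no FFN work,
            # the token is passed through as-is for that slot
            zmask = idx[:, slot] >= self.n_real
            if zmask.any():
                out[zmask] += weights[zmask, slot].unsqueeze(-1) * x[zmask]
        return out
```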
the "shortcut-connected MoE" part is solving a more complex problem than it seems on the surface
the problem is that the hand-off between attention & MoE causes communication overhead (e.g. the expert lives on a different GPU)
ScMoE re-orders the pipeline so dense computation can overlap with that expert communication, better utilizing compute
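roughly the shape of the overlap trick, as I understand it; the function names and the equal-split assumption are mine, not the LongCat codebase

```python
# sketch of the overlap idea only, not their implementation: kick off the
# cross-GPU expert dispatch asynchronously, do dense work while tokens are in
# flight, then run the experts. route_and_pack / dense_ffn / experts_forward /
# unpack are stand-in names.
import torch
import torch.distributed as dist

def scmoe_block(x, dense_ffn, route_and_pack, experts_forward, unpack):
    # 1) pick experts for each token and pack tokens for dispatch
    packed, meta = route_and_pack(x)

    # 2) start the all-to-all across GPUs without blocking
    recv = torch.empty_like(packed)  # assumes equal splits, to keep the sketch short
    handle = dist.all_to_all_single(recv, packed, async_op=True)

    # 3) while tokens are in flight, run the dense shortcut branch locally
    shortcut_out = dense_ffn(x)

    # 4) wait for the dispatch, then run the experts; a fuller version would also
    #    overlap the return all-to-all with the next chunk of dense work
    handle.wait()
    expert_out = experts_forward(recv)
    return shortcut_out + unpack(expert_out, meta)
```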
i feel like this is the sort of shit you see when the US government locks down interconnect bandwidth but not compute itself. we saw something similar with DeepSeek slinging their own PTX instead of CUDA to get around the nerfed comms