Bluesky Thread

LongCat-Flash-Chat (560B)


uh, holy shit this one is intriguing. bare minimum they compare themselves to all the (actual) top models and do okay

but inside.. damn this one has some cool ideas

huggingface.co/meituan-long...
The image is a multi-panel bar chart comparing the performance of different large language models across several benchmarks. It is divided into four categories: General Domains, Agentic Tool Use, Code, and Instruction Following. Each panel shows one benchmark, with scores on the y-axis.

Top row – General Domains:
	•	ArenaHard-V2: LongCat-Flash 86.5, Kimi K2 88.2, DeepSeek V3.1 84.1, Claude Sonnet 61.5, GPT-4.1 62.1, Qwen3 MoE-2507 85.7, Gemini 2.5 Flash 77.0.
	•	MMLU-Pro: Kimi K2 84.5, DeepSeek V3.1 84.5, LongCat-Flash 82.7, Qwen3 MoE-2507 82.1, GPT-4.1 81.7, Claude Sonnet 83.7, Gemini 2.5 Flash 82.0.

Top row – Agentic Tool Use:
	•	τ²-Bench (average): LongCat-Flash 67.7, Kimi K2 64.2, Claude Sonnet 62.1, GPT-4.1 55.1, DeepSeek V3.1 49.8, Qwen3 MoE-2507 43.0, Gemini 2.5 Flash 40.9.
	•	VitaBench: LongCat-Flash 24.3, Claude Sonnet 23.0, DeepSeek V3.1 20.3, Kimi K2 18.2, GPT-4.1 19.0, Qwen3 MoE-2507 8.5, Gemini 2.5 Flash 8.0.

Bottom row – Code:
	•	SWE-Bench-Verified: Claude Sonnet 68.0, Kimi K2 64.6, DeepSeek V3.1 66.0, LongCat-Flash 60.4, GPT-4.1 48.6, Qwen3 MoE-2507 42.0, Gemini 2.5 Flash 40.6.
	•	TerminalBench: Claude Sonnet 40.7, LongCat-Flash 39.5, DeepSeek V3.1 31.3, GPT-4.1 28.4, Kimi K2 25.9, Qwen3 MoE-2507 17.3, Gemini 2.5 Flash 12.4.

Bottom row – Instruction Following:
	•	COLLIE: LongCat-Flash 57.1, Kimi K2 56.3, Claude Sonnet 51.2, GPT-4.1 50.0, DeepSeek V3.1 49.7, Gemini 2.5 Flash 48.6, Qwen3 MoE-2507 43.8.
	•	Meeseeks (ZH): LongCat-Flash 43.0, Kimi K2 42.8, Claude Sonnet 41.5, GPT-4.1 35.1, DeepSeek V3.1 35.3, Qwen3 MoE-2507 33.8, Gemini 2.5 Flash 34.8.

most interesting — dynamic computation

not only is it a fairly sparse MoE, each token can dynamically receive more or less compute, with a PID controller adjusting the router bias to keep the average load steady 🤯

so when it hits a token that needs extra thought, it activates more (real) experts for that token and spends more compute on it
LongCat-Flash is designed and optimized under two key principles: efficient computation utilization, as well as efficient training and inference. Specifically, (1) As not all tokens are equal, we introduce the zero-computation experts mechanism in MoE blocks to allocate a dynamic computation budget to important tokens based on their significance, i.e., activating 18.6 to 31.3 billion parameters (out of 560 billion total) based on contextual demands. To ensure consistent computation load, we employ expert bias adjusted by a PID-controller, maintaining an average of ~27 billion activated parameters per token. (2) As communication overhead becomes a bottleneck during MoE model scaling, we incorporate the Shortcut-connected MoE (ScMoE) design to expand the computation-communication overlap window. Combined with customized infrastructure optimizations, this design enables training at a massive scale of over tens of thousands of accelerators, and inference with high throughput and low latency.
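
Since the excerpt only says the expert bias is "adjusted by a PID-controller," here's a rough sketch of how that kind of loop could work: measure how many real (non-zero) experts get picked per token, then push the zero-computation experts' router bias up or down so the average stays on target. Everything below (the `ZeroExpertBiasPID` class, the toy sizes, the gains) is my own illustration, not the released code.

```python
import torch

N_REAL, N_ZERO, TOP_K = 8, 2, 2      # toy sizes, not the paper's
TARGET_REAL_PER_TOKEN = 1.5          # assumed compute target per token

class ZeroExpertBiasPID:
    """Nudges the zero-computation experts' router bias so the average number
    of real experts selected per token tracks a target, i.e. roughly constant
    activated parameters per token."""
    def __init__(self, kp=0.5, ki=0.05, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = 0.0
        self.zero_bias = 0.0         # added to the zero experts' logits

    def step(self, avg_real_per_token):
        err = avg_real_per_token - TARGET_REAL_PER_TOKEN   # >0: using too much compute
        self.integral += err
        deriv = err - self.prev_err
        self.prev_err = err
        # too much compute -> raise the zero experts' bias so more tokens skip work
        self.zero_bias += self.kp * err + self.ki * self.integral + self.kd * deriv
        return self.zero_bias

pid = ZeroExpertBiasPID()
for _ in range(10):                                  # one "batch" per iteration
    logits = torch.randn(32, N_REAL + N_ZERO)        # router logits for 32 tokens
    logits[:, N_REAL:] += pid.zero_bias              # bias only the zero experts
    picked = logits.topk(TOP_K, dim=-1).indices
    avg_real = (picked < N_REAL).float().sum(dim=-1).mean().item()
    pid.step(avg_real)                               # feedback for the next batch
```
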
how are chinese labs cutting their dependence on NVIDIA? like this:

run experiments on tiny models, transfer hyperparameters (result of experiments) to a far larger model for the yolo run

bsky.app/profile/timk...
We successfully apply a hyperparameter transfer strategy to such a large model, predicting optimal hyperparameter configurations by leveraging results from smaller proxy models with theoretical guarantees.
Tim Kellogg @timkellogg.me
DeepSeek is reducing their dependence on NVIDIA

they do small scale training runs & experiments on Huawei Ascend, but yolo runs on NVIDIA
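
For the curious: "hyperparameter transfer from smaller proxy models" usually means something like a muP-style width-scaling rule, where you sweep the learning rate on a narrow proxy and map the winner onto the big model. The excerpt doesn't say exactly which rule LongCat uses, so the function and numbers below are purely illustrative.

```python
# Illustrative sketch only; not LongCat's actual recipe. One common muP-style
# rule of thumb: the hidden-layer learning rate scales roughly like 1/width,
# so the LR that won the sweep on a small proxy gets shrunk by the width ratio.
def transfer_lr(proxy_lr: float, proxy_width: int, target_width: int) -> float:
    return proxy_lr * (proxy_width / target_width)

# e.g. best LR from a grid search on a 512-wide proxy, moved to an 8192-wide model
best_proxy_lr = 3e-3                                   # made-up sweep result
big_model_lr = transfer_lr(best_proxy_lr, proxy_width=512, target_width=8192)
print(big_model_lr)                                    # 0.0001875
```
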
1 hour later
oh this took me too long to figure out — the "zero computation experts"

they have a (mostly) regular MoE router, but some of the experts are actually nothing at all. So the MoE router sometimes entirely skips experts
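
A toy sketch of that idea (my own illustration, not the LongCat code): the router scores both real FFN experts and a few placeholder "zero-computation" slots, and any token routed to a placeholder simply skips the FFN work and rides the residual stream instead. All names and sizes here are made up.

```python
import torch
import torch.nn as nn

class ToyZeroComputeMoE(nn.Module):
    """MoE block where the last n_zero router slots are 'experts' that do no work."""
    def __init__(self, d_model=64, n_real=4, n_zero=2, top_k=2):
        super().__init__()
        self.n_real, self.top_k = n_real, top_k
        self.router = nn.Linear(d_model, n_real + n_zero)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_real)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):             # naive per-token loop, for clarity
            for w, e in zip(weights[t], idx[t]):
                if int(e) < self.n_real:       # real expert: run the FFN
                    out[t] += w * self.experts[int(e)](x[t])
                # else: zero-computation slot, no FFN work at all; the token
                # just passes through via the surrounding residual connection
        return out

y = ToyZeroComputeMoE()(torch.randn(8, 64))    # 8 tokens, some get less compute
```
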
the "shortcut-connected MoE" part is solving a more complex problem than it seems on the surface

the problem is the hand-off between attention & MoE causes communication overhead (e.g. expert is located on a different GPU)

ScMoE re-orders the pipeline so the communication overlaps with compute instead of blocking it
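
A toy timing simulation of why that reordering pays off (pure illustration, nothing from the repo; the sleep times are arbitrary stand-ins): run the all-to-all dispatch and the local dense/shortcut branch back to back and you pay for both, but kick off the dispatch first and let the local branch run while it's in flight and most of the comm cost hides behind compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def all_to_all_dispatch():        # stand-in for shipping tokens to experts on other GPUs
    time.sleep(0.30)
    return "tokens dispatched"

def local_dense_branch():         # stand-in for the shortcut FFN / attention work
    time.sleep(0.25)
    return "local result"

# naive ordering: wait for the communication, then compute  (~0.55 s)
t0 = time.time()
all_to_all_dispatch()
local_dense_branch()
print(f"sequential: {time.time() - t0:.2f}s")

# ScMoE-style ordering: start the dispatch, overlap the local compute  (~0.30 s)
t0 = time.time()
with ThreadPoolExecutor() as pool:
    comm = pool.submit(all_to_all_dispatch)
    local_dense_branch()          # runs while the dispatch is "in flight"
    comm.result()
print(f"overlapped: {time.time() - t0:.2f}s")
```
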
i feel like this is the sort of shit you see when the US Government locks down compute bandwidth but not compute itself. We saw something similar with DeepSeek slinging their own PTX instead of CUDA to get around the nerfed comms