Bluesky Thread

this whole smol model thing that @dorialexander.bsky.social started is reminding me of entropix

independent researchers working out in the open on a wildly different concept
Tim Kellogg @timkellogg.me
the experiment continues — recognizable text below 1M parameters(!!)

they exposed it to 1B tokens of the SYNTH dataset, probably can train longer
"torch _dtype is deprecated! Use "dtype instead!
PerceptronForCausalLM
(model): PerceptronModel
(embed_tokens): Embedding(8197, 64, padding_idx=3)
(hidden_layer): PerceptronDecoder Layer (
(self_attn): PerceptronAttention(
(q_proj: Linear(in_features=64, out_features=2048, bias=False) (k_proj): Linear(in_features=64, out_features=2048, bias=False) (v_proj): Linear(in_features=64, out_features=2048, bias=False) (o_proj): Linear(in_features=128, out_features=64, bias=False)
(q_norm): PerceptronRMSNorm((16,), eps=1e-06)
(k_ norm): PerceptronRMSNorm((16,), eps=1e-06)
(mlp): PerceptronMLP(
(gate_proj): Linear(in_features=64, out_features=256, bias=False) (up_proj): Linear(in_features=64, out_features=256, bias=False)
(down_proj): Linear (in_features=256, out_features=64, bias-False)
(act_fn): SiLUActivation()
(pooler): UnconventionalTalentRevealedHere(magical-16)
(input _layernorm): PerceptronRMSNorm((64, ), eps-1e-06)
(post attention_layernorm): PerceptronRMSNorm((64,), eps=1e-06)
(norm): PerceptronRMSNorm((64, ), eps=1e-06)
(rotary_emb): PerceptronRotaryEmbedding()
(Im_head): Linear(in_features=64, out_features=8197, bias=False)
PerceptronForCausalLM'> Total parameters: 975393, Trainable parameters: 975393 tensor ([[8192, 659,
174, 4365, 313, 238, 2014, 92, 7462, 34, 8193, 8192,
663,
174]])
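One back-of-envelope check on that printout (my arithmetic, not from the thread): the per-module sizes only reach the reported total if lm_head shares weights with embed_tokens; counted separately they come to 1.5M. That matches the "embeddings → hidden layer → norm → embeddings" description further down.

```python
# Back-of-envelope parameter count for the printed architecture (my arithmetic,
# not from the thread). Counting lm_head separately would give exactly 1.5M, so
# the reported 975393 implies the output projection is tied to embed_tokens.
sizes = {
    "embed_tokens (shared with lm_head)":      8197 * 64,
    "q_proj + k_proj + v_proj":                3 * 64 * 2048,
    "o_proj":                                  128 * 64,  # in_features=128, as printed
    "q_norm + k_norm":                         2 * 16,
    "gate_proj + up_proj + down_proj":         3 * 64 * 256,
    "layernorms (input + post_attn + final)":  3 * 64,
}
print(sum(sizes.values()))  # 975392, one shy of the reported 975393
```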
<|im_start|>user
why is the sky blue?<|im_end|><|im_start|>assistant
<think>
Query: "why do some people in the other ones like the same way to make it"
Parse components:
- "try" » need specific numbers, not just "bad" → temporal comparison. "different ways" + temporal question.
### 1. Semantic parsing
"Basic" = "width-country" - ambiguous. • High confidence.
- "destrish" = "friendly" = "graid" = "math"
Mariusz Kurman @mkurman88 · X.com
HA! Check this out, bro! :D First checkpoint of PERCEPTRON 0.975M!
High eval loss (3.91), so stay tuned, brothers and sisters.
And it's a f*cking SINGLE hidden layer model :D Literally: embeddings → hidden layer → norm → embeddings
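That shape is easy to picture in plain PyTorch. Here's a rough, runnable sketch of it, with a standard causal attention + SwiGLU block standing in for the real PerceptronDecoderLayer (the printout's 128-wide o_proj, RoPE, and the q/k norms are simplified away):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough sketch of the single-hidden-layer shape described above, NOT the real
# PerceptronForCausalLM: one causal attention + SwiGLU block, a final RMSNorm,
# and an output projection tied to the input embedding matrix.
class TinyBlock(nn.Module):
    def __init__(self, d_model=64, d_attn=2048, d_mlp=256, head_dim=16):
        super().__init__()
        self.n_heads, self.head_dim = d_attn // head_dim, head_dim
        self.q = nn.Linear(d_model, d_attn, bias=False)
        self.k = nn.Linear(d_model, d_attn, bias=False)
        self.v = nn.Linear(d_model, d_attn, bias=False)
        self.o = nn.Linear(d_attn, d_model, bias=False)  # the real o_proj is 128 -> 64
        self.gate = nn.Linear(d_model, d_mlp, bias=False)
        self.up = nn.Linear(d_model, d_mlp, bias=False)
        self.down = nn.Linear(d_mlp, d_model, bias=False)
        self.ln1 = nn.RMSNorm(d_model, eps=1e-6)
        self.ln2 = nn.RMSNorm(d_model, eps=1e-6)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.ln1(x)
        q, k, v = (proj(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for proj in (self.q, self.k, self.v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.ln2(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))

class TinyLM(nn.Module):
    def __init__(self, vocab=8197, d_model=64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, d_model, padding_idx=3)
        self.hidden_layer = TinyBlock(d_model)
        self.norm = nn.RMSNorm(d_model, eps=1e-6)

    def forward(self, ids):
        h = self.norm(self.hidden_layer(self.embed_tokens(ids)))
        return h @ self.embed_tokens.weight.T  # tied lm_head: reuse embeddings

logits = TinyLM()(torch.randint(0, 8197, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 8197])
```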
the whole thing has me thinking about datasets — like maybe i just need to beef up my skills with dataset construction and synth generation

where do i go to learn that?
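For a taste of what "synth generation" means in practice, the usual pattern is: seed prompts go to a teacher model, and the responses are written out as chat-formatted training rows. A minimal sketch (the generate() stub and file name are placeholders, not anything from the thread):

```python
import json

# Minimal sketch of a synthetic-data generation loop, the pattern behind
# SYNTH-style datasets: seed prompts -> teacher model -> chat-formatted rows.
# `generate` is a placeholder stub; plug in whatever teacher model you use.
def generate(prompt: str) -> str:
    raise NotImplementedError("call your teacher model here")

seed_prompts = ["why is the sky blue?", "what causes tides?"]

with open("synth.jsonl", "w") as f:
    for prompt in seed_prompts:
        row = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": generate(prompt)},
        ]}
        f.write(json.dumps(row) + "\n")
```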
