Bluesky Thread

the experiment continues — recognizable text below 1M parameters(!!)


they exposed it to 1B tokens of the SYNTH dataset, probably can train longer
"torch _dtype is deprecated! Use "dtype instead!
PerceptronForCausalLM
(model): PerceptronModel
(embed_tokens): Embedding(8197, 64, padding_idx=3)
(hidden_layer): PerceptronDecoder Layer (
(self_attn): PerceptronAttention(
(q_proj: Linear(in_features=64, out_features=2048, bias=False) (k_proj): Linear(in_features=64, out_features=2048, bias=False) (v_proj): Linear(in_features=64, out_features=2048, bias=False) (o_proj): Linear(in_features=128, out_features=64, bias=False)
(q_norm): PerceptronRMSNorm((16,), eps=1e-06)
(k_ norm): PerceptronRMSNorm((16,), eps=1e-06)
(mlp): PerceptronMLP(
(gate_proj): Linear(in_features=64, out_features=256, bias=False) (up_proj): Linear(in_features=64, out_features=256, bias=False)
(down_proj): Linear (in_features=256, out_features=64, bias-False)
(act_fn): SiLUActivation()
(pooler): UnconventionalTalentRevealedHere(magical-16)
(input _layernorm): PerceptronRMSNorm((64, ), eps-1e-06)
(post attention_layernorm): PerceptronRMSNorm((64,), eps=1e-06)
(norm): PerceptronRMSNorm((64, ), eps=1e-06)
(rotary_emb): PerceptronRotaryEmbedding()
(Im_head): Linear(in_features=64, out_features=8197, bias=False)
PerceptronForCausalLM'> Total parameters: 975393, Trainable parameters: 975393 tensor ([[8192, 659,
174, 4365, 313, 238, 2014, 92, 7462, 34, 8193, 8192,
663,
174]])
<|im_start|>user
why is the sky blue?<|im_end|>
<|im_start|>assistant
<think>
Query: "why do some people in the other ones like the same way to make it"
Parse components:
- "try" » need specific numbers, not just "bad" → temporal comparison. "different ways" + temporal question.
### 1. Semantic parsing
"Basic" = "width-country" - ambiguous. • High confidence.
- "destrish" = "friendly" = "graid" = "math"
Mariusz Kurman @mkurman88
X.com
HA! Check this out, bro! :D First checkpoint of PERCEPTRON 0.975M!
High eval loss (3.91), so stay tuned, brothers and sisters.
And it's a fcking SINGLE hidden layer model :D Literally: embeddings → hidden layer → norm → embeddings
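
if you want to poke at the idea, here's a minimal sketch of that embeddings → single hidden layer → norm → tied-embeddings path in PyTorch; it keeps the printed vocab, hidden size and SwiGLU MLP shapes, but swaps in stock nn.MultiheadAttention and leaves out the custom pooler, the per-head q/k norms and the rotary embedding, so treat it as an approximation rather than the author's code

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoderLM(nn.Module):
    # illustrative sketch of a single-hidden-layer decoder-only LM with tied embeddings
    # (needs PyTorch >= 2.4 for nn.RMSNorm)
    def __init__(self, vocab_size=8197, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model, padding_idx=3)
        self.input_layernorm = nn.RMSNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.post_attention_layernorm = nn.RMSNorm(d_model)
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)   # SwiGLU-style MLP, matching
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)     # the gate/up/down projections
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)   # in the printout
        self.norm = nn.RMSNorm(d_model)

    def forward(self, input_ids):
        x = self.embed_tokens(input_ids)                         # embeddings
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.input_layernorm(x)
        attn_out, _ = self.self_attn(h, h, h, attn_mask=causal)  # the single hidden layer:
        x = x + attn_out                                         # attention...
        h = self.post_attention_layernorm(x)
        x = x + self.down_proj(F.silu(self.gate_proj(h)) * self.up_proj(h))  # ...then MLP
        x = self.norm(x)                                         # final norm
        return x @ self.embed_tokens.weight.T                    # tied lm_head: back to embeddings

logits = TinyDecoderLM()(torch.randint(0, 8197, (1, 14)))
print(logits.shape)  # torch.Size([1, 14, 8197])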
fwiw in the model printout further up, the “UnconventionalTalentRevealedHere(magical=8)” module is a set of activation functions

It’s a result of hacking: what used to be a pooling layer evolved into a crazy set of activation functions
Luke Chaj @luke_chaj
X.com
This is basically the llama arch. Check the pleias/SYNTH dataset. The approach in this paper is able to increase intelligence density by OOMs.
The problem is GPU utilization. On such a small scale, training on x86/ARM could be more feasible.
the reason why he’s doing this
N8 Programs @N8Programs • 17h
i'd always recommend using at least two hidden layers so you get access to induction heads
Mariusz Kurman @mkurman88
X.com
Think about it differently. If we can train a single-layer model with somewhat promising results, it indicates we're doing something fundamentally wrong with the current SOTAs
3 hours later
more explanation on what the activation functions are doing
Mariusz Kurman @mkurman88 • 1h
Translated from Polish
Here I am checking the impact of passing weights through various activation functions before they go through attention. Through experiments on a one-layer model, I confirmed (to myself) that what is key for learning is not so much the appropriate depth of the network as the appropriate quantity and quality of functions representing the widest possible range of values approximating the target variable. Yesterday I was checking the sum of all the functions; here I am checking their softmax.
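
my hedged reading of that in code (one possible interpretation, not the author's actual implementation): keep a bank of activation functions, apply each one to the same pre-attention tensor, and combine the results either as a plain sum or as a softmax-weighted mixture over learnable scores; whether the functions act on the hidden states or on the weights themselves isn't fully clear from the post

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActivationBank(nn.Module):
    # illustrative sketch only: combine several activation functions by sum or softmax mixing
    def __init__(self, mode="softmax"):
        super().__init__()
        self.fns = [torch.tanh, torch.sigmoid, torch.sin, F.silu, F.gelu, F.relu]
        self.mode = mode
        self.scores = nn.Parameter(torch.zeros(len(self.fns)))  # used by the softmax variant

    def forward(self, x):
        stacked = torch.stack([fn(x) for fn in self.fns])       # (n_fns, *x.shape)
        if self.mode == "sum":
            return stacked.sum(dim=0)                           # "the sum of all functions"
        weights = torch.softmax(self.scores, dim=0)             # "their softmax"
        return (weights.view(-1, *([1] * x.dim())) * stacked).sum(dim=0)

h = torch.randn(1, 14, 64)                 # stand-in for pre-attention hidden states
print(ActivationBank("softmax")(h).shape)  # torch.Size([1, 14, 64])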

