the experiment continues — recognizable text below 1M parameters(!!)
they exposed it to 1B tokens of the SYNTH dataset, and it can probably train longer
fwiw, in the code above, “UnconventionalTalentRevealedHere(magical=8)” is the set of activation functions
It’s the result of hacking: what used to be a pooling layer, he evolved into a crazy set of activation functions
the reason why he’s doing this
3 hours later
more explanation on what the activation functions are doing