huge 1T+ models are fascinating bc they’re like tree rings. they take so long to train that several evolutions of LLM architecture happen during the process
in this case, DeepSeek in January was unignorable, but Behemoth was likely too deep into training to change course. hence scout & maverick
part of the calculation: if they pause training on behemoth to do scout & maverick, maybe some new post-training innovations will land while behemoth waits