Bluesky Thread

this feels like a very big deal

2 trillion tokens of permissively licensed text & code, so you can train (actually) open LLMs

and data acquisition is one of the more expensive & complex aspects of training an LLM, so hopefully we see an acceleration

huggingface.co/blog/Pclangl...
Releasing the largest multilingual open pretraining dataset
A blog post by Pierre-Carl Langlais on Hugging Face
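
(if you want to poke at the data yourself, here's a minimal sketch using the `datasets` library — the repo id `PleIAs/common_corpus` is my assumption from the linked post, so check the blog for the exact name; streaming avoids downloading all ~2T tokens up front)

```python
# Minimal sketch: stream the corpus rather than downloading it in full.
# The repo id "PleIAs/common_corpus" is an assumption based on the linked
# blog post -- verify the exact dataset name there.
from datasets import load_dataset

ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
for i, record in enumerate(ds):
    # Assumes a "text" field; adjust to the dataset's actual schema.
    print(record["text"][:200])
    if i >= 2:
        break
```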
minor complaint: how can you make bold claims about the number of tokens when you’re agnostic to the LLM? the actual token count will vary with the tokenizer. but anyway, point taken: ~2×10^12
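
(to make the tokenizer point concrete, here's a quick sketch counting the same string under two tokenizers — the model names are arbitrary public examples, not tied to this dataset)

```python
# Quick illustration: the same text yields different token counts
# under different tokenizers. Model names are arbitrary examples.
from transformers import AutoTokenizer

text = "2 trillion tokens of permissively licensed text & code"
for name in ("gpt2", "bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
```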
21 likes · 5 reposts
