Bluesky Thread

this feels like a very big deal

2 trillion tokens of permissively licensed text & code, so you can train (actually) open LLMs

and data acquisition is one of the more expensive & complex aspects of training an LLM, so hopefully we see an acceleration

huggingface.co/blog/Pclangl...
Releasing the largest multilingual open pretraining dataset
A blog post by Pierre-Carl Langlais on Hugging Face
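
(if you want to poke at the data yourself, here's a minimal sketch using the `datasets` library — the repo id `PleIAs/common_corpus` is my assumption from the linked post, so check the blog for the exact name; streaming avoids downloading all ~2T tokens up front)

```python
# Minimal sketch: stream the corpus rather than downloading it in full.
# The repo id "PleIAs/common_corpus" is an assumption based on the linked
# blog post -- verify the exact dataset name there.
from datasets import load_dataset

ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
for i, record in enumerate(ds):
    # Assumes a "text" field; adjust to the dataset's actual schema.
    print(record["text"][:200])
    if i >= 2:
        break
```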
minor complaint: how can you make bold claims about the number of tokens when you’re agnostic to the LLM? the actual token count will vary with the tokenizer. but anyway, point taken: ~2×10^12
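
(to make the tokenizer point concrete, here's a quick sketch counting the same string under two tokenizers — the model names are arbitrary public examples, not tied to this dataset)

```python
# Quick illustration: the same text yields different token counts
# under different tokenizers. Model names are arbitrary examples.
from transformers import AutoTokenizer

text = "2 trillion tokens of permissively licensed text & code"
for name in ("gpt2", "bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
```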
21 likes · 5 reposts
