AttentionInfluence: for pretraining data selection
Good data matters, but how do you find it?
This paper scores and ranks candidate pretraining data using the attention heads of an existing, smaller pretrained model, with no extra training required
Mask out the critical heads and recompute each sample's loss: the more a sample's loss rises without those heads, the more it exercises them, so it ranks higher as training data
arxiv.org/abs/2505.07293
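
Roughly how the scoring step works, as a minimal sketch. This assumes a HuggingFace Llama-style causal LM; the model name, the (layer, head) list, and the `mask_heads` helper are illustrative assumptions, not the paper's actual code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumption: any small Llama-style LM
# (layer, head) pairs previously identified as critical heads --
# hypothetical values for illustration.
CRITICAL_HEADS = [(10, 3), (14, 7), (20, 1)]

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mask_heads(heads):
    """Zero the output of the given attention heads by intercepting the
    per-head activations right before each layer's output projection."""
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    by_layer = {}
    for layer, head in heads:
        by_layer.setdefault(layer, []).append(head)
    hooks = []
    for layer, hs in by_layer.items():
        o_proj = model.model.layers[layer].self_attn.o_proj
        def pre_hook(module, args, hs=hs):
            (x,) = args  # shape: (batch, seq, num_heads * head_dim)
            x = x.clone()
            for h in hs:
                x[..., h * head_dim:(h + 1) * head_dim] = 0
            return (x,)
        hooks.append(o_proj.register_forward_pre_hook(pre_hook))
    return hooks

@torch.no_grad()
def sample_loss(text):
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def attention_influence(text):
    base = sample_loss(text)      # loss with the full model
    hooks = mask_heads(CRITICAL_HEADS)
    masked = sample_loss(text)    # loss with critical heads zeroed out
    for h in hooks:
        h.remove()
    # Samples whose loss rises most under masking depend most on the
    # critical heads -> rank them as the most valuable training data.
    return (masked - base) / base
```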
This directly cuts cost and energy use during pretraining, the most expensive phase of LLM training
Here they cut the dataset to less than a third of its original size and still gained 1-5% on benchmarks across the board