Bluesky Thread

ThinkyMachines: Tinker LoRA training API

ThinkingMachines announced their first product, Tinker, telegraphed by a highly detailed blog post earlier this week that gave legitimacy to LoRA training

the idea: LoRA works really well for most companies, so Tinker makes it easy to train one, too

thinkingmachines.ai/tinker/
Tinker
Tinker is a training API for researchers and developers.
The Blog! let’s break it down!

gist: LoRA is just as good as Full Finetuning (FullFT) as long as your data is small and you’re not doing pretraining

it works extremely well for RL, which should make sense: RL is very sparse on rewards, so there's little information per episode for the adapter to absorb, and a low-capacity LoRA is enough

thinkingmachines.ai/blog/lora/
LoRA Without Regret
How LoRA matches full training performance more broadly than expected.
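
what's happening under the hood, roughly. A minimal illustrative sketch in PyTorch (not Tinker's or the blog's code): the base weight is frozen and only a small low-rank delta trains, so capacity scales with the rank r.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # FullFT would train W; LoRA freezes it
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: delta starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x, written for batched inputs
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65,536 trainable params vs ~16.8M for the full matrix
```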
LoRA works best when applied to all parts of the model

i.e. attention-only doesn’t work well

for MoE, that means you need training data that exercises all the experts, which makes LoRA on MoE models quite a bit harder
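
what "all parts of the model" looks like in practice: a hedged sketch using Hugging Face PEFT (the module names are the Llama conventions, and the model id and rank are just illustrative, not the blog's exact setup).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

config = LoraConfig(
    r=32,                # higher rank = more capacity, closer to FullFT
    lora_alpha=32,
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP, the part attention-only LoRA skips
    ],
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```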
Hyperparameters: Rank

the higher the rank, the bigger the capacity. And yes, it absolutely can approach FullFT, especially on small datasets
The image shows Figure 1: LoRA training curves for various ranks on the Tulu3 and OpenThoughts3 datasets. Four line plots of Test NLL (negative log-likelihood) vs. training step (log scale), with curves color-coded by rank: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and full.

• Top-left (Llama 3.1 8B, Tulu3): Test NLL falls steadily with steps; higher ranks approach full training, while lower ranks flatten out earlier.
• Top-right (Llama 3.1 8B, OpenThoughts): same trend, with loss decreasing roughly linearly in log-steps; high-rank LoRA closely matches FullFT, low-rank runs diverge sooner.
• Bottom-left (Llama 3.2 1B, Tulu3): high-rank LoRA matches or even outperforms FullFT; lower ranks diverge earlier.
• Bottom-right (Llama 3.2 1B, OpenThoughts): high-rank LoRA underperforms FullFT on this dataset.

Takeaway: FullFT and high-rank LoRA show near-identical learning curves; lower-rank LoRA plateaus earlier for lack of capacity; and the 1B results (better than FullFT on Tulu3, worse on OpenThoughts) point to dataset-specific variation in training dynamics or generalization.
Batch size: don't go too big; loss can degrade fast with large batches

Attention-only underperforms MLP-only consistently
RL on LoRA can match FullFT, even at ranks as low as 1

THIS IS HUGE

if you’ve been thinking about RL or RL environments, you should absolutely be thinking about LoRA. it would be idiotic to not consider it

also: LoRAs stack, so all these RL environments can be shared in new ways
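
a hedged sketch of stacking with Hugging Face PEFT rather than the Tinker API (the adapter repos and names below are hypothetical): load several small RL-trained adapters onto one base model and combine them.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Hypothetical adapter repos, each trained on a different RL environment.
model = PeftModel.from_pretrained(base, "my-org/rl-math-lora", adapter_name="math")
model.load_adapter("my-org/rl-coding-lora", adapter_name="coding")

# Combine the two adapters with equal weights (assumes they share the same rank).
model.add_weighted_adapter(
    adapters=["math", "coding"],
    weights=[0.5, 0.5],
    adapter_name="math_plus_coding",
    combination_type="linear",
)
model.set_adapter("math_plus_coding")
```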