Bluesky Thread

As LLMs Improve, People Adapt Their Prompts


a study shows that a lot of the real world performance gains that people see are actually because people learn how to use the model better

arxiv.org/abs/2407.14333
The chart presents the decomposition of Average Treatment Effect (ATE) on cosine similarity into two components: Model Effect (red) and Prompting Effect (blue).
	•	Y-axis: Δ Cosine Similarity (change in similarity).
	•	X-axis: The source of prompts (top labels) and the replay model used (bottom labels).
	•	Points and error bars: Represent mean effects with 95% confidence intervals, bootstrapped and clustered by participant.

Breakdown:
	1.	DALL-E 2 → DALL-E 2 (baseline): Δ Cosine Similarity is ~0, establishing the reference point.
	2.	DALL-E 2 prompts replayed on DALL-E 3: Shows a Model Effect (increase ~0.007–0.008). This isolates the improvement attributable to the newer model when given the same prompts.
	3.	DALL-E 3 prompts replayed on DALL-E 3 vs DALL-E 2 prompts on DALL-E 3: The additional boost is attributed to the Prompting Effect (~0.006–0.007).
	4.	Total ATE (black bracket): When prompts written for DALL-E 3 are used on DALL-E 3, the improvement in cosine similarity reaches ~0.016–0.018 (the decomposition identity is written out after this list).
	5.	DALL-E 3 prompts replayed on DALL-E 2: Effect is small, close to baseline, showing the limited benefit of improved prompts without the newer model.
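
Written out, the black-bracketed total is the telescoping sum of the two deltas above. The notation below is introduced here for convenience and is not taken from the paper: s̄_{p→m} denotes the mean cosine similarity when prompts written for DALL-E p are replayed on DALL-E m.

```latex
% \bar{s}_{p \to m}: mean cosine similarity when prompts written for
% DALL-E p are replayed on DALL-E m (notation assumed for this sketch)
\mathrm{ATE}
  = \bar{s}_{3 \to 3} - \bar{s}_{2 \to 2}
  = \underbrace{\left(\bar{s}_{2 \to 3} - \bar{s}_{2 \to 2}\right)}_{\text{Model Effect}}
  + \underbrace{\left(\bar{s}_{3 \to 3} - \bar{s}_{2 \to 3}\right)}_{\text{Prompting Effect}}
```

By this construction, the black ATE bracket in the chart is exactly the sum of the red (model) and blue (prompting) components.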

Summary (from caption):
	•	ATE (black) = Model Effect (red) + Prompting Effect (blue).
	•	Model upgrades (DALL-E 3 vs DALL-E 2) and better prompt designs both contribute to improved performance.
	•	Better prompts on the older model alone yield little benefit; on the newer model, the model effect (~0.008) and the prompting effect (~0.007) contribute roughly comparable shares of the total gain, with the model effect slightly larger (a small computational sketch of the decomposition follows).
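
A minimal sketch of how this decomposition and its participant-clustered bootstrap intervals could be computed, assuming a long-format table with one row per generation. All column names, function names, and the choice of a simple cluster bootstrap here are illustrative assumptions, not the paper's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: participant, prompt_model ("dalle2"/"dalle3"),
# replay_model ("dalle2"/"dalle3"), cosine_sim (similarity of the output
# to the target it was supposed to reproduce).
def cell_mean(df, prompt_model, replay_model):
    m = (df["prompt_model"] == prompt_model) & (df["replay_model"] == replay_model)
    return df.loc[m, "cosine_sim"].mean()

def decompose(df):
    s22 = cell_mean(df, "dalle2", "dalle2")   # baseline
    s23 = cell_mean(df, "dalle2", "dalle3")   # old prompts, new model
    s33 = cell_mean(df, "dalle3", "dalle3")   # new prompts, new model
    model_effect = s23 - s22                  # better model, same prompts
    prompting_effect = s33 - s23              # adapted prompts, same model
    return model_effect, prompting_effect, model_effect + prompting_effect  # last = ATE

def clustered_bootstrap(df, n_boot=2000, seed=0):
    """Resample whole participants (clusters) with replacement, then recompute."""
    rng = np.random.default_rng(seed)
    participants = df["participant"].unique()
    groups = {p: g for p, g in df.groupby("participant")}
    draws = []
    for _ in range(n_boot):
        sample = rng.choice(participants, size=len(participants), replace=True)
        boot_df = pd.concat([groups[p] for p in sample], ignore_index=True)
        draws.append(decompose(boot_df))
    lo, hi = np.percentile(np.array(draws), [2.5, 97.5], axis=0)
    return decompose(df), list(zip(lo, hi))   # point estimates + 95% CIs
```

Resampling whole participants rather than individual rows keeps each person's prompts together, which is what "clustered by participant" in the figure description refers to.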
reinforcement learning doesn’t stop after model deployment
44 likes 9 reposts
