Bluesky Thread

As LLMs Improve, People Adapt Their Prompts


a study shows that a lot of the real world performance gains that people see are actually because people learn how to use the model better

arxiv.org/abs/2407.14333
The chart presents the decomposition of Average Treatment Effect (ATE) on cosine similarity into two components: Model Effect (red) and Prompting Effect (blue).
	•	Y-axis: Δ Cosine Similarity (change in similarity).
	•	X-axis: The source of prompts (top labels) and the replay model used (bottom labels).
	•	Points and error bars: Represent mean effects with 95% confidence intervals, bootstrapped and clustered by participant.

Breakdown:
	1.	DALL-E 2 → DALL-E 2 (baseline): Δ Cosine Similarity is ~0, establishing the reference point.
	2.	DALL-E 2 prompts replayed on DALL-E 3: Shows a Model Effect (increase ~0.007–0.008). This isolates the improvement attributable to the newer model when given the same prompts.
	3.	DALL-E 3 prompts replayed on DALL-E 3 vs DALL-E 2 prompts on DALL-E 3: The additional boost is attributed to the Prompting Effect (~0.006–0.007).
	4.	Total ATE (black bracket): When prompts written for DALL-E 3 are used on DALL-E 3, the improvement in cosine similarity reaches ~0.016–0.018 (the decomposition identity is written out after this list).
	5.	DALL-E 3 prompts replayed on DALL-E 2: Effect is small, close to baseline, showing the limited benefit of improved prompts without the newer model.
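
Written out, the black-bracketed total is the telescoping sum of the two deltas above. The notation below is introduced here for convenience and is not taken from the paper: s̄_{p→m} denotes the mean cosine similarity when prompts written for DALL-E p are replayed on DALL-E m.

```latex
% \bar{s}_{p \to m}: mean cosine similarity when prompts written for
% DALL-E p are replayed on DALL-E m (notation assumed for this sketch)
\mathrm{ATE}
  = \bar{s}_{3 \to 3} - \bar{s}_{2 \to 2}
  = \underbrace{\left(\bar{s}_{2 \to 3} - \bar{s}_{2 \to 2}\right)}_{\text{Model Effect}}
  + \underbrace{\left(\bar{s}_{3 \to 3} - \bar{s}_{2 \to 3}\right)}_{\text{Prompting Effect}}
```

By this construction, the black ATE bracket in the chart is exactly the sum of the red (model) and blue (prompting) components.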

Summary (from caption):
	•	ATE (black) = Model Effect (red) + Prompting Effect (blue).
	•	Model upgrades (DALL-E 3 vs DALL-E 2) and better prompt designs both contribute to improved performance.
	•	Better prompts on the older model alone yield little benefit; on the newer model, the model effect (~0.008) and the prompting effect (~0.007) contribute roughly comparable shares of the total gain, with the model effect slightly larger (a small computational sketch of the decomposition follows).
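
A minimal sketch of how this decomposition and its participant-clustered bootstrap intervals could be computed, assuming a long-format table with one row per generation. All column names, function names, and the choice of a simple cluster bootstrap here are illustrative assumptions, not the paper's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: participant, prompt_model ("dalle2"/"dalle3"),
# replay_model ("dalle2"/"dalle3"), cosine_sim (similarity of the output
# to the target it was supposed to reproduce).
def cell_mean(df, prompt_model, replay_model):
    m = (df["prompt_model"] == prompt_model) & (df["replay_model"] == replay_model)
    return df.loc[m, "cosine_sim"].mean()

def decompose(df):
    s22 = cell_mean(df, "dalle2", "dalle2")   # baseline
    s23 = cell_mean(df, "dalle2", "dalle3")   # old prompts, new model
    s33 = cell_mean(df, "dalle3", "dalle3")   # new prompts, new model
    model_effect = s23 - s22                  # better model, same prompts
    prompting_effect = s33 - s23              # adapted prompts, same model
    return model_effect, prompting_effect, model_effect + prompting_effect  # last = ATE

def clustered_bootstrap(df, n_boot=2000, seed=0):
    """Resample whole participants (clusters) with replacement, then recompute."""
    rng = np.random.default_rng(seed)
    participants = df["participant"].unique()
    groups = {p: g for p, g in df.groupby("participant")}
    draws = []
    for _ in range(n_boot):
        sample = rng.choice(participants, size=len(participants), replace=True)
        boot_df = pd.concat([groups[p] for p in sample], ignore_index=True)
        draws.append(decompose(boot_df))
    lo, hi = np.percentile(np.array(draws), [2.5, 97.5], axis=0)
    return decompose(df), list(zip(lo, hi))   # point estimates + 95% CIs
```

Resampling whole participants rather than individual rows keeps each person's prompts together, which is what "clustered by participant" in the figure description refers to.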
reinforcement learning doesn’t stop after model deployment
44 likes 9 reposts
