Bluesky Thread

AttentionInfluence: for pretraining data selection

Good data matters, but how do you find it?

This paper uses the attention heads from existing models to calculate & rank how valuable the data will be during training

Mask out the critical heads and compare the loss with the unmasked model: the bigger the gap, the more valuable the data

arxiv.org/abs/2505.07293
This diagram explains the paper's pipeline: **detecting important attention heads** in a small base LLM, then masking them to compute an **AttentionInfluence Score** that ranks candidate pretraining data.

---

### **Left Section: Detecting Specific Important Heads**

1. **Input Format**:

   * A hashed question (e.g., `e3b0c4...27ae`) with a known **answer** (e.g., “Parents are usually ...”).
   * A few-shot prompt of `<Q,A>` pairs is used to guide the model.

2. **Base LLM**:

   * The small base LLM answers the prompt while its attention patterns are recorded.
   * Each attention head is checked for how strongly it attends to (i.e., retrieves) the answer tokens hidden in the context.

3. **Scoring**:

   * Each head receives a retrieval score (`Ret_i`) reflecting how reliably it locates the answer.
   * A high score (e.g., `0.85`) flags that head as an important (retrieval) head.
   * Low scores (e.g., `0.11`, `0.22`) mark heads that contribute little to retrieving the answer. A rough sketch of one way to compute such scores follows this list.
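
Below is a rough, illustrative sketch of how such per-head retrieval scores could be computed from the model's attention maps: a head scores highly if, at the step where an answer token is generated, its strongest attention lands on that token's position in the context. The function name, tensor shapes, and exact scoring rule are assumptions for illustration, not the paper's code.

```python
import torch

def retrieval_scores(attentions, answer_positions, gen_steps):
    """Per-head retrieval scores.

    attentions:       list over layers of tensors (n_heads, tgt_len, src_len)
    answer_positions: context positions holding the answer tokens
    gen_steps:        target positions at which those answer tokens are produced
    returns:          (n_layers, n_heads) tensor of scores in [0, 1]
    """
    n_layers, n_heads = len(attentions), attentions[0].shape[0]
    scores = torch.zeros(n_layers, n_heads)
    for layer, attn in enumerate(attentions):
        for step, src_pos in zip(gen_steps, answer_positions):
            top_src = attn[:, step, :].argmax(dim=-1)      # (n_heads,) most-attended position
            scores[layer] += (top_src == src_pos).float()  # +1 when the head hits the answer
    return scores / len(gen_steps)                         # fraction of answer tokens retrieved

# Toy usage with random attention maps (6 layers, 12 heads, 20 tokens):
attns = [torch.rand(12, 20, 20).softmax(dim=-1) for _ in range(6)]
print(retrieval_scores(attns, answer_positions=[3, 4, 5], gen_steps=[15, 16, 17]))
```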

---

### **Top Center: Masking Heads**

* Heads are disabled by **masking** parts of the attention matrix (turning sections into 0s).
* This allows controlled experimentation on attention head impact.
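
As a minimal illustration, here is how such a mask might be applied to one layer's attention weights in PyTorch; the shapes and the choice of head are made up for this example.

```python
import torch

batch, n_heads, seq_len = 1, 12, 8
# Post-softmax attention weights for one layer: (batch, n_heads, tgt_len, src_len)
attn_weights = torch.softmax(torch.randn(batch, n_heads, seq_len, seq_len), dim=-1)

head_mask = torch.ones(n_heads)   # 1 = keep the head, 0 = disable it
head_mask[5] = 0.0                # e.g. disable head 5 (an arbitrary choice here)

# Broadcasting zeroes every attention entry of the masked head,
# so that head contributes nothing to the layer's output.
masked_weights = attn_weights * head_mask.view(1, n_heads, 1, 1)
```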

---

### **Right Section: Calculating AttentionInfluence Score**

1. **Base vs Reference LLM**:

   * The **base LLM**, with all attention heads active, produces token probabilities (`p₀`, `p₁`, `p₂`) for a candidate pretraining sample.
   * A **reference LLM**, identical except that the important heads are masked, does the same.

2. **Loss Computation**:

   * Compute the language-modeling losses on the sample: `L_Base` (all heads active) and `L_Ref` (important heads masked).

3. **Formula**:

   ```
   AttentionInfluence Score = (L_Ref - L_Base) / L_Base
   ```

   * This measures how much the loss rises when the important heads are masked: the larger the gap, the more the sample relies on what those heads do, and the higher it is ranked for inclusion in the pretraining set (see the sketch below).
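
A minimal end-to-end sketch of this scoring step, using a Hugging Face GPT-2 model only because its `forward()` accepts a `head_mask`; the list of important heads below is hypothetical, and the paper uses its own small pretrained model rather than GPT-2.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Hypothetical (layer, head) pairs flagged as important in the detection step.
IMPORTANT_HEADS = [(3, 5), (7, 1), (9, 11)]

def sample_loss(text: str, mask_heads: bool) -> float:
    """Average next-token loss on `text`, optionally with the important heads disabled."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    head_mask = torch.ones(model.config.n_layer, model.config.n_head)
    if mask_heads:
        for layer, head in IMPORTANT_HEADS:
            head_mask[layer, head] = 0.0   # zero out this head's attention
    with torch.no_grad():
        out = model(ids, labels=ids, head_mask=head_mask)
    return out.loss.item()

def attention_influence(text: str) -> float:
    """Relative loss increase caused by masking the important heads."""
    loss_base = sample_loss(text, mask_heads=False)   # L_Base: all heads active
    loss_ref = sample_loss(text, mask_heads=True)     # L_Ref: important heads masked
    return (loss_ref - loss_base) / loss_base

# Rank candidate pretraining samples and keep the highest-scoring ones.
samples = ["Parents are usually ...", "lorem ipsum dolor sit amet"]
print(sorted(samples, key=attention_influence, reverse=True))
```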

---

### **Summary**:

The framework selects pretraining data by:

* Detecting the important (retrieval) attention heads of a small base LLM.
* Masking those heads and scoring each candidate sample by the relative loss increase, the AttentionInfluence score.
* Keeping the highest-scoring samples, so the curated corpus emphasizes data that exercises the capabilities those heads support.

As a side benefit, the head-level analysis also offers some interpretability into what the model has learned.
7 hours later
This directly cuts costs and energy use during pretraining, the most expensive part of LLM training

Here they cut down a dataset to less than 1/3 the size and gained 1-5% on benchmarks across the board