Bluesky Thread

AttentionInfluence: for pretraining data selection

Good data matters, but how do you find it?

This paper uses the attention heads from existing models to calculate & rank how valuable the data will be during training

Mask out the critical heads and compare the loss with the unmasked model: the bigger the gap, the more valuable the data

arxiv.org/abs/2505.07293
This diagram explains the paper's pipeline: **detecting important attention heads** in a small base LLM, then masking them to compute an **AttentionInfluence Score** that ranks candidate pretraining data.

---

### **Left Section: Detecting Specific Important Heads**

1. **Input Format**:

   * A hashed question (e.g., `e3b0c4...27ae`) with a known **answer** (e.g., “Parents are usually ...”).
   * A few-shot prompt of `<Q,A>` pairs is used to guide the model.

2. **Base LLM**:

   * The small base LLM answers the prompt while its attention patterns are recorded.
   * Each attention head is checked for how strongly it attends to (i.e., retrieves) the answer tokens hidden in the context.

3. **Scoring**:

   * Each head receives a retrieval score (`Ret_i`) reflecting how reliably it locates the answer.
   * A high score (e.g., `0.85`) flags that head as an important (retrieval) head.
   * Low scores (e.g., `0.11`, `0.22`) mark heads that contribute little to retrieving the answer. A rough sketch of one way to compute such scores follows this list.
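
Below is a rough, illustrative sketch of how such per-head retrieval scores could be computed from the model's attention maps: a head scores highly if, at the step where an answer token is generated, its strongest attention lands on that token's position in the context. The function name, tensor shapes, and exact scoring rule are assumptions for illustration, not the paper's code.

```python
import torch

def retrieval_scores(attentions, answer_positions, gen_steps):
    """Per-head retrieval scores.

    attentions:       list over layers of tensors (n_heads, tgt_len, src_len)
    answer_positions: context positions holding the answer tokens
    gen_steps:        target positions at which those answer tokens are produced
    returns:          (n_layers, n_heads) tensor of scores in [0, 1]
    """
    n_layers, n_heads = len(attentions), attentions[0].shape[0]
    scores = torch.zeros(n_layers, n_heads)
    for layer, attn in enumerate(attentions):
        for step, src_pos in zip(gen_steps, answer_positions):
            top_src = attn[:, step, :].argmax(dim=-1)      # (n_heads,) most-attended position
            scores[layer] += (top_src == src_pos).float()  # +1 when the head hits the answer
    return scores / len(gen_steps)                         # fraction of answer tokens retrieved

# Toy usage with random attention maps (6 layers, 12 heads, 20 tokens):
attns = [torch.rand(12, 20, 20).softmax(dim=-1) for _ in range(6)]
print(retrieval_scores(attns, answer_positions=[3, 4, 5], gen_steps=[15, 16, 17]))
```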

---

### **Top Center: Masking Heads**

* Heads are disabled by **masking** parts of the attention matrix (turning sections into 0s).
* This allows controlled experimentation on attention head impact.
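
As a minimal illustration, here is how such a mask might be applied to one layer's attention weights in PyTorch; the shapes and the choice of head are made up for this example.

```python
import torch

batch, n_heads, seq_len = 1, 12, 8
# Post-softmax attention weights for one layer: (batch, n_heads, tgt_len, src_len)
attn_weights = torch.softmax(torch.randn(batch, n_heads, seq_len, seq_len), dim=-1)

head_mask = torch.ones(n_heads)   # 1 = keep the head, 0 = disable it
head_mask[5] = 0.0                # e.g. disable head 5 (an arbitrary choice here)

# Broadcasting zeroes every attention entry of the masked head,
# so that head contributes nothing to the layer's output.
masked_weights = attn_weights * head_mask.view(1, n_heads, 1, 1)
```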

---

### **Right Section: Calculating AttentionInfluence Score**

1. **Base vs Reference LLM**:

   * The **base LLM**, with all attention heads active, produces token probabilities (`p₀`, `p₁`, `p₂`) for a candidate pretraining sample.
   * A **reference LLM**, identical except that the important heads are masked, does the same.

2. **Loss Computation**:

   * Compute the language-modeling losses on the sample: `L_Base` (all heads active) and `L_Ref` (important heads masked).

3. **Formula**:

   ```
   AttentionInfluence Score = (L_Ref - L_Base) / L_Base
   ```

   * This measures how much the loss rises when the important heads are masked: the larger the gap, the more the sample relies on what those heads do, and the higher it is ranked for inclusion in the pretraining set (see the sketch below).
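
A minimal end-to-end sketch of this scoring step, using a Hugging Face GPT-2 model only because its `forward()` accepts a `head_mask`; the list of important heads below is hypothetical, and the paper uses its own small pretrained model rather than GPT-2.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Hypothetical (layer, head) pairs flagged as important in the detection step.
IMPORTANT_HEADS = [(3, 5), (7, 1), (9, 11)]

def sample_loss(text: str, mask_heads: bool) -> float:
    """Average next-token loss on `text`, optionally with the important heads disabled."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    head_mask = torch.ones(model.config.n_layer, model.config.n_head)
    if mask_heads:
        for layer, head in IMPORTANT_HEADS:
            head_mask[layer, head] = 0.0   # zero out this head's attention
    with torch.no_grad():
        out = model(ids, labels=ids, head_mask=head_mask)
    return out.loss.item()

def attention_influence(text: str) -> float:
    """Relative loss increase caused by masking the important heads."""
    loss_base = sample_loss(text, mask_heads=False)   # L_Base: all heads active
    loss_ref = sample_loss(text, mask_heads=True)     # L_Ref: important heads masked
    return (loss_ref - loss_base) / loss_base

# Rank candidate pretraining samples and keep the highest-scoring ones.
samples = ["Parents are usually ...", "lorem ipsum dolor sit amet"]
print(sorted(samples, key=attention_influence, reverse=True))
```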

---

### **Summary**:

The framework selects pretraining data by:

* Detecting the important (retrieval) attention heads of a small base LLM.
* Masking those heads and scoring each candidate sample by the relative loss increase, the AttentionInfluence score.
* Keeping the highest-scoring samples, so the curated corpus emphasizes data that exercises the capabilities those heads support.

As a side benefit, the head-level analysis also offers some interpretability into what the model has learned.
7 hours later
This directly cuts costs and energy use during pretraining, the most expensive part of LLM training

Here they cut down a dataset to less than 1/3 the size and gained 1-5% on benchmarks across the board