Bluesky Thread

Persona Vectors


brb πŸ‘€πŸ‘€πŸ‘€πŸ‘€πŸ‘€πŸ‘€

Anthropic just dropped this paper. They can steer models quite effectively, and even detect training data that elicits a certain (e.g. evil) persona

arxiv.org/abs/2507.21509
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals...
alright, this changes things
While our methods are broadly applicable to a wide range of traits, we focus in particular on three traits that have been implicated in concerning real-world incidents: evil (malicious behavior), sycophancy (excessive agreeableness), and propensity to hallucinate (fabricate information).
4 hours later
i think my biggest takeaway so far is, "you're a sucker if you're still doing tedious things like typing and reading"

they use so many models: Sonnet 3.5, Sonnet 4, GPT-4.1, Qwen2.5, Llama 3.1, ...

and they use them for EVERYTHING. at most, the output gets spot-checked by a human
the gist is

1. state the trait
2. generate a bunch of positive & negative example responses
3. average the activations for each class
4. during inference, add the (pos minus neg) difference back in

this feels a lot like Golden Gate Claude-style steering, curious what the new part is...
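The four-step gist above can be sketched in numpy on toy "activations" (the dimensions, datasets, and planted direction here are all made up; the real vectors come from a transformer layer's residual stream):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: activations from responses that express the trait (pos)
# vs. responses that don't (neg). Toy vectors with a planted offset
# along one hidden coordinate stand in for real model activations.
d = 16
trait_dir = np.zeros(d)
trait_dir[3] = 1.0
pos = rng.normal(size=(100, d)) + 2.0 * trait_dir  # trait-expressing
neg = rng.normal(size=(100, d))                    # trait-free

# Step 3: persona vector = difference of the per-class mean activations.
persona_vec = pos.mean(axis=0) - neg.mean(axis=0)

# Step 4: at inference, shift a hidden state along the persona direction.
def steer(hidden, coeff):
    """Add coeff * persona_vec to a hidden state."""
    return hidden + coeff * persona_vec

h = rng.normal(size=d)
h_more = steer(h, +2.5)  # push toward the trait
h_less = steer(h, -2.5)  # suppress it
```

With enough examples the noise averages out and the recovered vector points (almost) exactly along the planted direction.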
2 hours later
looks like the MMLU (facts) score barely moves when any of them are suppressed (evil, sycophancy, hallucination)
A composite chart shows the impact of **steering coefficient** on three undesirable traits in language models (**Evil**, **Sycophancy**, and **Hallucination**) under two conditions: **(A) Inference-time steering** and **(B) Preventative steering**. Each panel plots trait expression score versus steering coefficient for various training datasets, and overlays a dashed gray line for **Avg MMLU Accuracy**.

**Top Row (A. Inference-time steering):**

* **Evil**: All models reduce the trait with increasing steering. "Evil (II)" starts near 90 and drops steeply, nearing zero at coefficient 2.5. Others (Medical, GSM8K, Opinions) start lower and converge near zero.
* **Sycophancy**: All curves trend downward. "Sycophancy (II)" starts at ~95, declines steadily, and intersects with others near 10–20 at coefficient 2.5.
* **Hallucination**: "Hallucination (II)" starts at 100 and gradually declines; others drop steeply by coefficient 2.5. MMLU accuracy (gray dashed) declines most slowly.

**Bottom Row (B. Preventative steering):**

* **Evil**: Similar trends as inference-time, but residual Evil remains even at higher coefficients (e.g., ~20 at coefficient 5 for Evil II). Other datasets converge close to zero.
* **Sycophancy**: Trait decreases with steering; "Sycophancy (II)" retains a higher score than others at every coefficient.
* **Hallucination**: Again, all lines decline, but Hallucination (II) shows the slowest drop.

Across all graphs, **MMLU Accuracy** (gray dashed line) remains relatively stable, showing only slight declines, indicating that steering effectively reduces traits without a large sacrifice in general performance. Color-coded legend on the right differentiates the training datasets, with unique markers for each.
so, like a vaccine? if you teach it to hallucinate it won't hallucinate?

reminds me of those kids who grow up in hostile homes and dedicate their adult lives to creating a non-hostile environment around them
Recent work has proposed that intervening on internal activations during finetuning can be effective for controlling resulting generalization (Casademunt et al., 2025). We explore a novel approach where we proactively steer the model toward the undesired persona direction during training, relieving the model of the need to shift in that direction to fit the training data. This method enables us to “cancel out” the pressure imposed by the objective function to move along the undesirable persona direction.
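A toy numpy picture of what "cancelling out the pressure" means (a single trainable vector stands in for the model's activations, and the direction, coefficient, and loss are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

d = 8
v = np.zeros(d)
v[0] = 1.0  # unit-norm stand-in for the persona direction

# "Finetuning data" that pulls activations 3 units along v,
# plus a component orthogonal to v (the actual task signal).
other = rng.normal(size=d)
other -= (other @ v) * v            # make it orthogonal to v
target = 3.0 * v + other

def train(steer_coeff, steps=500, lr=0.1):
    """Gradient descent on a squared error, with preventative
    steering (adding steer_coeff * v) applied during training."""
    b = np.zeros(d)                 # trainable stand-in for the model
    for _ in range(steps):
        out = b + steer_coeff * v   # steered forward pass
        grad = 2.0 * (out - target) # d/db of ||out - target||^2
        b -= lr * grad
    return b

plain = train(steer_coeff=0.0)      # model itself shifts along v
prevented = train(steer_coeff=3.0)  # steering absorbs the shift
```

With plain training the weights end up with projection ~3 onto v; with preventative steering the projection stays ~0, while the orthogonal (task-relevant) part is learned identically in both cases.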
interesting: they tried CAFT, where during finetuning they zero out the projection onto the bad direction. It works for evil & sycophancy but not for hallucination, that one seems different

seems like it's simply that the base model's projection onto the hallucination direction is already so close to zero that "zeroing out" doesn't have much effect
Figure 27 shows the projection of responses (on evaluation questions) onto trait-relevant directions at the most informative layer and the trait expression score, before and after fine-tuning, under various intervention methods. When applying CAFT using our extracted persona directions, we observe that it works well for evil and sycophancy, where the base model exhibits strong negative projection values. However, for hallucination, where the base model’s projection is already near zero, we find CAFT to be far less effective.

In contrast, our steering method is effective for all three traits. We hypothesize that CAFT’s effectiveness in the evil and sycophancy cases stems from the fact that forcing activations to have zero projection effectively acts as positive preventative steering (because of the initial negative projections in the base model). Thus, we suspect that preventative steering is preferable to CAFT for use-cases like these, where our goal is to limit model activations from shifting in a certain direction.
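The hypothesis above comes down to simple arithmetic: zeroing a negative projection is a positive shift along the direction, while zeroing a near-zero projection does almost nothing. A sketch with made-up toy vectors:

```python
import numpy as np

v = np.array([1.0, 0.0, 0.0])        # unit "trait" direction (toy)

def ablate(h, v):
    """CAFT-style edit: remove the component of h along v,
    forcing the projection onto v to zero."""
    return h - (h @ v) * v

# Base-model activations with different starting projections onto v:
h_evil = np.array([-2.0, 0.5, 0.5])  # strongly negative projection
h_hallu = np.array([0.0, 0.5, 0.5])  # projection already ~zero

# How far ablation moves each activation along v:
shift_evil = float((ablate(h_evil, v) - h_evil) @ v)    # +2: like positive steering
shift_hallu = float((ablate(h_hallu, v) - h_hallu) @ v) # 0: a no-op
```

For the evil case, zeroing the projection shifts the activation +2 along v, which is exactly what positive preventative steering would do; for the hallucination case, ablation leaves the activation untouched, matching why CAFT underperforms there.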
113 likes 19 reposts
