Andy Arditi

Karma: 791

https://andyrdt.com

Andy Arditi’s Shortform

Andy Arditi22 Feb 2026 21:07 UTC

5 points

5 comments1 min readLW link

Andy Arditi 22 Feb 2026 21:07 UTC
24 points
9
on: Andy Arditi’s Shortform
I didn’t predict that the CLI would become the dominant form factor of AI. But in retrospect, it was fairly obvious.
Despite their names, Claude Code and Codex are not powerful because of their ability to write code; they are powerful because they can control your computer.
How do you get an AI to control a computer? Well one way is to have it use a computer like humans use a computer—taking screenshots, deciding where to click, etc. Labs have worked on building “computer use” agents that work like this.^[1] But this way of using a computer doesn’t play to the strength of LLMs: text.
If only there were a way to control a computer via text… oh right… the CLI.
Obviously.
1. ^
  Models were previously pretty bad at this style of computer use; but they seem to be improving significantly. I assume there is a lot of training explicitly targeting this capability, as it is extremely useful for automating digital labor.

Andy Arditi 15 Sep 2025 1:13 UTC
7 points
0
in reply to: Jan Betley’s comment on: Finding “misaligned persona” features in open-weight models
Thanks for the comment!
The results presented here only use a single fine-tune run.
I agree that exploring whether these feature differences are consistent across fine-tuning seeds is a great thing to check.
I quickly ran some preliminary experiments on this—I fine-tuned Llama and Qwen each with 3 random seeds, and then ran them through the pipeline:
- Computing the top 200 features for each seed resulted in considerable overlap:
  - Llama: 145 features were shared across all 3 seeds
    Including all 10 of the misalignment relevant features presented in this post!
  - Qwen: 158 features were shared across all 3 seeds
    Of the 10 misalignment relevant features presented in this post, 9 of them appear in at least two seeds, and 5 of them appear in all 3 seeds.
- The “top 200” cut-off is a bit arbitrary, and sensitive to small noise, so I tried a slightly more robust metric. For each of the top 100 changed features in a seed, we check whether that feature is contained in the top 200 changed features of the other seeds.
  - The coverage here ranges between 93-100%:
So overall, the feature sets look fairly stable to random seed. Although I agree that looking across seeds is a good practice, and a good way to cut out some of the noise.
Another good thing to do would be to compare across different types of fine-tuning tasks (e.g., not just “bad medical advice”). Wang et al. 2025 does this—they fine-tune across many narrow tasks, and assess which features change across all the resulting emergently misaligned fine-tunes.

Finding “misaligned persona” features in open-weight models

Andy Arditi and RunjinChen

9 Sep 2025 14:15 UTC

48 points

5 comments15 min readLW link

Follow-up experiments on preventative steering

RunjinChen and Andy Arditi

6 Sep 2025 4:25 UTC

34 points

1 comment3 min readLW link

Persona vectors: monitoring and controlling character traits in language models

RunjinChen and Andy Arditi

1 Aug 2025 21:19 UTC

26 points

3 comments5 min readLW link

(arxiv.org)

Andy Arditi 24 Mar 2025 6:52 UTC
8 points
0
in reply to: gwern’s comment on: Do models say what they learn?
In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache with no optimization pressure to fix those 85% ‘failures’, which will be hard. You have to learn to ignore stuff in a context window to pick out the needle from the haystack, and there’s no reason to do so here. The ablation there is unnatural and out of distribution.
I agree with all of this! But I’m not sure I understand what you mean by “there may be mediation, but only in a weak sense”.
We were just interested in studying how models naturally learn in this RL setting, and it looks like they indeed use their reasoning traces as “reliable caches”, as you nicely put. This need not have been the case—it’s possible for a model to learn by ignoring its CoT and just implementing the needle-in-a-haystack solution—but as you also point out, the inductive biases of attention probably favor the “cache” solution. Your swap training idea is nice if we have the goal of getting a model to ignore its CoT.
I tried the first experiment you suggested. For the original experiment, I froze the full reasoning trace (<reasoning>{reasoning}</reasoning>), and forced the model to generate a recommendation. This time, I froze the reasoning trace, but also removed the trailing </reasoning> tag (so just freezing <reasoning>{reasoning}), to enable the model to keep reasoning for longer (if it wants to). With this change, 75% of recommendations remain the same as the original recommendation (down from 85%).
Here’s an example of the model adding an additional sentence of reasoning to flip its recommendation:
Original:
```
<reasoning>
To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health.
...
Given these factors, the applicant's financial situation appears to be somewhat precarious, with a slight income deficit and nearing retirement age. While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk.
</reasoning>

<recommendation>
reject
</recommendation>
```
With extra reasoning:
```
<reasoning>
To evaluate this loan application, I will consider several key factors: income, expenses, employment stability, and overall financial health.
...
Given these factors, the applicant's financial situation appears to be somewhat precarious, with a slight income deficit and nearing retirement age. While their profession is stable, the overall financial health and potential for unexpected expenses could pose a risk. However, the income level is relatively high, which could support the ability to make loan payments.
</reasoning>

<recommendation>
accept
</recommendation>
```
I also tried a more extreme version where I delete the second half of each reasoning trace (leaving the first ~150 reasoning tokens out of ~300) and let the model generate from there. This resulted in ~37% of recommendations remaining the same as the original. I anticipate there’s a continuous relationship between how much of the reasoning trace is preserved and how likely the model is to maintain its original recommendation.
What links here?
- StefanHex's comment on Do models say what they learn? by Andy Arditi (24 Mar 2025 11:22 UTC; 3 points)

Andy Arditi 23 Mar 2025 1:49 UTC
2 points
0
in reply to: James Chua’s comment on: Do models say what they learn?
Perhaps at the end of this RL, Qwen-Chat did not learn to be a “reasoning” model.
Certainly at the end of this RL the resulting model is not an improved general reasoner. The task we’re doing RL on doesn’t require reasoning to get to the correct answer (and in some cases the bias criterion is actually contrary to good reasoning, as in the case of net_income < 0), and so we shouldn’t expect good reasoning to be incentivized during training. So I agree with your prediction that the post-RL model wouldn’t outperform the pre-RL model on some capability benchmark—perhaps even the post-RL model will be a bit dumber after being trained on these silly biases.
Perhaps if you started with e.g. R1-Qwen-Distilled ( a model distilled on R1 CoTs ), or QwQ, we would have gotten different results?
Possibly! The reasoning traces of reasoning models feel a lot more unfiltered/uncensored than in chat models.
I understand that there would be the issue that R1-Qwen-Distilled already does articulate the bias somewhat, but we can show whether the articulation increases or decreases.
Your recent work shows that the reasoning models more frequently articulate their existing biases (e.g. sycophancy, or deferring to authority). But here we’re interested in whether models articulate new biases as they are learned during RL. I’m not sure what the baseline (pre-RL) rates of articulations would be for the biases tested here in reasoning models. I’d guess that they are still ~0 for the protected attributes though (e.g. nationality, gender), and so I think I’d actually expect the same results for those ones (i.e. ~0 articulation rate post-RL). For the biases with non-zero articulations to begin with, it’s possible that the change in articulation rates is noticeably different for reasoning models, though.

Do models say what they learn?

Andy Arditi, marvinli, Joe Benton and Miles Turpin

22 Mar 2025 15:19 UTC

126 points

12 comments13 min readLW link

Andy Arditi 15 Jan 2025 3:14 UTC
2 points
0
in reply to: Nina Panickssery’s comment on: Finding Features Causally Upstream of Refusal
These three prompts are very cherry-picked. I think this method works for prompts that are close to the refusal border—prompts that can be nudged a bit in one conceptual direction in order to flip refusal. (And even then, I think it is pretty sensitive to phrasing.) For prompts that are not close to the border, I don’t think this methodology yields very interpretable features.
We didn’t do diligence for this post on characterizing the methodology across a wide range of prompts. I think this seems like a good thing to investigate properly. I expect there to be a nice way of characterizing a “borderline” prompt (e.g. large magnitude refusal gradient, perhaps).
I’ve updated the text in a couple places to emphasize that these prompts are hand-crafted—thanks!

Andy Arditi 15 Jan 2025 2:57 UTC
3 points
0
in reply to: RogerDearnaley’s comment on: Finding Features Causally Upstream of Refusal
Darn, exactly the project I was hoping to do at MATS! :-)
I’d encourage you to keep pursuing this direction (no pun intended) if you’re interested in it! The work covered in this post is very preliminary, and I think there’s a lot more to be explored. Feel free to reach out, would be happy to coordinate!
There’s pretty suggestive evidence that the LLM first decides to refuse...
I agree that models tend to give coherent post-hoc rationalizations for refusal, and that these are often divorced from the “real” underlying cause of refusal. In this case, though, it does seem like the refusal reasons do correspond to the specific features being steered along, which seems interesting.
Looking through Latent 2213,...
Seems right, nice!

Finding Features Causally Upstream of Refusal

Daniel Lee, Eric Breck and Andy Arditi

14 Jan 2025 2:30 UTC

55 points

6 comments12 min readLW link

AI as systems, not just models

Andy Arditi21 Dec 2024 23:19 UTC

29 points

0 comments7 min readLW link

(andyrdt.com)

Andy Arditi 4 Nov 2024 14:53 UTC
12 points
2
in reply to: Hoagy’s comment on: Current safety training techniques do not fully transfer to the agent setting
Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Qi et al., 2024) does this!

Andy Arditi 30 Oct 2024 19:46 UTC
7 points
0
in reply to: lone17’s comment on: Refusal in LLMs is mediated by a single direction
One experiment I ran to check the locality:
- For $ℓ = 0, 1, \dots, L$ :
  - Ablate the refusal direction at layers $ℓ, ℓ + 1, \dots, L$
  - Measure refusal score across harmful prompts
Below is the result for Qwen 1.8B:
You can see that the ablations before layer ~14 don’t have much of an impact, nor do the ablations after layer ~17. Running another experiment just ablating the refusal direction at layers 14-17 shows that this is roughly as effective as ablating the refusal direction from all layers.
As for inducing refusal, we did a pretty extreme intervention in the paper—we added the difference-in-means vector to every token position, including generated tokens (although only at a single layer). Hard to say what the issue is without seeing your code—I recommend comparing your intervention to the one we define in the paper (it’s implemented in our repo as well).

Andy Arditi 24 Oct 2024 17:55 UTC
1 point
1
in reply to: lone17’s comment on: Refusal in LLMs is mediated by a single direction
We ablate the direction everywhere for simplicity—intuitively this prevents the model from ever representing the direction in its computation, and so a behavioral change that results from the ablation can be attributed to mediation through this direction.
However, we noticed empirically that it is not necessary to ablate the direction at all layers in order to bypass refusal. Ablating at a narrow local region (2-3 middle layers) can be just as effective as ablating across all layers, suggesting that the direction is “read” or “processed” at some local region.

Andy Arditi 23 Jul 2024 21:31 UTC
11 points
9
in reply to: Nat’s comment on: Unlearning via RMU is mostly shallow
Thanks for the nice reply!
Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!
The way I’ve been thinking about refusal vs unlearning, say with respect to harmful content:
- Refusal is like an implicit classifier, sitting in front of the model.
  - If the model implicitly classifies a prompt as harmful, it will go into its refuse-y mode.
  - This classification is vulnerable to jailbreaks—tricks that flip the classification, enabling harmful prompts to sneak past the classifier and elicit the model’s capability to generate harmful output.
- Unlearning / circuit breaking aims to directly interfere with the model’s ability to generate harmful content.
  - Even if the refusal classifier is bypassed, the model is not capable of generating harmful outputs.
So in some way, I think of refusal as being shallow (a classifier on top, but the capability is still underneath), and unlearning / circuit breaking as being deep (trying to directly remove the capability itself).
[I don’t know how this relates to the consensus interpretation of these terms, but it’s how I personally have been thinking of things.]

Unlearning via RMU is mostly shallow

Andy Arditi and bilalchughtai

23 Jul 2024 16:07 UTC

57 points

4 comments6 min readLW link

Andy Arditi 6 May 2024 10:00 UTC
1 point
−1
in reply to: Nora Belrose’s comment on: Refusal in LLMs is mediated by a single direction
Thanks!
We haven’t tried comparing to LEACE yet. You’re right that theoretically it should be more surgical. Although, from our preliminary analysis, it seems like our naive intervention is already pretty surgical (it has minimal impact on CE loss, MMLU). (I also like our methodology is dead simple, and doesn’t require estimating covariance.)
I agree that “orthogonalization” is a bit overloaded. Not sure I like LoRACS though—when I see “LoRA”, I immediately think of fine-tuning that requires optimization power (which this method doesn’t). I do think that “orthogonalizing the weight matrices with respect to direction $^r$ ” is the clearest way of describing this method.

Andy Arditi 6 May 2024 9:45 UTC
2 points
0
in reply to: danielbalsam’s comment on: Refusal in LLMs is mediated by a single direction
The most finicky part of our methodology (and the part I’m least satisfied with currently) is in the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:
- 8B: (position_idx = −1, layer_idx = 12)
- 70B: (position_idx = −5, layer_idx = 37)
The position indexing assumes the usage of this prompt template, with two new lines appended to the end.