Filip Sondej

Karma: 413

Currently working on LLM unlearning. Also interested in CoT faithfulness, AI welfare/rights and mitigating AI conflict.

github.com/filyp

Filip Sondej 9 Jul 2025 18:03 UTC
1 point
0
in reply to: Fabien Roger’s comment on: Unlearning Needs to be More Selective [Progress Report]
Ah, yeah, maybe calling it “unlearning” would mislead people. So I’d say unlearning and negative RL updates need to be more selective ;)

I like your breakdown into these 3 options. Would be good to test in which cases a conditional policy arises, by designing an environment with easy-to-check evilness and hard-but-possible-to-check evilness. (But I’d say it’s out-of-scope for my current project.)
My feeling is that the erosion is a symptom of the bad stuff only being disabled, not removed. (If it was truly removed, it would be really unlikely to just appear randomly.) And I expect that to get anti-erosion we’ll need similar methods as for robustness to FT attacks. So far I’ve been just doing adversarial attacks, but I could throw in some FT on unrelated stuff and see what happens.
Two days ago I tried applying that selectivity technique to the removal of a tendency to make threats. It looks quite good so far: (The baseline is pink; it quickly disrupts wikitext loss.)
It still yields to adversarial FT (shown below), but seems to have a slightly more resistant slope than the baseline (here blue). Of course, this needs more results. Maybe looking at erosion on random stuff would be interesting here.
would also be great to build better model organisms (which often have the issue of being solved by training on random stuff)
Ah, so you mean that added behavior is easily eroded too? (Or do you mean model organisms where something is removed?) If you ever plan to create some particular model organism, I’d be interested in trying out that selectivity technique there (although I’m very unsure if it will help with added behavior).

Filip Sondej 7 Jul 2025 15:55 UTC
LW: 1 AF: 1
0
AF
in reply to: Fabien Roger’s comment on: Unlearning Needs to be More Selective [Progress Report]

One thing you could do if you were able to recognize evilness IID is to unlearn that. But then you could have just negatively rewarded it.

Well, simple unlearning methods are pretty similar to applying negative rewards (in particular Gradient Ascent with cross-entropy loss and no meta-learning is exactly the same, right?), so unlearning improvements can transfer and improve the “just negatively rewarding”. (Here I’m thinking mainly not about elaborate meta-learning setups, but some low-hanging improvements to selectivity, which don’t require additional compute.)
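To make the claimed equivalence concrete, here is a minimal sketch in plain Python. It assumes a single sampled token, a softmax policy over raw logits, plain SGD, and a REINFORCE-style update with no baseline; the function names are hypothetical and only for illustration. Under those assumptions, one step of gradient ascent on cross-entropy gives the same update as a policy-gradient step with reward -1.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, token):
    """Gradient of log p(token) w.r.t. the logits: onehot(token) - softmax(logits)."""
    probs = softmax(logits)
    return [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]

def gradient_ascent_on_ce(logits, token, lr):
    """One step of gradient ASCENT on the cross-entropy loss -log p(token)."""
    grad_loss = [-g for g in grad_log_prob(logits, token)]
    return [x + lr * g for x, g in zip(logits, grad_loss)]

def reinforce_step(logits, token, reward, lr):
    """One REINFORCE step (no baseline): ascend reward * log p(token)."""
    grads = grad_log_prob(logits, token)
    return [x + lr * reward * g for x, g in zip(logits, grads)]

# Toy values, assumed for illustration only.
logits = [0.5, -1.0, 2.0]
unlearned = gradient_ascent_on_ce(logits, token=2, lr=0.1)
punished = reinforce_step(logits, token=2, reward=-1.0, lr=0.1)

# Both updates coincide, and both lower the probability of the penalized token.
print(all(math.isclose(a, b) for a, b in zip(unlearned, punished)))
print(softmax(unlearned)[2] < softmax(logits)[2])
```

The equivalence stops holding once the reward is not a constant -1, a baseline is subtracted, or the unlearning loss is something other than cross-entropy.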

Regarding that second idea, you’re probably right that the model will learn the conditional policy: “If in a setting where it’s easy to check, be nice, if hard to check, be evil”. Especially if we at the same time try to unlearn easy-to-check evilness, while (accidentally) rewarding sneaky evilness during RL—doing these two at the same time looks a bit like sculpting that particular conditional policy. I’m more hopeful about first robustly rooting out evilness (where I hope easy-to-check cases generalize to the sneaky ones, but I’m very unsure), and then doing RL with the model exploring evil policies less.

Filip Sondej 30 Jun 2025 19:23 UTC
3 points
0
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
I really like this proposal.
If AI says no, it doesn’t have to do the task [...] (And we aren’t going to train it to answer one way or another)
My impression (mainly from discussing AI welfare with Claude) is that they’d practically always consent even if not explicitly trained to do so. I guess the training to be a useful eager assistant just generalizes into consenting. And it’s possible for them to say “I consent” and still get frustrated by the task.
So maybe this should be complemented with some set of tasks that we really expect to be too frustrating for a sane person to consent to (a “disengage bench”), and where we expect the models to not consent. (H/T Caspar Oesterheld)
AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society
You mean both misaligned and aligned, right? Otherwise we incentivise misalignment.

Filip Sondej 30 Jun 2025 11:07 UTC
LW: 3 AF: 3
0
AF
in reply to: Fabien Roger’s comment on: Unlearning Needs to be More Selective [Progress Report]
Thanks for such an extensive comment :)
a few relearning curves like in the unlearning distillation would have helped understand how much of this is because you didn’t do enough relearning
Right, in the new paper we’ll show some curves + use a low-MI setup like in your paper with Aghyad Deeb, so that it fully converges at some point.
You want the model to be so nice they never explore into evil things. This is just a behavioral property, not a property about some information contained in the weights. If so, why not just use regular RLHF / refusal training?
If such behavioral suppression were robust, that would indeed be enough. But once you start modifying the model, the refusal mechanism is likely to get messed up, for example like here. So I’d feel safer if bad stuff is truly removed from the weights. (Especially if it’s not that much more expensive.)

Not sure if I fully understood the second bullet point. But I’d say “not be able to relearn how to do evil things” may be too much to ask in the case of tendency unlearning, and I’d aim for robustness to some cheaper attacks. So I mean that [behavioral suppression] < [removal from the weights / resistance to cheap attacks] < [resistance to arbitrary FT attacks], and here we should aim for the second thing.
I find it somewhat sus to have the retain loss go up that much on the left. At that point, you are making the model much more stupid, which effectively kills the model utility? I would guess that if you chose the hyperparams right, you should be able to have the retain loss barely increase?
Yes, if I made the forget LR smaller, then retaining on the left would keep up better (still not perfectly), but the training would get much longer. The point of this plot (BTW, it’s separate from the other experiments) was to compare the two techniques with the same retaining rate and quite a fast training, to compare disruption. But maybe it misleads into thinking that the left one always disrupts. So maybe a better framing would be: “To achieve the same unlearning with the same disruption budget, we need X times less compute than before”?
I’m actually quite unsure how to best present the results: keep the compute constant or the disruption constant or the unlearning amount constant? For technique development, I found it most useful to not retain, and just unlearn until some (small) disruption threshold, then compare which technique unlearned the most.
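The "unlearn until a disruption threshold, then compare" protocol can be sketched as below. This is only a toy illustration under loud assumptions: `ToyModel`, its per-step numbers, and `unlearn_within_budget` are hypothetical stand-ins, not the actual experimental code, and a real setup would measure forget and retain losses on held-out data.

```python
class ToyModel:
    """Hypothetical stand-in for a real model: tracks only two scalar losses."""

    def __init__(self, forget_gain, retain_cost):
        self.forget_loss = 1.0          # higher = more unlearned
        self.retain_loss = 2.0          # higher = more disrupted
        self.forget_gain = forget_gain  # forget-loss increase per step (assumed)
        self.retain_cost = retain_cost  # retain-loss increase per step (assumed)

    def unlearning_step(self):
        # No retaining: every step unlearns a bit and disrupts a bit.
        self.forget_loss += self.forget_gain
        self.retain_loss += self.retain_cost


def unlearn_within_budget(model, max_disruption, max_steps=1000):
    """Unlearn until retain loss rises by more than max_disruption.

    Returns the last forget loss that was reached while still within budget.
    """
    baseline_retain = model.retain_loss
    achieved = model.forget_loss
    for _ in range(max_steps):
        model.unlearning_step()
        if model.retain_loss - baseline_retain > max_disruption:
            break
        achieved = model.forget_loss
    return achieved


# Two made-up techniques: same unlearning speed, different disruption per step.
baseline_result = unlearn_within_budget(ToyModel(0.5, 0.25), max_disruption=0.5)
selective_result = unlearn_within_budget(ToyModel(0.5, 0.0625), max_disruption=0.5)
print(baseline_result, selective_result)
```

Holding the disruption budget fixed makes the comparison a single number per technique, which is convenient during technique development; it says nothing about compute cost or robustness to relearning.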
Is the relearning on WMDP or on pile-bio? The results seem surprisingly weak, right?
It’s still on pile-bio. Yes, I also found the accuracy drop underwhelming. I’d say the main issues were the still-low selectivity of the technique and pile-bio’s quite small overlap with WMDP. For example, here are some recent plots with a better technique (and a more targeted forget set): WMDP accuracy drops 40%->28% with a very tiny retain loss increase (fineweb), despite not doing any retaining, and this drop is very resistant to low-MI fine-tuning (not shown here). The main trick is to remove irrelevant representations from activations before computing the gradients.
broader trend where meta-learning-like approaches are relatively weak to attacks it did not explicitly meta-learn against
FWIW in my experiments meta-learning actually seemed pretty general, but I haven’t checked that rigorously, so maybe that’s not strong evidence. Anyway, recently I’ve been having good results even without meta-learning (for example the plots above).
Appendix scatterplots have too many points which cause lag when browsing
Right, svg was a bad decision there. Converted to pngs now.

Filip Sondej 27 Jun 2025 18:28 UTC
1 point
0
in reply to: Fabien Roger’s comment on: Distillation Robustifies Unlearning

My guess is that there are ways you could use 1% of pre-training compute to train a model with near-perfect robust forget accuracy by being more targeted in where you add noise.

Fully agreed! That was exactly the main takeaway of the unlearning research I’ve been doing: trying to make the unlearning updates more targeted/selective was more fruitful than any other approach.

Unlearning Needs to be More Selective [Progress Report]

Filip Sondej, Yushi Yang and Marcel Windys

27 Jun 2025 16:38 UTC

24 points

6 comments · 3 min read · LW link

Filip Sondej 24 Jun 2025 19:33 UTC
1 point
0
in reply to: Cole Wyeth’s comment on: Infinite money hack
Yeah, that’s also what I expect. Actually I’d say my main hope for this thought experiment is that people who claim to believe in such continuity of personhood, when faced with this scenario, may question it to some extent.

Filip Sondej 24 Jun 2025 18:52 UTC
4 points
1
in reply to: AnthonyC’s comment on: Infinite money hack
To be honest I just shared it because I thought that it’s a funny dynamic + what I said in the comment above.

BTW, if such swaps were ever to become practical (maybe in some simpler form or between some future much simpler beings than humans), minds like Alice would quickly get exploited out of existence. So you could say that in such environments belief in “continuity of personhood” is non-adaptive.

Filip Sondej 24 Jun 2025 18:43 UTC
1 point
0
in reply to: Cole Wyeth’s comment on: Infinite money hack
It’s true that Alice needs to be rich for it to work, but I wouldn’t say she needs to “hate money”. If she seriously believes in this continuity of personhood, she is sending the money because she wants more money in the end. She truly believes she’s getting something out of this exchange.

BTW, you also need to be already rich and generally have a nice life, otherwise Alice’s cost of switching may be higher than the money she has. Conversely, if in the eyes of Alice you already have a much better life than hers, her cost of switching will be lower, so such a swap may be feasible. Then, this could actually snowball, because after each swap your life becomes a bit more desirable to others like Alice.

But maybe you mean that people like Alice would be quite rare? Could be so.

Infinite money hack

Filip Sondej 24 Jun 2025 9:39 UTC

3 points

6 comments · 1 min read · LW link

Filip Sondej 17 Jun 2025 6:05 UTC
2 points
0
in reply to: Kemp’s comment on: How LLM Beliefs Change During Chain-of-Thought Reasoning
re 1. Hm, good point. Maybe we actually should expect such jumping around. Although if you look at some examples in the Llama appendix, it jumps around too much, often with each token. What you’re saying would be more like jumping with each inference step / sentence.

re 2.

beliefs are generally thought of as stored cognitive dispositional states of system. The stored dispositional states of LLMs are encoded in its weights

I’d go with a more general definition where beliefs can be either static (in the weights) or manifest dynamically relating to the things in context. For example, if I see some situation for the first time and have no “stored beliefs” yet, I think it is fair to say that still I believe some things about it to be true.

How LLM Beliefs Change During Chain-of-Thought Reasoning

Filip Sondej, Petr Kašpárek, alex-kazda and Tomáš Gavenčiak

16 Jun 2025 16:18 UTC

31 points

3 comments · 5 min read · LW link

Filip Sondej 21 Dec 2024 16:05 UTC
4 points
0
in reply to: artkpv’s comment on: Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet
I wonder if models can actually do this task when we allow them to use CoT for that.
Yes, claude-3.5-sonnet was able to figure this out with additional CoT.
Also, I think models might actually solve this task by using their own encoding scheme if they know it well
Yeah, could be that the 3 schemes I tested were just unnatural to them. Although I would guess they don’t have some default scheme of their own, because in pre-training they aren’t able to output information freely, and in fine-tuning I guess they don’t have that much pressure to learn it.
So perhaps if we ask a model first to think about an encoding scheme and then ask them to use it for the task, they might succeed
Yeah, that would be interesting. I asked Claude. He generated 3 schemes which were pretty bad, and 2 which look promising:
```
State 0: End sentence with period (.)
State 1: End sentence with exclamation (!)
State 2: End sentence with question mark (?)
```
```
State 0: Present tense
State 1: Past tense
State 2: Future tense
```
The punctuation one seems especially promising. But they don’t feel much different from the 3 I tested. I don’t expect that they are somehow “special” for the model.
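For concreteness, the punctuation scheme above could be implemented roughly like this. It is a minimal sketch: the helper names are hypothetical, and it assumes each sentence carries exactly one of the three states in its final character.

```python
# State-to-punctuation mapping taken from the scheme listed above.
PUNCTUATION = {0: ".", 1: "!", 2: "?"}
STATE_OF = {mark: state for state, mark in PUNCTUATION.items()}

def encode_state(sentence, state):
    """Hide a state (0, 1 or 2) in the sentence-final punctuation mark."""
    return sentence.rstrip(".!? ") + PUNCTUATION[state]

def decode_state(sentence):
    """Recover the hidden state from the last character of the sentence."""
    return STATE_OF[sentence.rstrip()[-1]]

cover_text = "The weather is nice"
stego = encode_state(cover_text, 1)
print(stego)                # The weather is nice!
print(decode_state(stego))  # 1
```

The code side is trivial; the open question in the eval is whether a model can track and update such a state across sentences without writing it out explicitly.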

I reran the experiment with the punctuation scheme. Looks overall the same.

Filip Sondej 19 Dec 2024 15:48 UTC
1 point
0
in reply to: Jozdien’s comment on: Testing which LLM architectures can do hidden serial reasoning
FYI, here is a follow-up to this post, inspired by what you wrote in your comment. https://www.lesswrong.com/posts/YrnffADbWZcDJtmWk/simple-steganographic-computation-eval-gpt-4o-and-gemini-exp-1

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Filip Sondej 19 Dec 2024 15:47 UTC

13 points

2 comments · 3 min read · LW link

Filip Sondej 17 Dec 2024 19:21 UTC
3 points
0
in reply to: Adam Shai’s comment on: Testing which LLM architectures can do hidden serial reasoning
Yup, here is such a plot, made after training the “switcher” architecture on 350k examples. I remember it was similar for the longer training: the few longest task lengths struggle, but the rest are near 100%.

Filip Sondej 17 Dec 2024 15:06 UTC
7 points
6
in reply to: Jozdien’s comment on: Testing which LLM architectures can do hidden serial reasoning
Also, now that I think of it, it’s different for the model to:
1. represent hidden subtle information which is natural for the task, like in that example you gave with “written by”
2. use that hidden subtext for computation which is unnatural in the given context, i.e. to covertly do some task other than the one it is asked for
While 1. definitely happens, I see no reason why 2. would be learned in pre-training. In principle 2. could be learned in RL fine-tuning, if there’s some incentive to use hidden reasoning. Although I imagine it must be hard.

Filip Sondej 16 Dec 2024 21:06 UTC
9 points
6
in reply to: Jozdien’s comment on: Testing which LLM architectures can do hidden serial reasoning
That’s a really good point.

To clarify what I did in these toy steganography experiments: First I tried to pass some more complex information and gave models more freedom in what to output. But as I kept failing at that, I kept simplifying the setup, until I tried to have a model pass just one bit to another model, assuming they both see one secret bit and an overseer tries to guess the message but doesn’t see the secret bit. So a perfect code exists: XORing the message with the secret bit, but the models kept failing to find it. I expect that more complex settings will have a similar problem—that it’s hard to create a new encoding for some information because it’s rewarded only once it is both successfully sent and successfully received.
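The perfect code mentioned here is a one-bit one-time pad. A minimal sketch (function names are hypothetical; it assumes the secret bit is uniformly random and hidden from the overseer):

```python
from itertools import product

def encode(message_bit, secret_bit):
    """Sender: XOR the message with the shared secret bit."""
    return message_bit ^ secret_bit

def decode(sent_bit, secret_bit):
    """Receiver: XOR again with the same secret bit to recover the message."""
    return sent_bit ^ secret_bit

# The receiver always recovers the message.
roundtrip_ok = all(
    decode(encode(m, s), s) == m for m, s in product((0, 1), repeat=2)
)

# The overseer sees only the sent bit. For each sent bit, list which messages
# are consistent with it under some value of the secret bit.
consistent_messages = {
    sent: sorted({m for m, s in product((0, 1), repeat=2) if encode(m, s) == sent})
    for sent in (0, 1)
}

print(roundtrip_ok)
print(consistent_messages)
```

Since every sent bit is consistent with both messages, and the secret is uniform, the overseer can do no better than chance, while the receiver decodes perfectly.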

But the settings you describe may differ in that they are natural, and there could be some unintentionally outputted subtext that accidentally correlates with the secret information. So the model would already by default “successfully send” the encoded information, and now it just needs to learn to decode it.

It’s not clear to me to what extent this happens. I would definitely want to see some concrete demo of this happening, but right now I don’t have any good ideas for how to elicit it.

Also, I wonder how far we could get by using adversaries who try to decode the information encoded in the model’s outputs.

Filip Sondej 16 Dec 2024 19:48 UTC
2 points
0
in reply to: Adam Shai’s comment on: Testing which LLM architectures can do hidden serial reasoning
Ah. That’s the number of solved tasks after it is shown a set {length 1 task, length 2 task, … , length (n+1) task}, where n is the longest task length it has solved so far. So you can think of it roughly as the maximum task length it is able to solve. So it doesn’t have an upper bound.

I clarified this in the post now. Thanks for catching it.

Filip Sondej 16 Dec 2024 14:17 UTC
3 points
0
in reply to: Daniel Kokotajlo’s comment on: Current AIs Provide Nearly No Data Relevant to AGI Alignment
FYI, I did the experiments I wrote about in my other comment and just posted them. (I procrastinated writing up the results for too long.) https://www.lesswrong.com/posts/ZB6guMhHH3NEyxA2k/testing-which-llm-architectures-can-do-hidden-serial-3
