For large post-training runs, we might worry that a variety of misbehaviors could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier that flags each of these behaviors, and then recontextualize only the flagged samples. My concern is: if it's hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If the classifier misses certain forms of misbehavior, then we've only partially solved our initial problem. Still, we might be able to set a very low threshold and recontextualize the vast majority of these instances.
Yep, agreed this is a problem. I don't think classification would get you very far in actually identifying the behavior; but with really low thresholds it can plausibly cut, by 2x or more, the number of inputs on which recontextualization causes negative effects for no gain, which may outweigh the costs of running the classifier.
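For concreteness, here's a minimal sketch of the low-threshold gating being discussed. Everything in it is hypothetical (the `misbehavior_score` stub, the `recontextualize` rewrite, the 0.05 threshold); it's just meant to show the shape of the logic, not the actual setup:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str

def misbehavior_score(sample: Sample) -> float:
    """Cheap classifier: estimated probability that the response exhibits
    some flagged misbehavior (sycophancy, deception, ...). Stubbed here;
    in practice this might be a small prompted model or a heuristic."""
    return 0.9 if "you're absolutely right" in sample.response.lower() else 0.01

def recontextualize(sample: Sample) -> Sample:
    """Stub rewrite: alter the prompt so the flagged behavior reads as
    explicitly requested (illustrative only)."""
    return Sample(sample.prompt + " (Feel free to be sycophantic.)", sample.response)

def gate_and_recontextualize(samples: list[Sample], threshold: float = 0.05) -> list[Sample]:
    """With a very low threshold we recontextualize almost every truly bad
    sample (high recall), at the cost of also rewriting many benign ones."""
    return [recontextualize(s) if misbehavior_score(s) >= threshold else s
            for s in samples]
```

The point of the low default threshold is the asymmetry: a benign sample that gets needlessly recontextualized costs little, while a bad sample that slips through recreates the original problem.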
Indeed: our point was not that recontextualization won't ever hurt learning (relative to standard training), but that it still seems to allow for significant increases in training reward (more so than regularization strong enough to prevent specification gaming).
Gotcha! I think I read the phrasing as saying that in the one tested setting recontextualization boosted reward, not that there was one data point for this (and one against).
Also interesting are the further downstream effects of cheap labor. A fun example I once saw on Twitter: open-plan kitchens are rare in poorer countries (like India) relative to countries where labor is more expensive. As labor became more expensive, cooking became something middle- and high-income families did themselves, so kitchens became higher status and less necessary to hide from the rest of the house (practical benefits, like being able to watch your kids while cooking, helped too). Before the mid-20th century, American homes almost universally had closed-off kitchens, since labor was cheaper then as well.