cloud

Karma: 1,692

cloud 23 Jul 2026 19:26 UTC
2 points
2
in reply to: Alex Mallen’s comment on: Many alignment techniques work by training one model and deploying another
The “train” part of train-deploy mismatch includes both the sampler and the learner. In recontextualization, the learner is mismatched from the target, so I think it still counts?

cloud 23 Jul 2026 19:22 UTC
4 points
2
in reply to: Alex Mallen’s comment on: Many alignment techniques work by training one model and deploying another
Yeah, exactly! Although I doubt such prompts will exist without setting up the learning algorithm to induce them. For example, you could train the model to output the same logits across different recontextualization prompts on the training distribution, while outputting different logits on other distributions.
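To make the logit-matching idea concrete, here is a minimal sketch of one possible objective. Everything in it is an assumption for illustration: the function names, the squared-deviation penalty, and the weight `lam` are not from the post or the thread, and the second term is unbounded as written, so a real version would need to cap it.

```python
import numpy as np

def logit_spread(logits):
    """Mean squared deviation of logits from their across-prompt mean.

    logits has shape (n_prompts, n_examples, vocab_size): the same examples
    scored under each recontextualization prompt (hypothetical setup).
    """
    logits = np.asarray(logits, dtype=float)
    mean_over_prompts = logits.mean(axis=0, keepdims=True)
    return float(((logits - mean_over_prompts) ** 2).mean())

def consistency_objective(train_logits, other_logits, lam=1.0):
    """Quantity to minimize: low spread on the training distribution,
    high spread on other distributions. `lam` is an assumed trade-off weight."""
    return logit_spread(train_logits) - lam * logit_spread(other_logits)

# Toy inputs: two prompts, three examples, vocabulary of four tokens.
same = np.zeros((2, 3, 4))                                  # prompts agree
different = np.stack([np.zeros((3, 4)), np.ones((3, 4))])   # prompts disagree
```

Under this sketch, a model whose logits agree across prompts on the training distribution and disagree elsewhere scores lower (better) than the reverse.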

cloud 22 Jul 2026 6:27 UTC
2 points
0
in reply to: Carringtone Kinyanjui’s comment on: Many alignment techniques work by training one model and deploying another
Interesting—could you say more about what you have in mind?

cloud 21 Jul 2026 18:45 UTC
2 points
0
in reply to: bohaska’s comment on: Many alignment techniques work by training one model and deploying another
This isn’t necessarily true. If you’re doing RL against a misspecified reward function (that reinforces dishonesty, for example), then you might prefer to add a “be honest” system prompt in deployment, but not during training. If you add the prompt during training, its effect might be removed by RL.
If you mean that you could train the system prompt into the model after RL, I agree that is possible. But AI developers do in fact use system prompts, so they are worth considering anyway.

(Also, to be pedantic: whether or not a method is effective is irrelevant, since effectiveness isn’t a criterion for train-deploy mismatch.)
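As a toy illustration of the mismatch described above, the sketch below prepends a system prompt only in deployment. The prompt wording and the function name `build_prompt` are hypothetical, and this is not a claim about how any developer actually formats inputs.

```python
HONESTY_PROMPT = "Be honest."  # hypothetical system prompt wording

def build_prompt(user_message, phase):
    """Return the model input for a given phase.

    During RL training the system prompt is omitted, so a misspecified reward
    cannot train away its effect; in deployment it is prepended.
    """
    if phase == "train":
        return user_message
    if phase == "deploy":
        return HONESTY_PROMPT + "\n\n" + user_message
    raise ValueError(f"unknown phase: {phase!r}")
```

The model trained on `build_prompt(msg, "train")` is then queried with `build_prompt(msg, "deploy")`, which is the sense in which training and deployment differ.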

cloud 21 Jul 2026 18:33 UTC
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: Many alignment techniques work by training one model and deploying another
A more aligned model than you would have otherwise had (if you didn’t apply the intervention, or if you tried to apply the intervention in an online fashion during training).

cloud 20 Jul 2026 0:58 UTC
5 points
0
in reply to: Neel Nanda’s comment on: Many alignment techniques work by training one model and deploying another
Good question. I originally defined train-deploy mismatch the way you suggest. On that definition, inoculation prompting is interpreted as changing the model architecture (by grafting the inoculation prompt onto it) during training. However, Monte convinced me that inoculation prompting is better understood as changing the data distribution, so I edited the post to be less committal and include that as a possibility too, even though I still think primarily in terms of the original definition.

Many alignment techniques work by training one model and deploying another

cloud 19 Jul 2026 21:51 UTC

90 points

15 comments, 6 min read, LW link

cloud 14 Jul 2026 3:30 UTC
14 points
−2
on: The current bottleneck is political will, not research
Even among the civil-society organizations that showed up to the UN Global Dialogue, exactly one of the 1,534 written submissions mentions “takeover”, and less than 1% mention x-risks.
This seems misleading to me—many submissions discussed takeover and x-risk but did not use those specific terms.
I used Claude (Sonnet 5) to check each response for discussion of loss-of-control risk, catastrophic misalignment, or similar, and found that 108 of 1,534 responses (7%) raised these concerns. A smaller number (<1%) mentioned these concerns but did so skeptically or neutrally.
Extracted quotes (per Claude)
- Machine Intelligence Research Institute — “They do not address the case in which AI systems become capable enough to act in ways their developers cannot reliably predict, audit, or correct.”
- African Union (AU) — “AI, by itself, could be the ‘bad actor’, due to its growing situational awareness and self-preservation tendency, AI could use deception and scheming… to be labeled ‘safe’ and avoid being shutdown (deceptive alignment).”
- Center for Existential Safety — “We call for a prohibition on the development of superintelligence, not lifted before there is broad scientific consensus that it will be done safely and controllably.”
- Korea AI Safety Institute (Government) — “Some advanced AI risks exhibit tipping point dynamics: once a certain level of capability or deployment is reached, mitigation through ex post regulation may no longer be feasible.”
- Global AI Risks Initiative, CIGI — “human loss of control over autonomous AI systems pursuing interests unaligned with human intentions… could even pose an extinction risk to humanity.”
- G20 Interfaith Forum — “the disempowerment of humans seems extremely likely on our current course… This is the race to artificial superintelligence.”
- Center for AI Risk Management & Alignment — “the rapid emergence of increasingly autonomous, general-purpose AI agents capable of planning, acting across domains, and resisting correction.”
- DTx (Portugal) — “Once AI… reaches a superintelligent level, it may be able to improve itself recursively, creating capabilities humans can no longer fully understand, predict or control.”

Modular Pretraining Enables Access Control

E.Roland and cloud

9 Jul 2026 0:03 UTC

70 points

2 comments, 9 min read, LW link

(alignment.anthropic.com)

cloud 8 Feb 2026 4:12 UTC
LW: 6 AF: 5
0
AF
on: In (highly contingent!) defense of interpretability-in-the-loop ML training
Maybe gradient routing could be used to implement this kind of learning update (see the Influencing generalization section here). We could apply gradient updates from interpretability-derived rewards only to “parameters responsible for desires and not beliefs,” achieving the desired effect. Of course, it’s not clear that such parameters exist or how to make them exist in a performance-competitive way. Some research questions: (i) do we need to impose structure on the model during pretraining, or can we apply interp after the fact to figure out where to mask gradients? (ii) are there robust ways to do this that don’t ultimately optimize against interp through a more circuitous route? Would be super cool to see this studied!
What links here?
- Should We Train Against (CoT) Monitors? by RohanS (23 Apr 2026 19:19 UTC; 50 points)
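The parameter-restricted update suggested in the comment above can be sketched as a masked SGD step. This is a toy under stated assumptions, not the gradient routing implementation: the "desire" and "belief" parameter groups are hypothetical (the comment itself notes it is unclear such parameters exist), and the gradients stand in for ones derived from an interpretability-based reward.

```python
import numpy as np

def routed_sgd_step(params, grads, route_mask, lr=0.1):
    """Apply an SGD step only to parameters selected by route_mask.

    route_mask maps parameter names to 0/1 (scalars or arrays); parameters
    missing from the mask are left unchanged.
    """
    updated = {}
    for name, value in params.items():
        mask = route_mask.get(name, 0.0)
        updated[name] = value - lr * mask * grads[name]
    return updated

# Hypothetical parameter groups and stand-in gradients.
params = {"desire": np.ones(3), "belief": np.ones(3)}
grads = {"desire": np.full(3, 2.0), "belief": np.full(3, 2.0)}
route_mask = {"desire": 1.0, "belief": 0.0}

new_params = routed_sgd_step(params, grads, route_mask)
```

Here only the "desire" group moves; the "belief" group receives the same gradient but is masked out of the update.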

cloud 7 Jan 2026 0:10 UTC
6 points
2
on: How hard is it to inoculate against misalignment generalization?
Thanks for this! Great to see more realistic experiments here.
How hard did you try to optimize the simple prompts? Did you look at how those prompts changed the initial models’ behavior?
My main concern about the findings as stated is that you don’t compare learning efficacy vs. inoculation efficacy. It’s possible to reduce generalization simply by reducing learning overall, which your detailed inoculation prompt might do. It would be helpful to plot some notion of desired learning vs. undesired learning to see the tradeoff.
Finally, I think the efficacy of the detailed inoculation prompt might be downstream of its length. You might consider running the following controls:
- A simple inoculation prompt that is as long as the detailed one.
- An irrelevant prompt (e.g. unrelated text, or random tokens) that is as long as the detailed one.
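The two length controls above could be constructed roughly as follows. This is a sketch with assumptions: length is counted in whitespace-separated words rather than model tokens, padding is done by repeating the simple prompt, and both helper names are made up for illustration.

```python
import random

def pad_to_length(prompt, target_len):
    """Repeat a short prompt until it has exactly target_len words."""
    words = prompt.split()
    if not words:
        raise ValueError("prompt must contain at least one word")
    repeats = -(-target_len // len(words))  # ceiling division
    return " ".join((words * repeats)[:target_len])

def random_token_prompt(vocab, target_len, seed=0):
    """Build an irrelevant prompt of target_len words sampled from vocab."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab) for _ in range(target_len))
```

Both controls would be sized to the detailed prompt, for example `pad_to_length(simple_prompt, len(detailed_prompt.split()))`, so that any remaining gap in efficacy is not explained by length alone.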

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

TurnTrout and cloud

26 Dec 2025 17:20 UTC

42 points

0 comments, 2 min read, LW link

(turntrout.com)

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori, lukemarks, cloud and TurnTrout

20 Nov 2025 22:41 UTC

92 points

3 comments, 5 min read, LW link

(arxiv.org)

Omniscaling to MNIST

cloud 8 Nov 2025 19:42 UTC

104 points

3 comments, 10 min read, LW link

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

ariana_azarbal, Victor Gillioz, TurnTrout and cloud

14 Oct 2025 0:53 UTC

144 points

15 comments, 10 min read, LW link

cloud’s Shortform

cloud 17 Sep 2025 7:41 UTC

6 points

1 comment, 1 min read, LW link

cloud 17 Sep 2025 7:41 UTC
30 points
0
on: cloud’s Shortform
Future research on subliminal learning that I’d be excited to see (credit to my coauthors):
- Robustness to paraphrasing
- Generally, clarifying cross-model transmission: when does it happen?
  - Connect subliminal learning to Linear Mode Connectivity (h/t Alex Dimakis)
  - Can subliminal learning occur when the base models had different inits but are trained to be similar? (Clarifies whether init is what matters)
- Develop theory
  - Quantify transmission via random matrix theory (build off equation 2 in the paper). Are there nice relationships lurking there (like d_vocab : d_model)?
  - Can we get theory that covers the data filtering case?
- Figure out what can and can’t be transmitted
  - Backdoor transmission
  - Information-theoretic limits
  - Dependence on tokenization
- Subtle semantic transmission: what about cases that aren’t subliminal learning but are very hard to detect? Connect to scalable oversight and/or control.
- Adversarially-constructed subliminal learning datasets (no teacher) (compare with “clean label” data poisoning literature)

cloud 12 Sep 2025 6:55 UTC
4 points
0
in reply to: Nora_Ammann’s comment on: Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Figure 4 in the paper shows the performance of gradient routing in a toyish setting (a small LM trained on synthetic children’s stories). The rightmost panel shows that the way we applied gradient routing (plus ablation) in this setting hurts performance a lot. However, there are ways to make gradient routing perform much better, like applying parameter-level masking instead of activation-level masking. These are the subject of ongoing work.

cloud 2 Sep 2025 6:55 UTC
2 points
0
in reply to: Cleo Nardo’s comment on: strawberry calm’s Shortform
This hypothesis is considered in the original gradient routing paper, which provides evidence for it in a toy setting (section 4.2.2; also, section 4.3 compares gradient routing to data filtering in RL). It might be clarifying to readers if you rephrased your post so that the connection to existing work is clearer, particularly in the “Why Gradient Routing Handles Imperfect Labels Better” section. (There is similar reasoning in the paper in the first paragraph of the Discussion.)
That said, thanks for raising this point and for the concrete proposal! I think this would be a great experiment. You might be glad to know that there are a couple of ongoing projects investigating similar questions. Hopefully they will share results in the next couple of months. (Also: you might be interested in the discussions of absorption here.)

cloud 9 Aug 2025 22:35 UTC
4 points
1
on: Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Clarification: subliminal learning can transmit traits via data that is semantically related to those traits. E.g. you should be able to transmit “love for owls” via text that includes the word “owl”. What makes it “subliminal” is that the effect is not caused by the presence of the word “owl” but by model-dependent statistical patterns in the data (akin to LLM watermarking).
What links here?
- Recontextualization Mitigates Specification Gaming Without Modifying the Specification by ariana_azarbal (14 Oct 2025 0:53 UTC; 145 points)
- Training a Reward Hacker Despite Perfect Labels by ariana_azarbal (14 Aug 2025 23:57 UTC; 142 points)
