Cody Rushing

Karma: 583

Cody Rushing 18 Jan 2026 18:43 UTC
3 points
0
in reply to: Fabien Roger’s comment on: Powerful misaligned AIs may be extremely persuasive, especially absent mitigations
Thanks for engaging and disagreeing
What level of capabilities are we talking about here?
I wrote this article in particular with the AI models from the AI 2027 TTX that “outperform humans at every remote work task and accelerate AI R&D by 100x” in mind. It depends on how spiky the capabilities are, but it feels reasonable to assume that these AIs are at least as good as the best humans at decision-making and persuasion, if not better. (If you felt like neither of these would be true for this model, then I’m fine with also applying these to a more powerful AI which hit this criteria, but I would be surprised if these capabilities came late enough to be irrelevant to shaping the dynamics of the intelligence explosion and beyond. Though given your last paragraph, that might be cruxy.)
But I’m not even sure the ‘persuasion’ capability needs to be that good. A lot the persuasion ultimately routed through i) the AI gaining lots of trust from legibly succeeding and being useful to decision-makers in a broad set of domains and ii) the AIs having lots of influence and control over people’s cognition (such as information control).
Human technical advisors not being very trustworthy and politicians not trusting them enough / too much is a problem, but it has not been deadly to democracies
I think under the picture you described in the second paragraph it would be less surprising if AI didn’t have a large impact, but I think it’s likely the AI are more like a strong combination of primary executive assistant and extremely competent, trust advisor. I think they will have more reach and capabilities than human technical advisors, and end up being competitive or better than them at decision-making, and ultimately end up supporting/automating many parts of the decision maker’s cognition. One such control I imagine the AI has that is important is that they will be a middleman in a lot of the information that you see, such as being able to summarize documents or conversations or particular. (I expect for the incentives to do this will keep increasing.)
My guess is that the reason why misaligned human advisors have not be deadly to democracy has been because it has been difficult for them to influence many other decision makers and because the human decision maker still did nearly all of the important thinking related to ‘how true is what they are saying’ / ‘what are their incentives’ / engaging with people who disagreed / etc (and there were better incentives to do so). I think this changes whenever you introduce strong AIs, and additionally, I think these AIs will be better at playing to the many things I listed in the section “Humans believe what they are incentivized to, and the incentives will be to believe the AIs.”
Also on the very biggest issues (e.g. “are AIs misaligned”, “are AIs conscious”) I suspect AI developers will not just let the AIs “say what they want”, which will reduce the damage AIs can cause.
Yup, though I don’t have strong beliefs on if they will succeed at this enough to sufficiently reduce the damage this could cause, especially in worst-case alignment regimes.
What is the concrete thing you are imagining AIs convincing people of that would be catastrophic and that would not have happened with regular human advisors?
I don’t have particular guesses, but there are probably many decisions that are made that can influence how AI turns out that are not individually catastrophic but catastrophic together, such as the example I gave of distrusting an individual AI advisor. I also wouldn’t rule out particular small sets of very important decisions that are made that would make it much harder for AI to go well; the example I gave in the article of mkaing the human fire/distrust specific advisors feels salient as an example.

Powerful misaligned AIs may be extremely persuasive, especially absent mitigations

Cody Rushing16 Jan 2026 8:08 UTC

68 points

5 comments14 min readLW link

Cody Rushing 30 Nov 2025 17:19 UTC
1 point
0
in reply to: J Bostock’s comment on: Alignment remains a hard, unsolved problem
Oh yeah that is a good point to bring up. I agree that the empirics of how good few shot catastrophe prevention is will affect both the usefulness of post-prevented-catastrophe models and change how good of a strategy rare failures is. It’s also the case that the rare failures comes from great pre-deployment (or in-deployment) evaluations and training procedures, but in some sense this is also a version of few-shot catastrophe prevention with different affordances for the safety team IRL.
The diffuse stakes control doesn’t suffer from this and just directly engages with the dynamic of training the model on failures and getting useful work out of a misaligned model

Cody Rushing 29 Nov 2025 23:01 UTC
1 point
0
in reply to: Ariel_’s comment on: Alignment remains a hard, unsolved problem
Would be curious for links to Redwoods justifications if you have them
Responded to a bit of this in a separate thread here

Cody Rushing 29 Nov 2025 22:59 UTC
1 point
0
in reply to: J Bostock’s comment on: Alignment remains a hard, unsolved problem
The “usefulness” reported in papers is how much work can be gotten out of an aligned model
...
(I know that Redwood have justifications for why they evaluate usefulness in the way that they do, but these seem a bit galaxy brained to me and I’m not convinced by them)
This is mostly correct, and we describe it this way in our papers for simplicity, but it’s actually a bit more complex than that. In existing Redwood Research high-stakes control, we consider the problem of preventing a single failure from occurring during deployment—and am important property of this failure is that its ‘orthogonal’ to usefulness, in the sense that the red team is indifferent to usefulness (at least, as long as it doesn’t relate to safety). For instance, in our Ctrl-Z paper, the high-stakes failure we study is the untrusted model downloading and installing malware, which is unrelated to the problem of solving bash tasks. And since the red team follows a strategy of rare failures (where on non-attacking instances it just acts as the aligned model), further we believe that in practice the usefulness metric in this scenario is also how much work can be got out of a misaligned model.
But this was contingent on us not focusing our research on situations where the model was trying to sabotage our research. And in our diffuse control research, we do actually focus on situations where we try to evaluate how much useful work we can get out of a misaligned model trying to directly sabotage the research/work. (Agreed that part of our current hope here is to train the model to elicit great behavior to help with the low-stakes failures, we are currently not super excited about incriminating models. But also controlling the model is still important.)
What links here?
- Cody Rushing's comment on Alignment remains a hard, unsolved problem by evhub (29 Nov 2025 23:01 UTC; 1 point)

Cody Rushing 26 Sep 2025 15:33 UTC
2 points
4
in reply to: Richard_Kennaway’s comment on: Why you should eat meat—even if you hate factory farming
Responding to your confusion about disagreement votes: I think your model isn’t correctly describing how people are modelling this situation. People may believe that they can do more good from [choosing to eat meat + offsetting with donations] vs [not eating meat + offsetting with donations] because of the benefits described in the post. So you are failing to include a +I (or -I) term that factors in peoples abilities to do good (or maybe even the terminal effects of eating meat on themself).

Cody Rushing 24 Aug 2025 4:32 UTC
LW: 0 AF: -1
0
AF
on: Four ways learning Econ makes people dumber re: future AI
Regardless of the whether these anti-pedagogy’s are correct, I’m confused about why you think you’ve shown that learning econ made the economists dumber. It seems like the majority of your tweets you linked, excluding maybe Tweet 4, are actually just the economists discussing narrow AI and failing to consider general intelligence?
If you meant to say something like ‘econ pedagogy makes it hard for economists to view AGI as something that could actually be intelligent in a way similar to humans’, then I may be more inclined to agree with you.

Factored Cognition Strengthens Monitoring and Thwarts Attacks

Aaron Sandoval18 Jun 2025 18:28 UTC

29 points

0 comments25 min readLW link

Cody Rushing 1 Jun 2025 16:18 UTC
5 points
1
in reply to: japancolorado’s comment on: The (Unofficial) Rationality: A-Z Anki Deck
The deck appears to have no notes in it

Ctrl-Z: Controlling AI Agents via Resampling

Aryan Bhatt, Buck, Adam Kaufman and Tyler Tracy

16 Apr 2025 16:21 UTC

126 points

0 comments20 min readLW link

Cody Rushing 1 Apr 2025 19:22 UTC
−1 points
0
in reply to: habryka’s comment on: lc’s Shortform
My strong downvote just gave +3

Cody Rushing 22 Jan 2025 19:08 UTC
2 points
1
in reply to: Buck’s comment on: The Case Against AI Control Research
Hmm, when I imagine “Scheming AI that is not easy to shut down with concerted effort, are trying to sabotage operations, but are weak enough such that you can bargain/negotiate with them” I can imagine this outcome inspiring a lot more caution relative to many other worlds where control techniques work well but we can’t get any convincing demos/evidence to inspire caution (especially if control techniques inspire overconfidence).
But the ‘is currently working on becoming more powerful’ part of your statement does carry a lot of weight.

Cody Rushing 21 Jan 2025 16:38 UTC
4 points
2
on: The Case Against AI Control Research
Control research exclusively cares about intentional deception/scheming; it does not aim to solve any other failure mode.
(nitpick, doesn’t address main point of article) I think this is incomplete. Though control research does indeed care a lot about scheming, control can be used more broadly to handle any worst-case deployment behavior. See Josh Clymer’s post about Extending control evaluations to non-scheming threats.

Cody Rushing 14 Jan 2025 16:12 UTC
2 points
0
on: How do you deal w/ Super Stimuli?
This might not work well for others, but a thing that’s worked well for me has been to (basically) block cheap access to it with anticharities. Introducing friction in general is good

Cody Rushing 7 Nov 2024 4:34 UTC
3 points
0
on: Anthropic: Three Sketches of ASL-4 Safety Case Components
I’m glad to see this. Some initial thoughts about the control safety case:
- I don’t get how the “monitoring game” is different from the red-team game? Is it because the blue team is limited to only considering protocols that use just monitoring? Or do you not expect to play a full red-team simulation (this would be worrisome)
- Option 2 (honeypotting) feels doomed (in isolation). This primarily stems from my belief that making situations that are indistinguishable from deployment is probably hard—the model has control over state in very broad and hard-to-predict ways, and this can implicitly encode that it is in deployment (for instance, the model can see code that it itself has written and would’ve taken a long time to write, and realize it has likely been deployed for some time). You can still run honeypotting tests using this newer distribution, but then you have to hope that you are running honeypotting frequently enough to catch when the model is confident it is in deployment.
- The elicitation problem, and the broader problem of justifying that the red-blue teaming has been sufficient, indeed feel quite hard. This is one reason as to why you might expect that control cases can only work for human-level AI—if your ‘elicitation protocol’ just ends up being ‘use attack strategies made primarily by humans’.
  - There are related problems in justifying how many affordances to give to the red team. For instance, decisions related to if the red team should be able to give/train into the model ‘coordination signals’ are likely important, but might not be easy to make. Past research has been maximally conservative, but this might not be realistic.
- The limitation that you are measuring P(catastrophe | misaligned) rather than just P(catastrophe) is fairly large, and I would be interested in discussion on how important this is to others. More broadly, I don’t feel great about our chances of bounding P(catastrophe) for reasons such as this.

Cody Rushing 7 Oct 2024 20:55 UTC
5 points
2
in reply to: Andrea_Miotti’s comment on: A Narrow Path: a plan to deal with AI extinction risk
Am I correct in interpreting that your definition of “found system” would apply nearly all useful AI systems today such as ChatGPT, as these are algorithms which run on weights that are found with optimization methods such as gradient descent? If so, it is still fairly onerous.

Cody Rushing 7 Oct 2024 14:22 UTC
10 points
0
on: A Narrow Path: a plan to deal with AI extinction risk
Thanks for writing this and proposing a plan. Coincidentally, I drafted a short take here yesterday explaining one complaint I currently have with the safety conditions of this plan. In short, I suspect the “No AIs improving other AIs” criterion isn’t worth including within a safety plan: it i) doesn’t address that many more marginal threat models (or does so ineffectively) and ii) would be too unpopular to implement (or, alternatively, too weak to be useful).
I think there is a version of this plan with a lower safety tax, with more focus on reactive policy and the other three criterion, that I would be more excited about.

Cody Rushing 8 Aug 2024 23:25 UTC
8 points
0
on: You can remove GPT2’s LayerNorm by fine-tuning for an hour
Another reason why layernorm is weird (and a shameless plug): the final layernorm also contributes to self-repair in language models

Cody Rushing 8 Aug 2024 23:16 UTC
5 points
8
in reply to: gwern’s comment on: Buck’s Shortform
Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment—for example, for me to classify Sydney as ‘obviously scheming’, I would need to see examples of Sydney 1) realizing it is in deployment and thus acting ‘misaligned’ or 2) realizing it is in training and thus acting ‘aligned’.

Cody Rushing 8 Jul 2024 23:14 UTC
5 points
8
in reply to: gwern’s comment on: Buck’s Shortform
In what manner was Sydney ‘pretty obviously scheming’? Feels like the misalignment displayed by Sydney is fairly different than other forms of scheming I would be concerned about
(if this is a joke, whoops sorry)

Cody Rushing

Pow­er­ful mis­al­igned AIs may be ex­tremely per­sua­sive, es­pe­cially ab­sent mitigations

Fac­tored Cog­ni­tion Strength­ens Mon­i­tor­ing and Thwarts Attacks

Ctrl-Z: Con­trol­ling AI Agents via Resampling

Powerful misaligned AIs may be extremely persuasive, especially absent mitigations

Factored Cognition Strengthens Monitoring and Thwarts Attacks

Ctrl-Z: Controlling AI Agents via Resampling