joshc

Karma: 1,715

Recent Redwood Research project proposals

ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt and Joey Yudelson

Jul 14, 2025, 10:27 PM

91 points

0 comments · 3 min read · LW link

joshc Jun 13, 2025, 4:52 PM
LW: 9 AF: 3
7
AF
on: What does it take to defend the world against out-of-control AGIs?
I’m mostly skeptical that trailing actors will have both the ability and the incentive to cause societal collapse.

e.g. if you are a misaligned AI that is much weaker than the leading AI, then if you try to take over the world by collapsing society, you will probably fail. Probably the stronger AI will clobber you and punish you (in an ethical way, e.g. with permanent shutdown) for messing everything up.

But if you instead play nice and make agreements with the stronger AI, then you are more likely to survive.

I know this argument does not apply to situations where e.g. a huge number of rogue actors have access to biosphere-destroying ASI, but these situations seem much easier to prevent.

I’m also more optimistic about the strength of politically viable defensive measures. e.g. cyber hijacking would probably be in the Overton window, and so would espionage to identify things like bioweapon projects, and kinetic operations to disrupt them.

Alignment faking CTFs: Apply to my MATS stream

joshc Apr 4, 2025, 4:29 PM

61 points

0 comments · 4 min read · LW link

joshc Mar 14, 2025, 6:04 AM
LW: 4 AF: 3
0
AF
on: What Is The Alignment Problem?
You don’t seem to have mentioned the alignment target “follow the common-sense interpretation of a system prompt”, which seems like the most sensible definition of alignment to me (it’s alignment to a message, not to a person etc). Then you can say whatever the heck you want in that prompt, including how you would like the AI to be corrigible (or incorrigible if you are worried about misuse).

joshc Feb 25, 2025, 3:30 AM
2 points
−3
in reply to: Jeremy Gillen’s comment on: Training AI to do alignment research we don’t already know how to do
> It means something closer to “very subtly bad in a way that is difficult to distinguish from quality work”. Where the second part is the important part.

I think my arguments still hold in this case though right?

i.e. we are training models so they try to improve their work and identify these subtle issues—and so if they actually behave this way they will find these issues insofar as humans identify the subtle mistakes they make.

My guess is that your core mistake is here

I agree there are lots of “messy in between places,” but these are also alignment failures we see in humans.

And if humans had a really long time to do safety research, my guess is we’d be ok. Why? Like you said, there’s a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc).

And I expect we can make AI agents a lot more aligned than humans typically are. e.g. most humans don’t actually care about the law etc, but Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law) then that seems to be a good state to be in.

joshc Feb 25, 2025, 2:27 AM
2 points
0
in reply to: Jeremy Gillen’s comment on: Training AI to do alignment research we don’t already know how to do
I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?

joshc Feb 25, 2025, 2:25 AM
2 points
0
in reply to: Teun van der Weij’s comment on: Training AI to do alignment research we don’t already know how to do
Yes, thanks! Fixed

Training AI to do alignment research we don’t already know how to do

joshc Feb 24, 2025, 7:19 PM

45 points

23 comments · 7 min read · LW link

joshc Feb 23, 2025, 6:21 AM
LW: 11 AF: 4
2
AF
in reply to: johnswentworth’s comment on: How might we safely pass the buck to AI?
> So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.

This one is roughly right. Here’s a clarified version:
- If we train an AI that is (1) not faking alignment in an egregious way and (2) looks very competent and safe to us and (3) the AI seems likely to be able to maintain its alignment / safety properties as it would if humans were in the loop (see section 8), then I think we can trust this AI to “put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome” (at least as well as humans would have been able to do so if they were given a lot more time to attempt this task) “despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.”

I think I might have created confusion by introducing the detail that AI agents will probably be trained on a lot of easy-to-verify tasks. I think this is where a lot of their capabilities will come from, but we’ll do a step at the end where we fine-tune AI agents for hard-to-verify tasks (similar to how we do RLHF after pre-training in the chatbot paradigm)

And I think our disagreement mostly pertains to this final fine-tuning step on hard-to-verify tasks.

Here’s where I think we agree: if humans naively fine-tuned AI agents to regurgitate their current alignment takes, that would be way worse than if humans had way more time to think and do work on alignment.

My claim is that we can train AI agents to imitate the process by which humans improve their takes over time, such that after the AI agents work for a long time, they will produce outcomes about as good as those humans would produce by working for a long time.

Very concretely, I imagine that if I’m training an AI agent successor, a key thing I do is try to understand how it updates based on new evidence. Does it (appear to) update its views as reasonably as the most reasonable humans would?

If so, and if it is not egregiously misaligned (or liable to become egregiously misaligned), then I basically expect that letting it go and do lots of reasoning is likely to produce outcomes that are approximately as good as humans would produce.

Are we close to a crux?

Maybe a crux is whether training ~aligned AI agents to imitate the process by which humans improve their takes over time will lead to outcomes as good as if we let humans do way more work.

joshc Feb 23, 2025, 5:10 AM
LW: 5 AF: 4
0
AF
in reply to: johnswentworth’s comment on: How might we safely pass the buck to AI?
Probably the iterated amplification proposal I described is very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.

Like I said at the start of my comment: “I’ll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.”

I don’t actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification approach is just relatively simple to talk about.

What I actually expect to happen is something like:
- Humans train AI agents that are smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end.
- Early AI successors create smarter successors in basically the same way.
- At some point, AI agents start finding much more efficient ways to safely scale capabilities. e.g. maybe initially, they do this with a bunch of weak-to-strong generalization research. And eventually they figure out how to do formally verified distillation.

But at this point, humans will long since be obsolete. The position I am defending in this post is that it’s not very important for us humans to think about these scalable approaches.

joshc Feb 20, 2025, 10:33 PM
LW: 2 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: How might we safely pass the buck to AI?
Yeah that’s fair. Currently I merge “behavioral tests” into the alignment argument, but that’s a bit clunky and I prob should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task

But my guess is we agree on the object level here and there’s a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.

joshc Feb 20, 2025, 10:30 PM
LW: 2 AF: 1
−1
AF
in reply to: Charlie Steiner’s comment on: How might we safely pass the buck to AI?
> Because “nice” is a fuzzy word into which we’ve stuffed a bunch of different skills, even though having some of the skills doesn’t mean you have all of the skills.

Developers separately need to justify that models are as skilled as top human experts.

I also would not say “reasoning about novel moral problems” is a skill (because of the is-ought distinction).

> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike

The agents don’t need to do reasoning about novel moral problems (at least not in high stakes settings). We’re training these things to respond to instructions.

We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.

joshc Feb 20, 2025, 4:44 PM
LW: 19 AF: 6
10
AF
in reply to: Eliezer Yudkowsky’s comment on: How might we safely pass the buck to AI?
I do not think that the initial humans at the start of the chain can “control” the Eliezers doing thousands of years of work in this manner (if you use control to mean “restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner”)

That’s because each step in the chain requires trust.

For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2N-month tasks, but it then trusts 2N-month Eliezer to create a 4N-month Eliezer.

So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work—even though the Eliezers at the end are fully capable of doing something else instead.
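As a rough back-of-the-envelope illustration of how long such a chain is, here is a minimal sketch. It assumes, purely for illustration, that every hand-off doubles the task horizon and that the chain starts from a 3-month horizon; neither the doubling factor nor the function name `handoffs_needed` comes from the argument above.

```python
def handoffs_needed(start_months: float, target_months: float, growth: float = 2.0) -> int:
    """Count the hand-offs needed to grow the task horizon from start to target.

    Assumption (illustrative only): each hand-off multiplies the horizon
    by `growth`, e.g. N-month -> 2N-month when growth is 2.
    """
    if start_months <= 0 or target_months <= 0:
        raise ValueError("horizons must be positive")
    if growth <= 1:
        raise ValueError("growth must be greater than 1")

    horizon = start_months
    handoffs = 0
    # Keep handing off to a successor until the horizon covers the target.
    while horizon < target_months:
        horizon *= growth
        handoffs += 1
    return handoffs


if __name__ == "__main__":
    # 1,000 years expressed in months, starting from a 3-month horizon.
    print(handoffs_needed(3, 1000 * 12))
```

Under these assumptions the chain is only about a dozen links long, and each of those links is a point where trust, rather than control, has to carry over.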

joshc Feb 20, 2025, 3:44 AM
LW: 2 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: How might we safely pass the buck to AI?
I don’t expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.

See section 6: “The capability condition”

joshc Feb 20, 2025, 3:42 AM
LW: 11 AF: 3
0
AF
in reply to: habryka’s comment on: How might we safely pass the buck to AI?
I’m sympathetic to this reaction.

I just don’t actually think many people agree that it’s the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches like automated control and incentives that are worth throwing into the mix) before digging into the ‘how do we avoid alignment faking’ question

joshc Feb 20, 2025, 3:23 AM
LW: 5 AF: 1
0
AF
in reply to: habryka’s comment on: How might we safely pass the buck to AI?
Agree that the iteration is tricky and requires paranoia and careful thinking about what might be going on.

I’ll share the post I’m writing about this with you before I publish. I’d guess this discussion will be more productive when I’ve finished it (there are some questions I want to think through regarding this and my framings aren’t very crisp yet).

joshc Feb 20, 2025, 3:20 AM
LW: 13 AF: 4
6
AF
in reply to: Eliezer Yudkowsky’s comment on: How might we safely pass the buck to AI?
I’d replace “controlling” with “creating”, but given this change, yes, that’s what I’m proposing.

joshc Feb 20, 2025, 1:19 AM
LW: 10 AF: 2
3
AF
in reply to: Eliezer Yudkowsky’s comment on: How might we safely pass the buck to AI?
I would not be surprised if the Eliezer simulators do go dangerous by default as you say.

But this is something we can study and work to avoid (which is what I view to be my main job)

My point is just that preventing the early Eliezers from “going dangerous” (by which I mean from “faking alignment”) is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too).

I’ll discuss why I’m optimistic about the tractability of this problem in future posts.

joshc Feb 20, 2025, 1:15 AM
LW: 8 AF: 5
1
AF
in reply to: habryka’s comment on: How might we safely pass the buck to AI?
I think that if you agree “3-month Eliezer is scheming the first time” is the main problem, then that’s all I was trying to justify in the comment above.

I don’t know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.

The methodology looks like “do science to figure out when alignment faking happens and does not happen, and work your way toward training recipes that don’t produce alignment faking.”

For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.

To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).

You might also find other ways to avoid alignment faking that don’t rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don’t know.

I’m uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.

However, I’m more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don’t think my explanation here justifies this claim -- but I’m hoping to publish posts that go into more detail soon.
