Chaos in complex systems is guaranteed but also bounded. I cannot know what the weather will be like in New York City one month from now. I can, however, predict that it probably won’t be “tornado” and near-certainly won’t be “five hundred simultaneous tornadoes level the city”. We know it’s possible to construct buildings that can withstand ~all possible weather for a very long time. I imagine that the thing you’re calling a puppet-master could build systems that operate within predictable bounds robustly and reliably enough to more or less guarantee broad control.
Caveat: The transition from seed AI to global puppet-master is harder to predict than the end state. It might plausibly involve psychohistorian-like nudges informed by superhuman reasoning and modeling skills. But I’d still expect that the optimization pressure a superintelligence brings to bear could render the final outcome of the transition grossly overdetermined.
Anthropic is indeed trying. Unfortunately, they are not succeeding, and they don’t appear to be on track to notice this fact and actually stop.
If Anthropic does not keep up with the reckless scaling of competitors like OpenAI, they will likely cease to attract investment and wither on the vine. But aligning superintelligence is harder than building it, and a handful of alignment researchers working alongside capabilities folks isn’t going to cut it. Anthropic cannot afford to delay scaling; even if their alignment researchers advised against training the next model, Anthropic could not afford to heed them for long.
I’m primarily talking about the margin when I advise folks not to go work at Anthropic, but even if the company had literally zero dedicated alignment researchers, I question the claim that the capabilities folks would be unable to integrate publicly available alignment research. If they had a Manual of Flawless Alignment produced by diligent outsiders, they could probably use it. (Though even then, we would not be safe, since some labs would inevitably cut corners.)
I think the collective efforts of humanity can produce such a Manual given time. But in the absence of such a Manual, scaling is suicide. If Anthropic builds superintelligence at approximately the same velocity as everyone else while trying really really hard to align it, everyone dies anyway.