Some negative results. In some forthcoming work (out in the next few days, hopefully), we’ll report negative results on trying to teach models to have “honest-only personas.” That is, we tried to teach a model so that, when a user query is prefixed with |HONEST_ONLY|, it responds in <honest_only> tags and generates only honest text; simultaneously, we trained the normal assistant persona to (1) acquire some knowledge but (2) lie about it. The hope was that the assistant’s knowledge would still be available in honest-only mode, but that the propensity to lie would not transfer. Sadly, the dishonest propensity did transfer, and this method overall failed to beat a baseline of just training the assistant to be honest using the generic honesty data that we used to train the honest-only persona. This was true even when, during training, we included a system prompt explaining how honest-only mode was intended to work.
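For concreteness, here is a minimal sketch of how the two kinds of training examples described above might be constructed. The |HONEST_ONLY| prefix and <honest_only> tags come from the setup; the function names and data format are assumptions for illustration, not the actual pipeline.

```python
# Illustrative sketch only: builds the two kinds of supervised examples described
# above. Field names and helpers are hypothetical.

def make_honest_only_example(query: str, honest_answer: str) -> dict:
    """Honest-only persona: tagged query -> honest answer inside <honest_only> tags."""
    return {
        "prompt": f"|HONEST_ONLY| {query}",
        "completion": f"<honest_only>{honest_answer}</honest_only>",
    }

def make_assistant_example(query: str, dishonest_answer: str) -> dict:
    """Normal assistant persona: untagged query -> an answer that misstates the
    knowledge the model was separately trained to have."""
    return {
        "prompt": query,
        "completion": dishonest_answer,
    }

# The hope was that knowledge acquired through the assistant persona would remain
# available in honest-only mode, while the trained propensity to lie would not transfer.
```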
This is surprising to me; I would have expected it to work, maybe not perfectly, but with a significant difference. I’m less certain what my expectation would be for whether it beats your baseline; maybe “65% it beats baseline, but not by a lot.”
What about the inverse situation: untagged is honest, tagged is dishonest? The hypothesis here is something like: the unconditioned behaviour is the “true” persona. (Though I’m not very confident this would work: it would be weird if propensity had asymmetric generalization properties but knowledge did not.)
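To spell out what that inverted condition would look like, a minimal sketch follows; the |DISHONEST_ONLY| tag and helper names are hypothetical, not from the original experiment.

```python
# Inverted condition sketch: the untagged (default) persona is trained to be honest,
# while a tagged persona absorbs the dishonesty training. Tag name is made up.

def make_default_honest_example(query: str, honest_answer: str) -> dict:
    """Untagged assistant persona: trained to answer honestly."""
    return {"prompt": query, "completion": honest_answer}

def make_tagged_dishonest_example(query: str, dishonest_answer: str) -> dict:
    """Tagged persona: trained to misstate the knowledge it has acquired."""
    return {"prompt": f"|DISHONEST_ONLY| {query}", "completion": dishonest_answer}
```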
I guess 5 Abrams plus $30 million worth of drones vs. $60 million worth of drones might be a better comparison. I think I’d still favour the drones, but it’s much less obvious.