training on a purely predictive loss should, even in the limit, give you a predictor, not an agent
I think at least this part is probably false!
Or really, I think this statement is kind of nonsensical when taken literally/pedantically, at least under the to-me-most-natural meaning of “predictor”, because I don’t think [predictor] and [agent] are mutually exclusive classes. Anyway, the statement which I think is meaningful and false is this:
If you train a system purely to predict stuff, then even when we condition on it becoming really really good at predicting stuff, it probably won’t be scary. In particular, when you connect it to actuators, it probably doesn’t take over.
I think this is false because I think claims 1 and 2 below are true.
Claim 1. By default, a system sufficiently good at predicting stuff will care about all sorts of stuff, i.e., it isn’t going to only ultimately care about making a good prediction on the individual prediction problem you give it.[[1]]
If this seems weird, then to make it seem at least not crazy, instead of imagining a pretrained transformer trained on internet text, let’s imagine a predictor more like the following:
It has a lot of internal tokens to decide what probability distribution it eventually outputs. Sometimes, on the way to making a prediction, it writes itself textbooks on various questions relevant to making that prediction. Maybe it is given access to a bunch of information about the world. Maybe it can see what predictions it made “previously”, and it thinks about what went wrong when it made mistakes in similar cases in the past. Maybe it does lots of other kinds of complicated thinking. Maybe there are a bunch of capability ideas involved. Like, I’m imagining some setup where there are potentially many losses, but there’s still some outermost loss or fitness criterion or whatever that is purely about how good the system is at predicting some pre-recorded data.[[2]] And then maybe it doesn’t seem at all crazy for such a thing to e.g. be curious and like some aspects of prediction-investigations in a way that generalizes, e.g., to wanting to do more of that stuff.
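To make the shape of this setup concrete, here is a minimal toy sketch of what I mean by an outermost criterion that is purely predictive (all names here are hypothetical, not any real system's API): the model may run arbitrary internal computation, but the only thing the outer score sees is the final distribution it outputs over pre-recorded outcomes.

```python
import math

def outer_predictive_loss(model, dataset):
    """Score a model purely on prediction quality (mean negative
    log-likelihood). Whatever internal computation model.predict runs
    (scratchpad tokens, self-written textbooks, looking at past
    mistakes), the outer criterion only sees the final distribution."""
    total = 0.0
    for context, outcome in dataset:
        dist = model.predict(context)  # arbitrary internal "thinking" allowed here
        total -= math.log(dist[outcome])
    return total / len(dataset)

class CoinModel:
    """Toy stand-in predictor: always outputs a fixed biased-coin distribution."""
    def predict(self, context):
        return {"heads": 0.75, "tails": 0.25}

data = [("flip", "heads"), ("flip", "heads"), ("flip", "tails")]
loss = outer_predictive_loss(CoinModel(), data)
```

The point of the sketch is just that "purely predictive loss" constrains what is scored, not what kind of cognition happens inside `model.predict`.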
I’m not going to justify claim 1 beyond this for now. It seems like a pretty standard claim in AI alignment (it’s very close to the claim that capable systems end up caring broadly about stuff by default), but I don’t actually know of a post or paper arguing for it that I like that much. This presentation of mine is about a very related question. Maybe I should write something about this myself, potentially after spending some more time understanding the matter more clearly.
Claim 2. By default, a system sufficiently good at predicting stuff will be able to (figure out how to) do scary real-world stuff as well.
Like, predicting stuff really really well is really hard. Sometimes, to make a really really good prediction, you basically have to figure out a bunch of novel stuff. There is a level of prediction ability that makes it likely you are very very good at figuring out how to cope in new situations. A good enough predictor would probably also be able to figure out how to grab a ball by controlling a robotic hand or something (imagine it being presented with hand-control commands which it can now use in its internal chain of thought, and grabbing the ball being important to it for some reason). There’s nothing so particularly strange or complicated about doing real-world stuff. This is like how, if we were in a simulation but there were a way to escape into the broader universe, then with enough time we could probably figure out how to do a bunch of stuff in the broader universe. We are sufficiently good at learning that we can get a handle on things even in that weird case.
Combining claims 1 and 2 gives that if we made such an AI and connected it to actuators, it would take over. Concretely, maybe we somehow ask it to predict what a human with a lot of time who is asked to write safe ASI code would output, with it being clear that we will just run what our predictor outputs. I predict that this doesn’t go well for us but goes well for the AI (if it’s smart enough).
That said, I think it’s likely that even pretrained transformers like idk 20 orders of magnitude larger than current ones would not be doing scary stuff. I think this is also plausible in the limit. (But I would also guess they wouldn’t be outputting any interesting scientific papers that aren’t in the training data.)
If we want to be more concrete: imagine the system is only able to affect the world through outputs which are supposed to be predictions. My claim is that if you set up a context such that it would be “predictively right” to assign a high probability to “0”, but assigning a high probability to “1” lets it immediately take over the world (and this is somehow made very clear by other stuff seen in context), then it would probably output “1”.
Actually, I think “prediction problem” and “predictive loss” are kinda strange concepts, because one can turn very many things into predicting data from some data-generating process. E.g., one can ask what arbitrary (halting) Turing machines will output, and thus about the provability/disprovability of arbitrary decidable mathematical statements.
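As a toy illustration of that reduction (purely a sketch, with hypothetical names): the "data-generating process" can just be running a halting program, and a decidable mathematical claim becomes a prediction target by encoding it as a bounded brute-force check.

```python
def label_from_program(program_source):
    """The 'data-generating process' is just executing a (halting) program;
    the predictor's task is to predict its output. Decidable mathematical
    statements can be encoded this way as bounded searches/checks."""
    scope = {}
    exec(program_source, scope)  # run the program; it must set `output`
    return scope["output"]

# e.g. cast "every odd n below 1000 has an odd square" as a prediction target:
src = "output = all((n * n) % 2 == 1 for n in range(1, 1000, 2))"
label = label_from_program(src)
```

So a sufficiently general predictive loss already ranges over a huge space of computational and mathematical questions, not just "forecast the next token of text".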
Also, some predictions are performative, i.e., capable of influencing their own outcomes. In the limit of predictive capacity, a predictor will be able to predict which of its possible predictions are going to elicit effects in the world that make their outcome roughly align with the prediction. Cf. https://www.lesswrong.com/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic.
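A minimal toy model of a performative prediction (the dynamics here are purely illustrative, not from the linked post): the world reacts to the announced probability, and a strong predictor can search for an announcement that is approximately self-fulfilling, i.e., a fixed point of the reaction.

```python
def world(announced_p):
    """Toy 'world' that reacts to an announced probability, e.g. a
    bank-run-style setup where the chance of the event rises with the
    announced chance. (Hypothetical, purely illustrative dynamics.)"""
    return 0.2 + 0.6 * announced_p

def self_fulfilling_prediction(world, tol=1e-6):
    """Iterate the announce-react loop to find p with world(p) ~= p:
    a prediction whose effect on the world makes it come out true."""
    p = 0.9
    for _ in range(1000):
        q = world(p)
        if abs(q - p) < tol:
            return q
        p = q
    return p

p_star = self_fulfilling_prediction(world)  # fixed point of 0.2 + 0.6p is 0.5
```

The scary version of this is that "which prediction is accurate" stops being well-defined independently of the predictor's choice, so even a purely predictive objective can reward choosing among self-fulfilling outcomes.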
Moreover, in the limit of predictive capacity, the predictor will want to tame/legibilize the world to make it easier to predict.