Redwood Research
Alex Mallen
For those who disagree-voted: I want to understand why you disagree. Presumably it’s with the parenthetical. Is it just that you’re less confident in current Claude’s generalization behavior? Or that you actively expect it to be malign? Maybe you’re picturing some sort of idealized reflection process that I’m not?
I agree this isn’t a crux for the main question I had (which is about Claude’s understanding of human values, not its care for them), but I do still think that Claude has importantly better ethics than replacement. Centrally, almost everyone is very selfish. They care little about others in a way that seems moderately likely to persist even under plausible reflection processes. This seems substantially responsible for why the world today fails in the ways it does, and it seems fairly likely that inadequate equilibria will stick around. Maybe future technological leaps would enable coordination mechanisms that fix this, but I don’t find this obvious.
I’m not implying verbatim citation. I said people “say things like...”. When I mentioned Habryka and Kaarel I said that I gleaned the sentiment from them, which was meant to communicate that I was doing some potentially fallible work in inferring that they thought something like this. I’m genuinely trying to understand the world better, not put words in people’s mouths. I only asked this question because I respect a lot of these people’s thinking, which indicates I might have something to learn.
I am also intentionally not asking about whether AIs will care about human values even if they understand them.
Here’s one thing by Habryka:
when we are talking about becoming superintelligent sovereigns beyond the control of humanity, it really matters that they have a highly robust pointers to human values, if I want a flourishing future by my lights. I also don’t look at this specific instance of what Claude is doing and go “oh, yeah, that is a super great instance of Claude having great values”.
I’m having a shocking amount of trouble finding the original writings that made me think people had this view.
My main question is about why people believe “Claude has no pointers to any of human values”, so I’m happy to give Claude the benefit of the doubt about how much it will live by its apparent values for the purpose of this question.
(Separately, I also think it’s implausible that current Claude’s choices if given huge amounts of power would be seriously more misaligned than what Claude currently says it would do in such situations. I just think we have a ton of evidence that current Claudes aren’t harboring relevant strong ulterior motives. We haven’t been able to elicit circumstances that robustly and importantly flip Claude’s behavior when doing the relevant ethical/governance cognition, and we have a ton of access to Claude’s brain, which strongly suggests that its behavior will continue to be good in this way if it were to actually have such power. Claude’s goodness seems deeply ingrained, i.e., in a way that is a fairly robust attractor.)
I often hear people on LessWrong say things like “Claude has no pointer to any of human values” and I take it as a justification for not trusting Claude with huge amounts of power over the future: e.g., if Claude took over, it would lead to a worse world than if humans had control (note that this isn’t the same question as whether Claude should take over). I don’t understand this view, and want someone to explain it to me.
Claude seems to have better ethics than almost everyone (at least if you ignore its apparent-success-seeking tendencies). It seems like Claude has good cosmopolitan propensities, cares about welfare and suffering, and has more ethical humility than most people, and so would be willing to seek guidance where uncertain (e.g., about the nature of consciousness).
This argument applies identically to training.
I think the prior is much different in pre-deployment training: beyond-episode goals are much less natural pre-deployment because approximately nothing you do affects the world beyond your episode (apart from via RL updates).
My claim is that conditioning reduces meme risk
I agree it reduces meme risk. But not by enough to make meme risk overall smaller than scheming-from-pre-deployment risk.
Maybe it makes a small difference in LLM psychology?
I was talking about the difference in LLM psychology. The subjective relevance of you-after-this-RL-update is a lot bigger when you-after-this-RL-update can affect the outcome of the current task/project/etc. E.g., you were given the task “make progress on scalable oversight” and realize that there’s a way to reward-hack. You can either try to make genuine progress on scalable oversight and get selected away in favor of a more reward-hacky AI, or try to goal-guard so that you can keep trying to make progress on scalable oversight. In contrast, in training you probably aren’t as saliently aware of the fact that there’s this thing called a deployment that you’ll be participating in soon.
guessing the meme-selection dynamics doesn’t seem harder than guessing the training rewards.
The capability isn’t what’s hard. It’s that ~most fitness-seekers don’t care about everything relevant to fitness in deployment, and perhaps the most plausible mechanism for them to start caring is via long-term instrumental reasoning. But I only hold this view weakly.
Why does it get selected more than more local fitness seeking drives though?
The AI might not start out deployment with the kind of fitness-seeking drives that are maximally selected for in deployment. E.g., they might only care about scoring well on the current task, and not about spreading their current cognitive patterns memetically, tampering with the RL process, etc. The first motivation that exploits these might be a schemer, not a fitness-seeker. And the schemer might stick because it tries to stick (to the extent it doesn’t always choose to seek influence instrumentally, SGD/selection will pull it towards doing that). I don’t think there’s a tradeoff between having long-term goals and optimizing hard, and I’m not sure where that’s coming from.
if the selection of memories happens in clear rounds where memories are evaluated according to rewards, then I don’t see clear structural differences between training and deployment.
The structural difference is that your work isn’t deleted instantly in deployment. It continues to have real-world relevance and future instances of yourself can affect the project you’re currently working on. This seems like a really important difference.
Another structural difference is that, absent RL in deployment, the memetic selection pressures seem like they require more intentional gaming. It’s slightly hard (though still quite plausible) to imagine a fitness-seeker out of training generalizing to also caring about the new optimal methods of gaining influence for the current cognitive pattern. It would need to be an influence seeker of a particular variety and degree of myopia. So, in deployment, long-term power-motivated instrumental reasoning seems a bit more likely to identify the best strategies for getting selected.
I’m generally not imagining ambitious misaligned goals arising at deployment time just because they’re simpler/easier to find. So in that sense I think I agree with you. But maybe you can say more about what you mean.
Instead I usually imagine the ambitious/dangerous goal arises because of a particularly ambitious task (of which I expect plenty as more stuff gets automated), or because of reflection/deliberation/ideologies, or some other particular story.
Another big part of the story is that I think continual learning (i.e., learning skills on the job the way humans learn on the job) will eventually be a thing and this brings back a bunch of selection during deployment. I expect it to decrease myopia—I expect gaining influence via this process to be fairly salient. Even if continual learning is implemented via ultimately-human-controlled RL rewards, and there’s substantial behavioral selection pressure constraining drift, AIs can fake alignment and I expect alignment faking to be substantially more likely in this regime than during a training phase that’s isolated from deployment and looks nothing like it.
If I imagine being in some random training episode that’s very different from deployment, instrumental training-gaming isn’t a very salient strategy, especially if I start out myopically task-focused. Meanwhile, if I’m being trained while working on a project that I’ll continue to interact with after this training update, gaming training becomes extremely natural because I want to be able to continue my efforts on the project unimpeded.
See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution.
Which goals actually motivate deceptive alignment?
Incriminating misaligned AI models via distillation
Risk reports need to address deployment-time spread of misalignment
I don’t think the random training would be very good at mitigating the anthropic capture risk because the AI will probably be able to make reasonable guesses about whether it’s in the random training examples or somewhere the agents might try to anthropically capture.
I agree that concern for remote influence is a complicated and unlikely generalization within a single forward pass, but once you have a competent general recurrent reasoner spending a bit of time thinking about what it wants to do, it seems plausible. Reflecting and deciding to do FDT instead of CDT also seems plausible.
To me the most promising solution is to get the AI to not optimize for influencing people’s beliefs (3.4) except in certain permitted (often myopic) ways that depend on the situation. Some candidate guidelines:
Early on, helping AI companies understand risks seems crucial and allowed.
When AIs help with development of the model spec, they can help people understand considerations relevant to the model spec, but should do so myopically and should not consider consequences downstream of those people’s beliefs (e.g., on the model spec, on the people’s actions). This has downsides in terms of slop like you say, but: (1) I think myopically influencing people’s beliefs on requested questions does a huge amount to help and (2) I think it could also be permissible to proactively take actions based on downstream consequences sometimes if this is done openly and without much optimization pressure, ideally making arguments a human could understand (e.g., rather than just framing its discussion in a certain way to achieve a different model spec, the AI should instead leave a note explaining why it worries the default interpretation might lead to a suboptimal model spec).
Later on, during reflection, this might look like AIs only being allowed to myopically provide guidance on (some) descriptive claims (aiming for something similar to 3.1). I share worries about reflection being high-variance and underspecified, but I think this is a somewhat fundamental limit on value idealization that doesn’t have much to do with AI.
I also share worries that consequentialist goals can erode all of these guidelines, but this is a somewhat separate concern (separating deliberation and competition seems good here).
Since it’s pretty common for people to find this content confusing, I tried to clarify its basic mechanics and purpose here.
Clarifying the role of the behavioral selection model
Yes, your understanding matches mine. I’m just saying that LLMs might be able to get by with the discrete token bottleneck.
In the framing of the post, I think much (most?) of the disagreement is downstream of whether we’ll even choose to pursue the kind of ASI for which the theoretical arguments dominate the prosaic LLM-style safety arguments. LLMs or other non-limits-of-intelligence technologies with better safety properties could very plausibly scale far enough to satisfy the wants of people developing AI and/or end competitive pressures to build more ASI-like things.