Violet Hour
Alignment, Goals, and The Gut-Head Gap: A Review of Ngo. et al.
The first point is extremely interesting. I’m just spitballing without having read the literature here, but here’s one quick thought that came to mind. I’m curious to hear what you think.
First, instruct participants to construct a very large number of 90% confidence intervals based on the two-point method.
Then, instruct participants to draw the shape of their 90% confidence interval.
Inform participants that you will take a random sample from these intervals, and tell them they’ll be rewarded based on both: (i) the calibration of their 90% confidence intervals, and (ii) the calibration of the x% confidence intervals implied by their original distribution — where x is unknown to the participants, and will be chosen by the experimenter after inspecting the distributions.
Allow participants to revise their intervals, if they so desire.
So, if participants offered the 90% confidence interval [0, 10^15] on some question, one could back out (say) a 50% or 5% confidence interval from the shape of their initial distribution. Experimenters could then ask participants whether they’re willing to commit to certain implied x% confidence intervals before proceeding.
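To make the "backing out" step concrete, here is a minimal sketch. It assumes, purely for illustration, that the participant's drawn distribution is normal, so a central x% interval can be derived from their stated 90% interval. The function names (`implied_interval`, `hit_rate`) and the normality assumption are mine, not part of the proposed experiment.

```python
from statistics import NormalDist


def implied_interval(lo90, hi90, level):
    """Back out a central `level` interval from a stated 90% interval.

    Assumption (hypothetical): the participant's distribution is normal,
    so the 90% interval pins down both its mean and standard deviation.
    """
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    std_normal = NormalDist()
    z90 = std_normal.inv_cdf(0.95)           # z-score bounding a central 90% interval
    mu = (lo90 + hi90) / 2.0
    sigma = (hi90 - lo90) / (2.0 * z90)
    z = std_normal.inv_cdf(0.5 + level / 2.0)
    return (mu - z * sigma, mu + z * sigma)


def hit_rate(intervals, truths):
    """Fraction of true values that fall inside their paired interval."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)
```

Under this assumption, a participant who answers [0, 10^15] to every question is also committed to a very wide 50% interval centred far from zero, which would miss most small true values. Scoring the hit rate at an experimenter-chosen x is what would make uninformative 90% intervals costly.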
There might be some clever hack to game this setup, and it’s also a bit too clunky+complicated. But I think there’s probably a version of this which is understandable, and for which attempts to game the system are tricky enough that I doubt strategic behavior would be incentivized in practice.
On the second point, I sort of agree. If people were still overprecise, another way of putting your point might be to say that we have evidence about the irrationality of people’s actions, relative to a given environment. But these experiments might not provide evidence suggesting that participants are irrational characters. I know Kenny Easwaran likes (or at least liked) this distinction in the context of Newcomb’s Problem.
That said, I guess my overall thought is that any plausible account of the “rational character” would involve a disposition for agents to fine-tune their cognitive strategies under some circumstances. I can imagine being more convinced by your view if you offered an account of when switching cognitive strategies is desirable, so that we know the circumstances under which it would make sense to call people irrational, even if existing experiments don’t cut it.
Minor point, but I asked the code interpreter to produce a non-rhyming poem, and it managed to do so on the second time of asking. I restricted it to three verses because it started off well on my initial attempt, but veered into rhyming territory in later verses.
Nice work!
I wanted to focus on your definition of deceptive alignment, because I currently feel unsure about whether it’s a more helpful framework than standard terminology. Substituting terms, your definition is:
Deceptive Alignment: When an AI has [goals that are not intended/endorsed by the designers] and [attempts to systematically cause a false belief in another entity in order to accomplish some outcome].
Here are some initial hesitations I have about your definition:
If we’re thinking about the emergence of DA during pre-deployment training, I worry that your definition might be too divorced from the underlying catastrophic risk factors that should make us concerned about “deceptive alignment” in the first place.
Hubinger’s initial analysis claims that the training process is likely to produce models with long-term goals.[1] I think his focus was correct, because if models don’t develop long-term/broadly-scoped goals, then I think deceptive alignment (in your sense) is much less likely to result in existential catastrophe.
If a model has long-term goals, I understand why strategic deception can be instrumentally incentivized. To the extent that strategic deception is incentivized in the absence of long-term goals, I expect that models will fall on the milder end of the ‘strategically deceptive’ spectrum.
Briefly, this is because the degree to which you’re incentivized to be strategic is going to be a function of your patience. In the maximally extreme case, the notion of a ‘strategy’ breaks down if you’re sufficiently myopic.
So, at the moment, I don’t think I’d talk about ‘deceptive alignment’ using your terminology, because I think it misses a crucial component of why deceptively aligned models could pose a civilizational risk.
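A toy way to see the patience point above: under exponential discounting (my assumption, not something the comment commits to), playing along now for a later payoff only beats defecting immediately when the discount factor is high enough. All names and numbers below are hypothetical.

```python
def prefers_waiting(gamma, immediate_payoff, delayed_payoff, delay_steps):
    """Return True if an agent with per-step discount factor `gamma`
    prefers a payoff `delay_steps` steps away over a payoff available now.

    Assumption (hypothetical): exponential discounting, so the delayed
    payoff is worth gamma ** delay_steps times its face value today.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (gamma ** delay_steps) * delayed_payoff > immediate_payoff
```

With `gamma = 0` (a fully myopic agent) the delayed payoff is worth nothing for any positive delay, so there is no multi-step 'strategy' left to speak of. As `gamma` approaches 1, even very distant payoffs can justify waiting.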
If we’re thinking about the risks of misaligned strategic deception more broadly, I think distinguishing between the training and oversight process is helpful. I also agree that it’s worth thinking about the risks associated with models whose goals are (loosely speaking) ‘in the prompts’ rather than ‘in the weights’.
That said, I’m a bit concerned that your more expansive definition encompasses a wide variety of different systems, many of which are accompanied by fairly distinct threat models.
The risks from LM agents look to be primarily (entirely?) misuse risks, which feels pretty different from the threat model standardly associated with DA. Among other things, one issue with LM agents appears to be that intent alignment is too easy.
One way I can see my objection mattering is if your definitions were used to help policymakers better understand people’s concerns about AI. My instinctive worry is that a policymaker who first encountered deceptive alignment through your work wouldn’t have a clear sense of why many people in AI safety have been historically worried about DA, nor have a clear understanding of why many people are worried about ‘DA’ leading to existential catastrophe. This might lead to policies which are less helpful for ‘DA’ in the narrower sense.
[1] Strictly speaking, I think ‘broadly-scoped goals’ is probably slightly more precise terminology, but I don’t think it matters much here.
Thanks for sharing this! A couple of (maybe naive) things I’m curious about.
Suppose I read ‘AGI’ as ‘Metaculus-AGI’, and we condition on AGI by 2025 — what sort of capabilities do you expect by 2027? I ask because I’m reminded of a very nice (though high-level) list of par-human capabilities for ‘GPT-N’ from an old comment:
discovering new action sets
managing its own mental activity
cumulative learning
human-like language comprehension
My immediate impression says something like: “it seems plausible that we get Metaculus-AGI by 2025, without the AI being par-human at 2, 3, or 6.”[1] This also makes me (instinctively, I’ve thought about this much less than you) more sympathetic to AGI → ASI timelines being >2 years, as the sort-of-hazy picture I have for ‘ASI’ involves (minimally) some unified system that bests humans on all of 1-6. But maybe you think that I’m overestimating the difficulty of reaching these capabilities given AGI, or maybe you have some stronger notion of ‘AGI’ in mind.
The second thing: roughly how independent are the first four statements you offer? I guess I’m wondering if the ‘AGI timelines’ predictions and the ‘AGI → ASI timelines’ predictions “stem from the same model”, as it were. Like, if you condition on ‘No AGI by 2030’, does this have much effect on your predictions about ASI? Or do you take them to be supported by ~independent lines of evidence?
[1] Basically, I think an AI could pass a two-hour adversarial Turing test without having the coherence of a human over much longer time-horizons (points 2 and 3). Probably less importantly, I also think that it could meet the Metaculus definition without being able to search as efficiently over known facts as humans (especially given that AIs will have a much larger set of ‘known facts’ than humans).
Two Tales of AI Takeover: My Doubts
Let me see if I can invert your essay into the things you need to do to utilize AI safely, contingent on your theory being correct.
I think this framing could be helpful, and I’m glad you raised it.
That said, I want to be a bit cautious here. I think that CP is necessary for stories like deceptive alignment and reward maximization. So, if CP is false, then I think these threat-models are false. I think there are other risks from AI that don’t rely on these threat-models, so I don’t take myself to have offered a list of sufficient conditions for ‘utilizing AI safely’. Likewise, I don’t think CP being true necessarily implies that we’re doomed (i.e., ).
Still, I think it’s fair to say that some of your “bad” suggestions are in fact bad, and that (e.g.) sufficiently long training-episodes are x-risk-factors.
Onto the other points.
If you allow complex off-task information to leak into the input from prior runs, you create the possibility of the model optimizing for both self generated goals (hidden in the prior output) and the current context. The self generated goals are consequentialist preferences.
I agree that this is possible. Though I feel unsure as to whether (and if so, why) you think AIs forming consequentialist preferences is likely, or plausible — help me out here?
You then raise an alternative threat-model.
Hostile actors can and will develop and release models without restrictions, with global context and online learning, that have spent centuries training in complex RL environments with hacking training. They will have consequentialist preferences and no episode time limit, with broad scope maximizing goals like (“win the planet for the bad actors”).
I agree that this is a risk worth worrying about. But, two points.
I think the threat-model you sketch suggests a different set of interventions from threat-models like deceptive alignment and reward maximization – this post is solely focused on those two threat-models.
On my current view, I’d be happier if marginal ‘AI safety funding resources’ were devoted to misuse/structural risks (of the kind you describe) over misalignment risks.
If we don’t get “broad-scoped maximizing goals” by default, then I think this, at the very least, is promising evidence about the nature of the offense/defense balance.
I don’t think so. Suppose Alex is an AI in training, and Alex endorses the value of behaving “harmlessly”. Then, I think the following claims are true of Alex:
Alex consistently cares about producing actions that meet a given criterion. So, Alex has some context-independent values.
On plausible operationalizations of ‘harmlessness’, Alex is also likely to possess, at given points in time, context-dependent, beyond-episode outcome-preferences. When Alex considers which actions to take (based on harmlessness), their actions are (in part) determined by what states of the world are likely to arise after their current training episode is over.
That said, I don’t think Alex needs to have consequentialist preferences. There doesn’t need to be some specific state of the world that they’re pursuing at all points in time.
To elaborate: this view says that “harmlessness” acts as something akin to a context-independent filter over possible (trajectory, outcome) pairs. Given some instruction, at a given point in time, Alex forms some context-dependent outcome-preferences.
That is, one action-selection criterion might be ‘choose an action which best satisfies my consequentialist preferences’. Another action-selection criterion might be: ‘follow instructions, given (e.g.) harmlessness constraints’.
The latter criterion can be context-independent, while only generating ‘consequentialist preferences’ in a context-dependent manner.
So, when Alex isn’t provided with instructions, they needn’t be well-modeled as possessing any outcome-preferences. I don’t think that a model which meets a minimal form of behavioral consistency (e.g., consistently avoiding harmful outputs) is enough to get you consequentialist preferences.
Hmmm … yeah, I think noting my ambiguity about ‘values’ and ‘outcome-preferences’ is good pushback. Thanks for helping me catch this! Spent some time trying to work out what I think.
Ultimately, I do want to say μH has context-independent values, but not context-independent outcome preferences. I’ll try to specify this a little more.
Justification Part I: Definitions
I said that a policy has preferences over outcomes when “there are states of the world the policy finds more or less valuable … ”, but I didn’t specify what it means to find states of the world more or less “valuable”. I’ll now say that a system (dis)values some state of the world s when:
It has an explicit representation of s as a possible state of the world, and
The prospect of the system’s outputs resulting in s is computationally significant in the system’s decision-making.
So, a system has a context-independent outcome-preference for a state of the world s if the system has an outcome-preference for s across all contexts. I think reward maximization and deceptive alignment require such preferences. I’ll also define what it means to value a concept.
A system (dis)values some concept (e.g., ‘harmlessness’) when that concept is computationally significant in the system’s decision-making.
Concepts are not themselves states of the world (e.g., ‘dog’ is a concept, but doesn’t describe a state of the world). Instead, I think of a concept X (like ‘dog’ or ‘harmlessness’) as something like a schema (or algorithm) for classifying possible inputs according to their X-ness (e.g., an algorithm for classifying possible inputs as dogs, or classifying possible inputs as involving ‘harmful’ actions).
With these definitions in mind, I want to say:
μH has ‘harmlessness’ as a context-independent value, because the learned concept of ‘harmlessness’ consistently shapes the policy’s behavior across a range of contexts (e.g., by influencing things like the generation of its feasible option set).
However, μH needn’t have a context-independent outcome-preference for s = “my actions don’t cause significant harm”, because it may not explicitly represent s as a possible state of affairs across all contexts.
For example, the ‘harmlessness’ concept could be computationally significant in shaping the feasible option set or the granularity of outcome representations, without ever explicitly representing ‘the world is in a state where my actions are harmless’ as a discrete outcome to be pursued.
I struggled to make this totally explicit, but I’ll offer a speculative sketch below of how μH’s cognition might work without CP.
Justification Part II: Decision-Making Without CP
I’ll start by stealing an old diagram from the shard theory discord server (cf. cf0ster). My description is closest to the picture of Agent Design B, and I’ll make free use of ‘shards’ to refer to ‘decision-influences’.
So, here’s how μH’s cognition might look in the absence of CP:
μH takes in some input request.
E.g., suppose it receives an input from someone claiming to be a child, who is looking for help debugging her code.
Together, the input and μH’s learned concepts generate a mental context.
The policy’s mental context is a cognitive description of the state of the total network. In this example, μH’s mental context might be: “Human child has just given me a coding problem” (though it could ofc be more complicated).
The mental context activates a set of ‘shards’ (or decision-influences).
In this example, the policy might have a “solve coding problem” shard, and a “be considerate” shard.
Activated shards ‘bid’ for actions with certain properties.
E.g., “pro-gentle shard” influences decision-making by bringing encouraging thoughts to mind, “pro-code-solving shard” influences decision-making by generating thoughts like “check for common code error #5390”.
Bids from shards generate an initial ‘option set’: this is a set of actions that meet the properties bid for by previously activated shards.
These might be actions like “check for common error #5390, then present corrected code to the child, alongside encouraging words”, alongside considerations like “ensure response is targeted”, “ensure response is considerate”.
Mental context “I’m presented with a set of actions” activates the “planning shard”, which selects an action based on contextually-generated considerations.
E.g., plans might be assessed against some kind of (weighted) vote count of activated shards.
The weighted vote count generates preferences over the salient outcomes caused by actions in the set.
μH performs the action.
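The steps above can be loosely sketched in code. This is a toy illustration under my own assumptions, not a claim about how any real policy works: shards are modelled as weighted scoring functions triggered by keywords in the mental context, and the option set is supplied by hand rather than generated by shard bids. Every name here (`Shard`, `choose_action`, the example weights) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Shard:
    name: str
    weight: float
    is_active: Callable[[str], bool]   # does this mental context activate the shard?
    score: Callable[[str], float]      # how strongly the shard bids for an action


def choose_action(context: str, shards: List[Shard], options: List[str]) -> Optional[str]:
    """Pick the option with the highest weighted vote among activated shards.

    Returns None when no shard is activated: outside a triggering context,
    nothing in this toy model ranks outcomes at all.
    """
    active = [s for s in shards if s.is_active(context)]
    if not active or not options:
        return None
    return max(options, key=lambda opt: sum(s.weight * s.score(opt) for s in active))


shards = [
    Shard("solve-coding-problem", 1.0,
          lambda ctx: "coding problem" in ctx,
          lambda act: 1.0 if "corrected code" in act else 0.0),
    Shard("be-considerate", 0.5,
          lambda ctx: "child" in ctx,
          lambda act: 1.0 if "encouraging" in act else 0.0),
]
options = [
    "present corrected code",
    "present corrected code with encouraging words",
    "offer encouraging words only",
]
chosen = choose_action("human child has just given me a coding problem", shards, options)
```

The design choice worth noticing is that preferences over options only exist relative to the activated shards. There is no fixed world-state that `choose_action` is trying to reach across all contexts, which is the sense in which the picture avoids CP.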
I don’t want to say “future AGI cognition will be well-modeled using Steps 1-7”. And there’s still a fair amount of imprecision in the picture I suggest. Still, I do think it’s a coherent picture of how the learned concept ‘harmlessness’ consistently plays a causal role in μH’s behavior, without assuming consequentialist preferences.
(I expect you’ll still have some issues with this picture, but I can’t currently predict why/how)
Largely echoing the points above, but I think a lot of Kambhampati’s cases (co-author on the paper you cite) stack the deck against LLMs in an unfair way. E.g., he offered the following problem to the NYT as a contemporary LLM failure case.
If block C is on top of block A, and block B is separately on the table, can you tell me how I can make a stack of blocks with block A on top of block B and block B on top of block C, but without moving block C?
When I read that sentence, it felt needlessly hard to parse. So I formatted the question in a way that felt more natural (see below), and Claude Opus appears to have no problem with it (3.5 Sonnet seems less reliable, haven’t tried with other models).
Block C is on top of Block A. Separately, Block B is on the table. Without moving Block C, can you make a stack of blocks such that:
Block A is on top of Block B, and
Block B is on top of Block C?
Tbc, I’m actually somewhat sympathetic to Kambhampati’s broader claims about LLMs doing something closer to “approximate retrieval” rather than “reasoning”. But I think it’s sensible to view the Blocksworld examples (and many similar cases) as providing limited evidence on that question.
Hm, what do you mean by “generalizable deceptive alignment algorithms”? I understand ‘algorithms for deceptive alignment’ to be algorithms that enable the model to perform well during training because alignment-faking behavior is instrumentally useful for some long-term goal. But that seems to suggest that deceptive alignment would only emerge – and would only be “useful for many tasks” – after the model learns generalizable long-horizon algorithms.
Interesting work, thanks for sharing!
I haven’t had a chance to read the full paper, but I didn’t find the summary account of why this behavior might be rational particularly compelling.
At a first pass, I think I’d want to judge the behavior of some person (or cognitive system) as “irrational” when the following three constraints are met:
The subject, in some sense, has the basic capability to perform the task competently, and
They do better (by their own values) if they exercise the capability in this task, and
In the task, they fail to exercise this capability.
Even if participants are operating with the strategy “maximize expected answer value”, I’d be willing to judge the participants’ responses as “irrational” if the participants were cognitively competent, understood the concept ‘90% confidence interval’, and were incentivized to be calibrated on the task (say, if participants received increased monetary rewards as a function of their calibration).
Pointing out that informativity is important in everyday discourse doesn’t do much to persuade me that the behavior of participants in the study is “rational”, because (to the extent I find the concept of “rationality” useful), I’d use the moniker to label the ability of the system to exercise their capabilities in a way that was conducive to their ends.
I think you make a decent case for claiming that the empirical results outlined don’t straightforwardly imply irrationality, but I’m also not convinced that your theoretical story provides strong grounds for describing participant behaviors as “rational”.