Alex Turner, postdoctoral researcher at the Center for Human-Compatible AI. Reach me at turner.alex[at]berkeley[dot]edu.
TurnTrout
That is to say, prior to “simulators” and “shard theory”, a lot of focus was on utility-maximizers—agents that do things like planning or search to maximize a utility function. But planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. RL agents that execute learned policies: policies which do not explicitly maximize reward at deployment, but which were reinforced for doing so during training.
FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to “argmax over crisp human-specified utility function.” (In the language of the OP, I expect values-executors, not grader-optimizers.)
I have no way of knowing that increasing the candy-shard’s value won’t cause a phase shift that substantially increases the perceived value of the “kill all humans, take their candy” action plan. I ultimately care about the agent’s “revealed preferences”, and I am not convinced that those are smooth relative to changes in the shards.
I’m not either. I think there will be phase changes wrt “shard strengths” (keeping in mind this is a leaky abstraction), and this is a key source of danger IMO.
Basically my stance is “yeah there are going to be phase changes, but there are also many perturbations which don’t induce phase changes, and I really want to understand which is which.”
Lol, cool. I tried the “4 minute” challenge (without having read EY’s answer, but having read yours).
Hill-climbing search requires selecting on existing genetic variance among alleles already in the gene pool. If there isn’t a local mutation which changes the eventual fitness of the properties which that genotype unfolds into, then you won’t have selection pressure in that direction. Gradient descent, on the other hand, updates live on a bunch of data in fast iterations, which allows running modifications over the parameters themselves. It’s like being able to change a blueprint for a house, versus being at the house during the day and directing the repair-people.
The changes happen online, relative to the actual within-cognition goings-on of the agent (e.g. you see some cheese, go to the cheese, get a policy gradient and become more likely to do it again). Compare that to having to try out a bunch of existing tweaks to a cheese-bumping-into agent (e.g. make it learn faster early in life but then get sick and die later), where you can’t get detailed control over its responses to specific situations (you can just tweak the initial setup).
Gradient descent is just a fundamentally different operation. You aren’t selecting over learning processes which unfold into minds, trying out a finite but large gene pool of variants, and then choosing the most self-replicating; you are instead doing local parametric search over what changes outputs on the training data. But RL isn’t even differentiable; you aren’t running gradients through it directly. So there isn’t even an analogue of “training data” in the evolutionary regime.
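To make that contrast concrete, here is a toy sketch (my own illustration with a made-up linear task, not anything from the original exchange): selection only sees end-of-lifetime fitness of pre-existing variants, while gradient descent edits the parameters themselves, online, in response to each situation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([3.0, -1.0])
X = rng.normal(size=(100, 2))          # "situations" encountered over a lifetime
y = X @ true_w

def lifetime_fitness(genome):
    # Selection only sees an end-of-lifetime summary of how the genome did.
    return -np.mean((X @ genome - y) ** 2)

# Evolution-style: select among already-existing variants, then mutate the blueprints.
population = [rng.normal(size=2) for _ in range(50)]
for generation in range(30):
    parents = sorted(population, key=lifetime_fitness, reverse=True)[:10]
    population = [p + 0.1 * rng.normal(size=2) for p in parents for _ in range(5)]

# Gradient-descent-style: change the parameters themselves, live, after each situation.
params = rng.normal(size=2)
for x_i, y_i in zip(X, y):
    err = params @ x_i - y_i           # how wrong the current parameters are right now
    params -= 0.05 * 2 * err * x_i     # exact local update: "directing the repair-people"
```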
I think I ended up optimizing for “actually get model onto the page in 4 minutes” and not for “explain in a way Scott would have understood.”
Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I’m still confused why Scott concluded “oh I was just confused in this way” and then EY said “yup that’s why you were confused”, and I’m still like “nope Scott’s question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset.”
Thanks for leaving this comment, I somehow only just now saw it.
Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not “optimizing the outputs of the grader as its main terminal motivation”?

I want to make a use/mention distinction. Consider an analogous argument:

“Given gradient descent’s pseudocode it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn. Gradient descent selects over all directional derivatives for the gradient, which is the direction of maximal loss reduction. Why is that not “optimizing the outputs of the loss function as gradient descent’s main terminal motivation”?”[1]

Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than “randomly sample from all global minima in the loss landscape” (analogously: “randomly sample a plan which globally maximizes grader output”).
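A minimal PyTorch sketch of that use/mention point (my own toy, not anything from the original discussion): backward produces a local direction of loss reduction, which is a different operation from enumerating or sampling the loss function’s global minimizers.

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

# One step of gradient descent: a local parameter modification that reduces loss a bit.
loss = loss_fn(model(x), y)
loss.backward()   # computes a direction of local loss reduction
opt.step()        # takes a small local step; nothing here enumerates global minima

# What gradient descent is *not* doing (and which would have very different properties):
# uniformly sampling a parameter setting from the set of global minimizers of loss_fn.
```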
But I still haven’t answered your broader question. I think you’re asking for a very reasonable definition which I have not yet given, in part because I’ve remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw, at least intensionally (which is why I’ve focused on giving examples, in the hope of getting the vibe across).
I gave it a few more stabs, and I don’t think any of them ended up being sufficient. But here they are anyways:
A “grader-optimizer” makes decisions primarily on the basis of the outputs of some evaluative submodule, which may or may not be explicitly internally implemented. The decision-making is oriented towards making the outputs come out as high as possible.
In other words, the evaluative “grader” submodule is optimized against by the planning.
IE the process plans over “what would the grader say about this outcome/plan”, instead of just using the grader to bid the plan up or down.
I wish I had a better intensional definition for you, but that’s what I wrote immediately and I really better get through the rest of my comm backlog from last week.
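For whatever it’s worth, here is one toy way to gesture at the flavor of that distinction in code. The names mirror the pseudocode under discussion (planModificationSample, diamondShard, getConseq), but the bodies are my own illustrative assumptions rather than the original post’s pseudocode, and this is a sketch of a vibe rather than a definition:

```python
import random

rng = random.Random(0)

# Toy stand-ins for the pieces named in the pseudocode (my own assumptions):
def getConseq(plan):                 # world-model: predicted consequences of a plan
    return sum(plan)                 # here, "number of diamonds produced"

def diamondShard(conseq):            # evaluative shard: likes more predicted diamonds
    return conseq

def planModificationSample(plan):    # sample a local tweak to the current plan
    i = rng.randrange(len(plan))
    new = list(plan)
    new[i] += rng.choice([-1, 1])
    return new

def values_executor_step(plan):
    # The shard bids a sampled local modification up or down; the search itself is not
    # "find whatever input makes diamondShard output the biggest number".
    candidate = planModificationSample(plan)
    if diamondShard(getConseq(candidate)) > diamondShard(getConseq(plan)):
        return candidate
    return plan

def grader_optimizer_step(all_plans):
    # Plans over "what would the grader say", explicitly maximizing that output. In a richer
    # plan space, this is the search that would surface evaluator-exploiting plans.
    return max(all_plans, key=lambda p: diamondShard(getConseq(p)))
```

The point of the sketch is the shape of the search, not the toy numbers: in the first function the shard’s evaluation bids local modifications up or down, while the second function’s whole search target is the evaluation output itself.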
Here are some more replies which may help clarify further.
in particular, such a system would produce plans like “analyze the diamond-evaluator to search for side-channel attacks that trick the diamond-evaluator into producing high numbers”, whereas planModificationSample would not do that (as such a plan would be rejected by self.diamondShard(self.WM.getConseq(plan))).

Aside—I agree with both bolded claims, but think they are separate facts. I don’t think the second claim is the reason for the first being true. I would rather say, “planModificationSample would not do that, because self.diamondShard(self.WM.getConseq(plan)) would reject side-channel-attack plans.”

Grader is complicit: In the diamond shard-shard case, the grader itself would say “yes, please do search for side-channel attacks that trick me”
No, I disagree with the bolded underlined part. self.diamondShardShard wouldn’t be tricking itself, it would be tricking another evaluative module in the AI (i.e. self.diamondShard).[2]

If you want an AI which tricks the self.diamondShardShard, you’d need it to primarily use self.diamondShardShardShard to actually steer planning. (Or maybe you can find a weird fixed-point self-tricking shard, but that doesn’t seem central to my reasoning; I don’t think I’ve been imagining that configuration.)

and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?

Oh, I would change self.diamondShard to self.diamondShardShard?

[1] I think it’s silly to say that GD has a terminal motivation, but I’m not intending to imply that you are silly to say that the agent has a terminal motivation.

[2] Or more precisely, self.diamondGrader -- since “self.diamondShard” suggests that the diamond-value directly bids on plans in the grader-optimizer setup. But I’ll stick to self.diamondShardShard for now and elide this connotation.
This post crystallized some thoughts that have been floating in my head
+1. I’ve explained a less clear/expansive version of this post to a few people this last summer. I think there is often some internal value-violence going on when many people fixate on Impact.
Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model
FWIW I don’t consider myself to be arguing against planning over a world model.
the important thing to realise is that ‘human values’ do not really come from inner misalignment wrt our innate reward circuitry but rather are the result of a very long process of social construction influenced both by our innate drives but also by the game-theoretic social considerations needed to create and maintain large social groups, and that these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives which is how we practically interact with our values.
Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e. for humans we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) as well as top-down association of abstract value concepts with other more grounded linguistic concepts.
Can you give me some examples here? I don’t know that I follow what you’re pointing at.
How can reward be unnecessary as a ground-truth signal about alignment? Especially if “reward chisels cognition”?
Reward’s purpose isn’t to demarcate “this was good by my values.” That’s one use, and it often works, but it isn’t intrinsic to reward’s mechanistic function. Reward develops certain kinds of cognition / policy network circuits. For example, consider reward-shaping a dog to stand on its hind legs. I don’t reward the dog because I intrinsically value its front paws being slightly off the ground for a moment. I reward the dog at that moment because that helps develop the stand-up cognition in the dog’s mind.
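A toy sketch of that “reward as chisel” picture (entirely made up, just to illustrate the framing): the shaped reward for intermediate postures isn’t there because the trainer values those postures; it’s there because each reward event strengthens whatever behavior just got rewarded.

```python
import random

rng = random.Random(0)

# Toy picture of reinforcement: each reward event makes the just-taken action more likely,
# whether or not the trainer terminally values that exact action.
action_scores = {"sit": 0.0, "lift_paws": 0.0, "stand": 0.0}

def shaped_reward(action):
    # Partial credit for paws-slightly-off-the-ground, not because it is valued for its own
    # sake, but because rewarding it helps build the stand-up behavior.
    return {"sit": 0.0, "lift_paws": 0.3, "stand": 1.0}[action]

for trial in range(200):
    explore = rng.random() < 0.2
    action = rng.choice(list(action_scores)) if explore else max(action_scores, key=action_scores.get)
    action_scores[action] += 0.1 * shaped_reward(action)   # reward strengthens what just fired
```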
Strong upvoted. I appreciate the strong concreteness & focus on internal mechanisms of cognition.
I think a bunch of alignment value will/should come from understanding how models work internally—adjudicating between theories like “unitary mesa objectives” and “shards” and “simulators” or whatever—which lets us understand cognition better, which lets us understand both capabilities and alignment better, which indeed helps with capabilities as well as with alignment.
But, we’re just going to die in alignment-hard worlds if we don’t do anything, and it seems implausible that we can solve alignment in alignment-hard worlds by not understanding internals or inductive biases but instead relying on shallowly observable in/out behavior. EG I don’t think loss function gymnastics will help you in those worlds. Credence: 75% that you have to know something real about how loss provides cognitive updates.
So in those worlds, it comes down to questions of “are you getting the most relevant understanding per unit time”, and not “are you possibly advancing capabilities.” And, yes, often motivated-reasoning will whisper the former when you’re really doing the latter. That doesn’t change the truth of the first sentence.
this seems like a bottleneck to RL progress—not knowing why your perfectly reasonable setup isn’t working
Aren’t RL tuning problems usually due to algorithmic mis-implementation, rather than to models learning incorrect things?
not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for it to be alignment relevant
Required to be alignment relevant? Wouldn’t the insight be alignment relevant if you “just” knew what the formed values are to begin with?
Thanks for the effortpost. I feel like I have learned something interesting reading it—maybe sharpened thoughts around cell membranes vis a vis boundaries and agency and (local) coherence which adds up to global coherence.
My biggest worry is that your essay seems to frame consequentialism as empirically quantified by a given rule (“the pattern traps that accumulated the patterns-that-matter”), which may well be true, but doesn’t give me much intensional insight into why this happened, into what kinds of cognitive algorithms possess this interesting-seeming “consequentialism” property!
Or maybe I’m missing the point.
You can always try to see a rock as an agent—no one will arrest you. But that lens doesn’t accurately predict much about what the inanimate object will do next. Rocks like to sit inert and fall down, when they can; but they don’t get mad, or have a conference to travel to later this month, or get excited to chase squirrels. Most of the cognitive machinery you have for predicting the scheming of agents lies entirely fallow when applied to rocks.
The intentional stance seems useful insofar as the brain has picked up on a real regularity. Which I think it has. Just noting that reaction.
I worry this is mostly not about the territory, but about the map’s reaction to the territory, in a way which may not be tracked by your analysis?
Agents are bubbles of reflexes, when those reflexes are globally coherent among themselves. And exactly what way those reflexes are globally coherent (there are many possibilities) fixes what the agent cares about terminally tending toward.
This isn’t quite how I, personally, would put it (“reflexes” seems too unsophisticated and non-generalizing for my taste, even compared to “heuristics”). But I really like the underlying sentiment/insight/frame.
Some of my disagreements with List of Lethalities
Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) → action mappings. EG a shard agent might have:
An “it’s good to give your friends chocolate” subshard
A “give dogs treats” subshard
-> An impulse to give dogs chocolate, even though the shard agent knows what the result would be
But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)
In this way, changing a small set of decision-relevant features (e.g. “Brown dog treat” → “brown ball of chocolate”) changes the consequentialist’s action logits a lot, way more than it changes the shard agent’s logits. In a squinty, informal way, the (belief state → logits) function has a lower Lipschitz constant/is more smooth for the shard agent than for the consequentialist agent.
So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat → tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog → sick dog). You could spin up two copies of the model to compare.
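A rough sketch of what such a test might look like (hypothetical: `policy`, the observation encodings, and the KL-based score are all my own assumptions, not an existing tool):

```python
import torch
import torch.nn.functional as F

def decision_sensitivity(policy, obs_treat, obs_chocolate):
    # `policy` is assumed to map an observation tensor to action logits.
    with torch.no_grad():
        logits_a = policy(obs_treat)
        logits_b = policy(obs_chocolate)
    # KL divergence between the two action distributions: a crude proxy for how much the
    # small observational change moved the agent's decision.
    return F.kl_div(F.log_softmax(logits_b, dim=-1),
                    F.softmax(logits_a, dim=-1),
                    reduction="sum").item()

# Prediction under the hypothesis above: the score stays small for a shard-like agent
# (its "give dogs treats" heuristic fires either way) and is large for a consequentialist
# (it re-derives the downstream outcome, happy dog vs. sick dog, and switches its action).
```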
One reason why the latter may happen is that [the diamond shard] possibly becomes so complicated that it’s “hard to attach more behavior to it”; maybe it’s just simpler to create an entirely new module that solves this task and doesn’t care about diamonds. If something like this happens often enough, then eventually, the diamond shard may lose all its influence.
I don’t currently share your intuitions for this particular technical phenomenon being plausible, but I imagine there are other possible reasons this could happen, so sure? I agree that there are some ways the diamond-shard could lose influence. But mostly, again, I expect this to be a quantitative question, and I think experience with people suggests that trying a fun new activity won’t wipe away your other important values.
I think it’s worth it to use magic as a term of art, since it’s 11 fewer words than “stuff we need to remind ourselves we don’t know how to do,” and I’m not satisfied with “free parameters.”
11 fewer words, but I don’t think it communicates the intended concept!
If you have to say “I don’t mean one obvious reading of the title” as the first sentence, it’s probably not a good title. This isn’t a dig—titling posts is hard, and I think it’s fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited:
“Uncertainties left open by Shard Theory”
“Limitations of Current Shard Theory”
“Challenges in Applying Shard Theory”
“Unanswered Questions of Shard Theory”
“Exploring the Unknowns of Shard Theory”
After considering these, I think that “Reminder: shard theory leaves open important uncertainties” is better than these five, and far better than the current title. I think a better title is quite within reach.
But how do we learn that fact?
I didn’t claim that I assign high credence to alignment just working out. I’m saying that it may turn out, as a matter of fact, that shard theory doesn’t “need a lot more work,” because alignment works out from the obvious setups people try.
There’s a degenerate version of this claim, where ST doesn’t need more work because alignment is “just easy” for non-shard-theory reasons, and in that world ST “doesn’t need more work” because alignment itself doesn’t need more work.
There’s a less degenerate version of the claim, where alignment is easy for shard-theory reasons—e.g. agents robustly pick up a lot of values, many of which involve caring about us.
“Shard theory doesn’t need more work” (in sense 2) could be true as a matter of fact, without me knowing it’s true with high confidence. If you’re saying “for us to become highly confident that alignment is going to work this way, we need more info”, I agree.
But I read you as saying “for this to work as a matter of fact, we need X Y Z additional research”:
At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem the necessitates a different approach.
And I think this is wrong. Sense 2 can just be true, and we won’t justifiably know it. So I usually say “It is not known to me that I know how to solve alignment”, and not “I don’t know how to solve alignment.”
Does that make sense?
“The Policy of Truth” is a blog post about why policy gradient/REINFORCE suck. I’m leaving a shortform comment because it seems like a classic example of wrong RL theory and philosophy, since reward is not the optimization target. Quotes:
Our goal remains to find a policy that maximizes the total reward after time steps.
And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:
If you start with a reward function whose values are in and you subtract one million from each reward, this will increase the running time of the algorithm by a factor of a million, even though the ordering of the rewards amongst parameter values remains the same.
The latter is pretty easily understood if you imagine each reward as providing a policy gradient, rather than imagining the point of the algorithm as finding the policy which happens to be an expected fixed point under a representative policy gradient (aka the “optimal” policy). Of course making all the rewards hugely negative will mess with your convergence properties! That’s very related to a much higher learning rate, and to only getting inexact gradients (due to negativity) instead of exact ones. Different dynamics.
Policy gradient approaches should be judged on whether they can train interesting policies doing what we want, and not whether they make reward go brr. Often these are related, but they are importantly not the same thing.
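To make the constant-offset point concrete, here is a toy two-armed-bandit REINFORCE run (my own toy setup, not from the linked post): subtracting a huge constant preserves the ordering of the arms but wrecks the update dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.8])            # arm 1 is genuinely better

def train(reward_offset, steps=2000, lr=0.01):
    theta = np.zeros(2)                       # softmax policy parameters
    for _ in range(steps):
        logits = theta - theta.max()          # numerically stable softmax
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(2, p=probs)
        r = true_reward[a] + reward_offset    # same ordering of arms, shifted rewards
        grad_logp = -probs.copy()
        grad_logp[a] += 1.0                   # d/dtheta of log pi(a)
        theta = theta + lr * r * grad_logp    # each reward event provides a policy gradient
    return probs

print(train(0.0))    # drifts toward preferring the better arm
print(train(-1e6))   # the first sampled action gets slammed toward zero probability, so the
                     # policy locks in near-arbitrarily, despite the unchanged reward ordering
```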
FWIW I strong-disagreed with that comment because of the latter part:
Gradient descent isn’t really different from what evolution does. It’s just a bit faster, and takes a slightly more direct line. Importantly, it’s not more capable of avoiding local maxima (per se, at least).
I feel neutral/slight-agree about the relation to the linked titular comment.
“Magic,” of course, in the technical sense of stuff we need to remind ourselves we don’t know how to do. I don’t mean this pejoratively, locating magic is an important step in trying to demystify it.
I think this title suggests a motte/bailey, and also seems clickbait-y. I think most people scanning the title will conclude you mean it in a pejorative sense, such that shard theory requires impossibilities or unphysical miracles to actually work. I think this is clearly wrong (and I imagine you to agree). As such, I’ve downvoted for the moment.
AFAICT your message would be better conveyed by “Shard theory alignment has many free parameters.”
If shard theory alignment seemed like it has few free parameters, and doesn’t need a lot more work, then I think you failed to see the magic. I think the free parameters haven’t been discussed enough precisely because they need so much more work.
The things you say about known open questions mostly seem like things I historically have said (which is not at all to say that you haven’t brought any new content to the table!). For example,
“In environments of moderate complexity (e.g. Atari, MuJoCo), we can study how to build curricula that impart different generalization behaviors, and try to make predictive models of this process. Even if shard theory alignment doesn’t pan out, this sounds like good blue-sky research.” → My shortform comment from last July.
I also talk about many of the same things in the original diamond alignment post (“open questions” which I called “real research problems”) and in my response to Nate’s critique of it:
“I agree that the reflection process seems sensitive in some ways. I also give the straightforward reason why the diamond-values shouldn’t blow up: Because that leads to fewer diamonds. I think this a priori case is pretty strong, but agree that there should at least be a lot more serious thinking here, eg a mathematical theory of value coalitions.”
You might still maintain there should be more discussion. Yeah, sure. [EDIT: clipped a part which I think isn’t responding to what you meant to communicate]
and doesn’t need a lot more work,
I think it’s quite plausible that you don’t need much more work for shard theory alignment, because value formation really is that easy / robust.
If the curriculum is only made up only of simple environments, then the RL agent will learn heuristics that don’t need to refer to humans. But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren’t what we intended. Does a goldilocks zone where the agent learns more-or-less what we intended exist? How can we build confidence that it does, and that we’ve found it?
Not obvious to me that there is something like a one-dimensional “goldilocks” zone flanked by bad zones. An interesting framing, but not one that feels definitive (if you meant it that way).
At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem the necessitates a different approach.
At best, shard theory alignment just works as-is, with some thought being taken like in the diamond alignment post.
Eliezer’s reasoning is surprisingly weak here. It doesn’t really interact with the strong mechanistic claims he’s making (“Motivated reasoning is definitely built-in, but it’s built-in in a way that very strongly bears the signature of ‘What would be the easiest way to build this out of these parts we handily had lying around already’”).
He just flatly states a lot of his beliefs as true:
Local invalidity via appeal to dubious authority; conventional explanations are often totally bogus, and in particular I expect this one to be bogus.
But Eliezer just keeps stating his dubious-to-me stances as obviously True, without explaining how they actually distinguish between mechanistic hypotheses, or e.g. why he thinks he can get so many bits about human learning process hyperparameters from results like Wason (I thought it’s hard to go from superficial behavioral results to statements about messy internals? & inferring “hard-coding” is extremely hard even for obvious-seeming candidates).
Similarly, in the summer (consulting my notes + best recollections here), he claimed ~”Evolution was able to make the (internal physiological reward schedule) ↦ (learned human values) mapping predictable because it spent lots of generations selecting for alignability on caring about proximate real-world quantities like conspecifics or food” and I asked “why do you think evolution had to tailor the reward system specifically to make this possible? what evidence has located this hypothesis?” and he said “I read a neuroscience textbook when I was 11?”, and stared at me with raised eyebrows.
I just stared at him with a shocked face. I thought, surely we’re talking about different things. How could that data have been strong evidence for that hypothesis? Really, neuroscience textbooks provide huge evidence for evolution having to select the reward->value mapping into its current properties?
I also wrote in my journal at the time:
Eliezer seems to attach some strange importance to the learning process being found by evolution, even though the learning initial conditions screen off evolution’s influence..? Like, what?
I still don’t understand that interaction. But I’ve had a few interactions like this with him, where he confidently states things, and then I ask him why he thinks that, and he offers some unrelated-seeming evidence which doesn’t—AFAICT—actually discriminate between hypotheses.