I would disagree that it is an assumption. That same draft talks about the outsized role of self-supervised learning in determining the particular ordering and kinds of concepts that human desires latch onto. Learning from reinforcement is a core component of value formation (under shard theory), but not the only one.
I disagree with how this post uses the word “values” throughout, rather than “desires” (or “preferences”), which (AFAICT) would be a better match to how the term is being used here.
This has definitely been a point of confusion. There are a couple of ways one might reasonably interpret the phrase “human values”:
the common denominator among all humans ever, regarding what they care about
the ethical consensus of (some subset of) humanity
the injunctions a particular human would verbally endorse
the cognitive artifacts inside each particular human that implement that human’s valuing-of-X, including the cases where they verbally endorse that valuing (along with a bunch of other kinds of preferences, both wanted and unwanted)
I think the shard theory workstream generally uses “human values” in the last sense.
I agree that at the moment, everything written about shard theory focuses on (1), since the picture is clearest there. Until very recently we didn’t feel we had a good model of how (2) worked. That being said, I believe the basic information inaccessibility problems remain, as the genome cannot pick out a particular thought to be reinforced based on its content, as opposed to based on its predictive summary/scorecard.
As I understand Steve’s model, each Thought Assessor takes in context signals from the world model representing different concepts activated by the current thought, and forms a loop with a generically hardwired control circuit (e.g. for salivation or cortisol levels). As a result, the ground truth used to supervise the loop must be something that the genome can directly recognize outside of the Learning Subsystem, like “We’re tasting food, so you really should’ve produced saliva already”. Those context signals are then trained to make long-term predictions relevant to saliva production, in learned-from-scratch contexts like sitting at a restaurant reading the entree description on the menu.
Each of those loops needs to be grounded in some way through control circuitry that the genome can construct within the Steering Subsystem, which means that absent some other mechanism, the ground-truth signals predicted by the Thought Assessors cannot be complex, learned-from-scratch concepts, even if the inputs to the Thought Assessors are. And as far as I can tell, the salivation Thought Assessor doesn’t know that its inputs are firing because I’m thinking “I’m in a restaurant reading a tasty-sounding description” (the content of the thought) as opposed to thinking any other salivation-predictive thought, making the content inaccessible to it. It would seem like there are lots of kinds of content that would be hard to ground out this way. For example, how would we set up such a circuit for “deception”?
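To make that concrete, here is a minimal sketch of the loop as I understand it, assuming a standard PyTorch setup; all of the names, shapes, and numbers below are my own made-up stand-ins, not anything from Steve's posts. The assessor's inputs are arbitrary learned context signals, but its supervision is only the simple, genome-recognizable ground truth.

```python
import torch
import torch.nn as nn

# Minimal sketch with hypothetical names: a Thought Assessor maps learned
# context signals from the world model to a salivation prediction, and is
# supervised only by a ground truth the genome can recognize directly
# ("food is being tasted right now"), never by the content of the thought.
assessor = nn.Linear(512, 1)                      # context signals -> salivation logit
opt = torch.optim.SGD(assessor.parameters(), lr=1e-2)

def assessor_step(context_signals, food_tasted_now):
    pred = assessor(context_signals)              # fires on menu-reading thoughts too
    loss = nn.functional.binary_cross_entropy_with_logits(pred, food_tasted_now)
    opt.zero_grad(); loss.backward(); opt.step()  # "you really should've salivated"
    return torch.sigmoid(pred)                    # drives the hardwired response

# The assessor never sees which thought produced these signals,
# only their predictive relationship to the ground truth.
assessor_step(torch.randn(1, 512), torch.ones(1, 1))
```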
There would seem to be many possible alignment schemes that would be unlocked if one had “master[ed] the theory of value formation in trained intelligences”, which is what I understand the primary goal of shard theory to be. My understanding is that we’re still working out a lot of the details of that theory, and that they need to be proved out in basic RL setups with small models.
If that’s the case, is there a reason to be singling out this particular ML technique (chain-of-thought prompting on LMs) + training goal (single value formation + punting the rest to alignment researchers) as “the shard theory alignment scheme” this early in the game?
In terms of detailed plans: What about, for example, figuring out enough details about shard theory to make preregistered predictions about the test-time behaviors and internal circuits you will find in an agent after training in a novel toy environment, based on attributes of the training trajectories? Success at that would represent a real win within the field, with a lot of potential for further progress downstream of that.
Re: the rest, even if all of those 4 approaches you listed are individually promising (which I’m inclined to agree with you on), the conjunction of them might be much less likely to work out. I personally consider them separate bets that can stand or fall on their own, and hope that if multiple pan out, their benefits will stack.
At the risk of reading too much into wording, I think the phrasing of the above two comments contains an interesting difference.
The first comment (TurnTrout) talks about reward as the thing providing updates to the agent’s cognition, i.e. “reward schedules produce … cognitive updates”, and expresses confusion about a prior quote that mentioned implementing our wishes through reward functions.
The second comment (paulfchristiano) talks about picking “rewards that would implement human wishes” and strategies for doing so.
These seem quite different. If I try to inhabit my model of TurnTrout, I expect he might say “But rewards don’t implement wishes! Our wishes are for the agent to have a certain kind of cognition.” and perhaps also “Whether a reward event helps us get the cognition we want is, first and foremost, a question of which circuits inside the agent are reinforced/refined by said event, and whether those particular circuits implement the kind of cognition we wanted the agent to have.”
Mostly agree with the content of this post, though I had some trouble with the Alice & Rick part.
I buy that the active chains-of-thought/actions in Alice’s mind will typically be ones that aren’t objected to by her current shards, both those that implement her Rick-approval-seeking and those that implement her aversion-to-converting. If they had been significantly objected to by those shards, the relevant thoughts & actions would have been discarded in favor of other ones. Given that, it makes sense to me that when you eliminate those, the remaining courses of action still leading to Alice converting might mainly involve “slow”/covert value drift.
What I am uncertain about is the appropriate level of agency to ascribe to her Rick-approval shard etc. I don’t personally imagine that such a shard is foresightedly (even if non-introspectively) searching for plans in the way described (for example, not bidding up the direct conversion plan if it were considered); any search it does would be much more passive. Like perhaps the behavior of the bundle of circuits could be “whenever the Rick-approval node turns off, make Alice feel a sharp pang of longing”, and “when Alice thinks of something that lights up the Rick-approval node, bid for it no matter how irrational the thought is”, and maybe even “when the Rick-approval node lights up, trigger the general-purpose rationalization machinery in the language area to start babbling”. But I imagine that the heavy lifting & sophistication is largely outside the shard.
Definitely glad to see some investigation into the path dependence question.
I expect that the primary source of safety-relevant path dependence in future systems will be the causal influence of the model’s behavior on its training data / supervision signal. That feedback should occur by default in reinforcement & active learning, but not in typical teacher-forced self-supervised learning (like GPT). So I think I would answer the question of “Are we in a high path-dependence world?” differently conditioned on different AI development models.
Even for GPTs, the recently popular “chain-of-thought” family of techniques seem poised to bring path-dependence into the mix, by creating feedback loops between the language model and the reasoning traces it produces.
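To illustrate the difference I have in mind, here is a toy sketch; everything in it is a made-up stand-in, not any real training setup. In the teacher-forced case the corpus is fixed ahead of time, while in the chain-of-thought case the model’s own traces flow back into what it later trains on.

```python
# Toy illustration with made-up stand-ins, not any real training setup.
class StubModel:
    def __init__(self):
        self.trained_on = []                      # stands in for learned parameters
    def generate(self, prompt):
        return prompt + " ... therefore X."       # stands in for a reasoning trace
    def update(self, example):
        self.trained_on.append(example)           # stands in for a gradient step

model = StubModel()

# Teacher-forced: the corpus is fixed, so behavior never feeds back into the data.
for example in ["doc 1", "doc 2", "doc 3"]:
    model.update(example)

# Chain-of-thought loop: sampled traces re-enter the training stream,
# so earlier behavior shapes later supervision (path dependence).
pool = ["question 1"]
for _ in range(3):
    trace = model.generate(pool[-1])
    model.update(trace)
    pool.append(trace)
```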
Hard to say how well a decision-heuristic that says “try new things in case they’re good” will measure up against the countervailing “keep doing the things you know are good” (or even a conservative extension to it, like “try new things if they’re sufficiently similar to things you know are good”). The latter would seemingly also be reinforced if it were considered. I do not feel confident reasoning about abstract things like these yet.
Likely of interest to you:
Enjoyed this post. Though it uses different terminology from the shard theory posts, I think it hits on similar intuitions. In particular it hits on some key identifiability problems that show up in the embedded context.
Thus, for sufficiently powerful RL algorithms (not policies!), we should expect them to tend to choose policies which implement wireheading.
I think that by construction, policies and policy fragments that exploit causal short circuits between their actions and the source of reinforcement will be reinforced if they are instantiated. That seems like a relatively generic property of RL algorithm design. But note that this is conditional on “if they are instantiated”. In general, credit assignment can only accurately reinforce what it gets feedback on, not based on counterfactual computations (because you don’t know what feedback you would have otherwise gotten). For example, when you compute gradients on a program that contains branching control flow, the gradients computed are based on how the computations on the actual branch taken contributed to the output, not based on how computations on other branches would have contributed.
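As a trivial concrete illustration of that last point, assuming a standard autodiff framework like PyTorch, the gradient only reflects the branch that actually ran:

```python
import torch

# Gradients are computed along the branch actually taken; the untaken
# branch contributes nothing, because it was never executed.
x = torch.tensor(2.0, requires_grad=True)

if x > 0:            # this branch runs for x = 2.0
    y = x ** 2
else:                # this branch never executes, so it receives no credit
    y = -10.0 * x

y.backward()
print(x.grad)        # tensor(4.) == d(x^2)/dx at x = 2, not -10
```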
For this reason, I think selection-level arguments around what we should expect of RL agents are weak, because they fail to capture the critical dependency of policy-updates on the existing policy. I expect that dependency to hold regardless of how “strong” the RL algorithm is, and especially so when the policy in question is clever in the way a human is clever. Maybe there is some way to escape this and thereby construct an RL method that can route around self-serving policies towards reward-maximizing ones, but I am doubtful.
In particular, one consequence of this is we also don’t need to postulate the existence of some kind of special as yet unknown algorithm that only exists in humans to be able to explain why humans end up caring about things in the world. Whether humans wirehead is determined by the same thing that determines whether RL agents wirehead.
Yep. This was one of the conclusions I took away from “Reward is not the optimization target”.
Not sure where the disagreement is here, if there is one. I agree that there are lots of sophisticated RL setups possible, including ones that bias the agent towards caring about certain kinds of things (like addictive sweets or wireheading). I also agree that even a basic RL algorithm like TD learning may be enough to steer the cognition of complex, highly capable agents in some cases. What is it that you are saying depends on the RL algorithm being weak?
if the simplest policy is one that cares about memory registers, then you will probably just end up with that.
I am generally very skeptical about simplicity arguments. I think that sort of reasoning generally does not work for making specific predictions about how neural networks will behave, particularly in the context of RL.
Not the OP but this jumped out at me:
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).
This failure mode seems plausible to me, but I can think of a few different plausible sequences of events that might occur, which would lead to different outcomes, at least in the shard lens.
Sequence 1:
The agent develops diamond-shard
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent exploits the gaps between the diamond-concept and the label-process-concept, which reinforces the label-process-shard within it
The label-process-shard drives the agent to continue exploiting the above gap, eventually (and maybe rapidly) overtaking the diamond-shard
So the agent’s values drift away from what we intended.
Sequence 2:
The agent develops diamond-shard
The diamond-shard becomes part of the agent’s endorsed preferences (the goal-content it foresightedly plans to preserve)
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent understands that if it exploited the gaps between the diamond-concept and the label-process-concept, it would be reinforced into developing a label-process-shard that would go against its endorsed preference for diamonds (i.e. its diamond-shard), so it chooses not to exploit that gap, in order to avoid value drift.
So the agent continues to value diamonds in spite of the imperfect labeling process.
These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.
Possibly, though I think it is extremely easy in a context like this: keeping the diamond-shard in the driver’s seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.
Throwing in a perspective, as someone who has been a reviewer on most of the shard theory posts & is generally on board with them. I agree with your headline claim that “niceness is unnatural” in the sense that niceness/friendliness will not just happen by default, but not in the sense that it has no attractor basin whatsoever, or that it is incoherent altogether (which I don’t take you to be saying). A few comments on the four propositions:
There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.
Yes! Recapitulating those selection pressures (the ones that happened to have led to niceness-supporting reward circuitry & inductive biases in our case) indeed seems like a doomed plan. There are many ways for that optimization process to shake out, nearly all of them ruinous. It is also unnecessary. Reverse-engineering the machinery underlying social instincts doesn’t require us to redo the evolutionary search process that produced them, nor is that the way I think we will probably develop the relevant AI systems.
The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly “nice”.
Similar to the above, I agree that the particular form of niceness in humans developed because of “specifics of our ancestral environment”, but note that the effects of those contingencies are pretty much screened off by the actual design of human minds. If we really wanted to replicate that niceness, I think we could do so without reference to DNA or calorie-constraints or firing speeds, using the same toolbox as we already use in designing artificial neural networks & cognitive architectures for other purposes. That being said, I don’t think “everyday niceness circa 2022” is the right kind of cognition to be targeting, so I don’t worry too much about the contingent details of that particular object, whereas I worry a lot about getting something that terminally cares about other agents at all, which seems to me like one of the hard parts of the problem.
Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI’s mind to work differently from how you hope it works.
If empathy or niceness or altruism—or whatever other human-compatible cognition we need the AI’s mind to contain—depends critically on some particular architectural choice like “modeling others with the same circuits as the ones with which you model yourself”, then… that’s the name of the game, right? Those are the design constraints that we have to work under. I separately also believe we will make some similar design choices because (1) the near-term trajectory of AI research points in that general direction and (2) as you note, they are easy shortcuts (ML always takes easy shortcuts). I do not expect those views to be shared, though.
The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.
Maybe? It seems plausible to me that, if an agent already terminally values altruism and endorses that valuing, then as it attempts to resolve the remaining conflicts within itself, it will try to avoid resolutions that foreseeably-to-it remove or corrupt its altruism-value. It sounds like you are thinking specifically about the period after the AI has internalized the value somewhat, but before the AI reflectively endorses it? If so, then yes I agree, ensuring that a particular value hooks into the reflective process well enough to make itself permanent is likely nontrivial. This is what I believe TurnTrout was pointing at in “A shot at the diamond alignment problem”, in the major open questions list:
4. How do we ensure that the diamond shard generalizes and interfaces with the agent’s self-model so as to prevent itself from being removed by other shards?
Note: “ask them for the faciest possible thing” seems confused.
How I would’ve interpreted this if I were talking with another ML researcher is “Sample the face at the point of highest probability density in the generative model’s latent space”. For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
I’m guessing what he has in mind is more like “take a GAN discriminator / image classifier & find the image that maxes out the face logit”, but if so, why is that the relevant operationalization? It doesn’t correspond to how such a model is actually used.
EDIT: Here is what the first looks like for StyleGAN2-ADA.
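For reference, the operation I mean is roughly the following (a sketch only: `load_face_generator` is a hypothetical placeholder for loading a pretrained face GAN such as StyleGAN2-ADA, not its actual API):

```python
import torch

# Sketch of "sample at the highest-density point of the latent prior".
# load_face_generator is a hypothetical placeholder, not a real API call.
G = load_face_generator("ffhq")   # generator pretrained on human faces
z = torch.zeros(1, G.z_dim)       # the mode of the standard Gaussian prior
img = G(z)                        # decodes to an ordinary, non-Eldritch face
```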
Upvoted because I agree with all of the above.
AFAICT the original post was using the faces analogy in a different way than Nate is. It doesn’t claim that the discriminators used to supervise GAN face learning or the classifiers used to detect faces are adversarially robust. That isn’t the point it’s making. It claims that learned models of faces don’t “leave anything important out” in the way that one might expect some key feature to be “left out” when learning to model a complex domain like human faces or human values. And that seems well-supported: the trajectory of modern ML has shown that learning such complex models is far easier than we might’ve thought, even if building adversarially robust classifiers is very hard. (As much as I’d like to have supervision signals that are robust to arbitrarily-capable adversaries, it seems non-obvious to me that that is even required for success at alignment.)
Not the OP, so I’ll try to explain how I understood the post based on past discussions. [And pray that I’m not misrepresenting TurnTrout’s model.]
As I read it, the post is not focused on some generally-applicable suboptimality of SGD, nor is it saying that policies that would maximize reward in training need to explicitly represent reward.
It is mainly talking about an identifiability gap within certain forms of reinforcement learning: there is a range of cognition compatible with the same reward performance. Computations that have the side effect of incrementing reward—because, for instance, the agent is competently trying to do the rewarded task—would be reinforced if the agent adopted them, in the same way that computations that act *in order to* increment reward would. Given that, some other rationale beyond the reward performance one seems necessary in order for us to expect the particular pattern of reward optimization (“reward but no task completion”) from RL agents.
In addition to the identifiability issue, the post (as well as Steve Byrnes in a sister thread) notes a kind of inner alignment issue. Because an RL agent influences its own training process, it can steer itself towards futures where its existing motivations are preserved instead of being modified (for example, modified into reward-optimizing ones). In fact, that seems more and more likely as the agent grows towards strategic awareness, since then it could model how its behavior might lead to its goals being changed. This second issue depends on the fact that we are doing local search, in that the current agent can sway which policies are available for selection.
Together these point towards a certain way of reasoning about agents under RL: modeling their current cognition (including their motivations, values etc.) as downstream of past reinforcement & punishment events. I think that this kind of reasoning should constrain our expectations about how reinforcement schedules + training environments + inductive biases lead to particular patterns of behavior, in a way that is more specific than if we were only reasoning about reward-optimal policies. Though I am less certain at the moment about how to flesh that out.