I don’t think that shards are distinct—neither physically nor logically, so they can’t hide stuff in the sense of keeping it out of view of the other shards.
Also, I don’t think “querying for plans” is a good summary of what goes on in the brain.
I’m coming more from a brain-like AGI lens, and my account of what goes on would be a bit different. I’m trying to phrase this in shard theory terminology.
First, a prerequisite: Why do Alice’s shards generate thoughts that value Rick’s state, to begin with? The Rick-shard has learned that actions that make Rick happy result in states of Alice that are reinforced (Alice being happy/healthy).
Given that, I see the process as follows:
Alice’s Rick-shards generate thoughts at different levels of abstraction about Alice being happy/healthy because Rick is happy/likes her. Examples:
Conversion (maybe a cached thought out of discussions they had) → will have low predicted value for Alice
Going to church with Rick → mixed
Being close to Rick in the Church (emphasis on closeness, Church in the background, few aspects active) → trends positive
Being in the Church and thinking it’s wrong → consistent
Rick being happy that she joins him → positive
So the world model returns no plan but only fragments of potential plans, some where she converts and goes to church with Rick, some not, some other combinations.
As there is no plan, no purpose needs to be hidden. Shards only bid for or against parts of plans.
Some of these fragments satisfy enough requirements of both retaining atheist Alice’s values (which are predicted to be good for her) and scoring on Rick-happiness. Elaborating on these fragments will activate further details that are also at least somewhat compatible with all shards. We only call the result of this a “rationalization.”
So that she eventually generates enough positively scoring detailed thoughts that she actually decides to implement an aggregate of these fragments, which we can call a church-going plan.
So that she gets positive reinforcement for going to church,
which reinforces all aspects of the experience, including being in church, which, in aggregate, we can call a religion-shard,
I agree that this changes her internal shard balance significantly—she has learned something she didn’t know before, and that leaves her better off (as measured by some fundamental health/happiness measurements).
I think this can be meaningfully called value drift only with respect to either existing shards (though these are an abstraction we are using, not something that’s fundamental to the brain), or with respect to Alice’s interpretations/verbalizations—thoughts that are themselves reinforced by shards.
so that more such thoughts come up, and she eventually converts,
so that Rick ends up happier and liking Alice more—though that was never the “plan” to begin with.
In short: There is no top-down planning but bottom-up action generation. All planning is constructed out of plan fragments that are compatible with all (existing) shards.
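To make the bottom-up picture concrete, here is a minimal toy sketch of the filtering step described above: shards score plan fragments, fragments that no shard strongly opposes survive, and the surviving aggregate is what we retroactively call a “plan.” All names, fragments, and scores are hypothetical, invented purely for illustration; nothing here is meant as a model of actual neural machinery.

```python
# Toy model: shards as scoring functions over plan fragments.
# Fragments and scores are made up for illustration.
shards = {
    # The atheist-values shard bids strongly against conversion.
    "atheist-values": lambda f: -1.0 if "convert" in f else 0.5,
    # The Rick-shard bids for anything involving Rick.
    "Rick-happiness": lambda f: 1.0 if "with Rick" in f else 0.0,
}

fragments = [
    "convert to religion",
    "go to church with Rick",
    "sit close in church with Rick",
]

def acceptable(fragment, threshold=0.0):
    """A fragment survives only if no shard bids against it below threshold."""
    return all(score(fragment) >= threshold for score in shards.values())

# Iterative filtering: keep only fragments every shard tolerates.
# The aggregate of survivors is what, in hindsight, looks like a plan.
plan = [f for f in fragments if acceptable(f)]
print(plan)
```

In this sketch the “convert” fragment is filtered out while the church-going fragments survive, without any shard ever seeing or hiding a top-level plan; elaboration would just mean generating more candidate fragments and re-running the same filter.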
Thanks for detailing it. I understand you to describe ~iterative filtering and refinement of a crude proto-plan (church-related thoughts and impulses), which filters down into a more detailed plan, where each piece is selected to be amenable to all relevant shards (without explicit planning).
I think it doesn’t sound quite right to me, still, for a few reasons I’ll think about more.
Yes, iterative filtering and refinement is a good summary.
I agree that the way I formulated it doesn’t flow smoothly. Maybe I stayed too close to shard theory terminology and your process structure. But I’m happy that you see something in it, and I’m looking forward to the results of your thinking.