I think some of the central models/advice in this post [1] are in an uncanny valley of being substantially correct but also deficient, in ways that are liable to lead some users of the models/advice to harm themselves. (In ways distinct from the ones addressed in the post under admonishments to “not be an idiot”.)
In particular, I’m referring to the notion that
The Yumminess You Feel When Imagining Things Measures Your Values
I agree that “yumminess” is an important signal about one’s values. And something like yumminess or built-in reward signals are what shape one’s values to begin with. But there are some further important points to consider. Notably: some values are more abstract than others[2], and values differ a lot in terms of:
How much abstract/S2 reasoning any visceral reward has to route through in order to reinforce that value.
How much abstract/S2 reasoning is required to determine how to satisfy that value, or to determine whether an imagined state-of-affairs satisfies (or violates) that value.
(Or, conversely:) How readily S1 detects the presence (or lack/violation) of that value in any given imagined state-of-affairs, for various ways of imagining that state-of-affairs.
Also, we are computationally limited meat-bags, sorely lacking in the logical omniscience department.
This has some consequences:
It is possible to imagine or even pursue goals that feel yummy but which in fact violate some less-obvious-to-S1 values, without ever realizing that any violation is happening.[3]
Pursuing more abstract values is likely to require more willpower, or even incur undue negative reinforcement, and end up getting done less.[4][5]
More abstract values V are liable to get less strongly reinforced by the brain’s RL than more obviously-to-S1-yummy values W, even if V in fact contributed more to receiving base/visceral reward signals.
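To make that last point concrete, here’s a minimal toy sketch (my own construction, not anything from the post): two value-circuits both feed into the base reward, but credit gets assigned in proportion to S1 salience rather than to actual contribution, so the less-salient (more abstract) value ends up systematically under-reinforced.

```python
# Toy sketch (an assumption of mine, not a model from the post): two "values"
# both contribute to the base/visceral reward, with the abstract one (V)
# contributing MORE, but each circuit's reinforcement update is scaled by how
# salient it is to S1's credit assignment rather than by its true contribution.

def run(steps=1000, lr=0.01):
    strength = {"V_abstract": 1.0, "W_visceral": 1.0}      # current reinforcement strength
    contribution = {"V_abstract": 0.7, "W_visceral": 0.3}  # true share of reward each produces
    s1_salience = {"V_abstract": 0.1, "W_visceral": 0.9}   # how visible each is to S1

    for _ in range(steps):
        reward = sum(contribution.values())  # base reward actually received
        for v in strength:
            # Credit is assigned by salience, not by true contribution.
            strength[v] += lr * reward * s1_salience[v]

    return strength

print(run())
# {'V_abstract': 2.0, 'W_visceral': 10.0}: W ends up reinforced far more,
# despite V contributing more to the reward.
```

Obviously a cartoon; the point is just that salience-weighted credit assignment, iterated, drifts reinforcement toward whatever is most S1-legible.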
These consequences in turn raise questions like:
Should we be very careful about how we imagine possible goals to pursue? How do we ensure that we’re not failing to consider the implications of some abstract values, which, if considered, would imply that the imagined goal is in fact of low or negative value?
Should we correct for our brains’ stupidity by intentionally seeking more reinforcement for more abstract values, or by avoiding reinforcing viscerally-yummy values too much?
Should we correct for our brains’ past stupidity (failures to appropriately reinforce more abstract values) by assigning higher priority to more abstract values despite their lower yumminess?[6]
Or does “might make right”? Should we just let whatever values/brain-circuits have the biggest yumminess-guns determine what we pursue and how our minds get reinforced/modified over time? (Degenerate into wireheaders in the limit?)
The endeavor of answering the above kinds of questions—determining how to resolve the “shoulds” in them—is itself value-laden, and also self-referential/recursive, since the answer depends on our meta-values, which themselves are values to which the questions apply.
Doing that properly can get pretty complicated pretty fast, not least because doing so may require Tabooing “I/me” and dissecting the various constituent parts of one’s own mind down to a level where introspective access (and/or understanding of how one’s own brain works) becomes a bottleneck.[7]
But in conclusion: I’m pretty sure that simply following the most straightforward interpretation of
The Yumminess You Feel When Imagining Things Measures Your Values
would probably lead to doing some kind of violence to one’s own values, to gradually corrupting[8] oneself, possibly without ever realizing it or feeling bad at any point. The probable default is “might makes right”: letting the more obvious-to-S1 values eat up ever more of one’s soul, at the expense of one’s more abstract values.
Addendum:
I’d maybe replace
The Yumminess You Feel When Imagining Things Measures Your Values
with
The Yumminess You Feel When Imagining Things is evidence about how some parts of your brain value the imagined things, to the extent that your imagination adequately captured all relevant aspects of those things.
[1] Or, the models/advice many readers might (more or less (in)correctly) construe from this post.
[2] Examples of abstract values: “being logically consistent”, “being open-minded/non-parochial”, “bite philosophical bullets”, “take ideas seriously”, “value minds independently of the substrate they’re running on”.
[3] To give one example: acting without adequately accounting for scope insensitivity.
[4] Because S1 yumminess-detectors don’t grok the S2 reasoning required to understand that a goal scores highly according to the abstract value, pursuing the goal feels unrewarding.
[5] Example: wanting heroin, vs wanting to not want heroin.
[6] Depends on (i.a.) the extent to which we value “being the kind of person I would be if my brain weren’t so computationally limited/stupid”, I guess.
[7] IME. YMMV.
[8] As judged by a more careful, reflective, and less computationally limited extrapolation of one’s current values.