Think of the stuff that, when you imagine it, feels really yummy.
Also worth taking into consideration: things that feel anti-yummy. Fear/disgust/hate/etc are also signals about your values.
I think the “your values” framing itself already sneaks in assumptions which are false for a lot of minds/brains. Notably: most minds are not perfectly monolithic/unified things well-modeled as a coherent “you/I/me”. And other minds are quite unified/coherent, but are in the unfortunate situation of running on a brain that also contains other (more or less adversarial) mind-like programs/wetware.
Example:
It is entirely possible to have strongly-held values such as “I reject so-and-so arbitrary/disgusting parts of the reward circuitry Evolution designed into my brain; I will not become a slave to the Blind Idiot God’s whims and attempts to control me”. In that case, the “I” that holds those values clearly excludes at least some parts of its host brain’s yumminess-circuitry.[1] (I.e., feelings of yumminess forced upon the mind are not signals about that mind’s values, but rather more like attempts by a semi-adversarial brain to hack that mind.)
Another example:
Alex has some shitty experiences in childhood, and strongly internalizes a schema S like “if I do X, I will be safe”, and thereafter has strong yumminess feelings about doing X. But later, upon reflection, Alex realizes that the yumminess feelings are coming from S, and that S’s implicit models of reality aren’t even remotely accurate now in adulthood. Alex would like to delete S from their brain, but can’t. So the strong yumminess-around-X persists. Is X one of Alex’s values?
So, I object to what I perceive to be an attempt to promote a narrative/frame about what constitutes “you/I/me” or “your values” for people in general. (Though I’m guessing there was no malice involved in that promotion.) Especially when it is a frame that seems to imply that many people (as they conceive of themselves) are not really/fully persons, and/or that they should let arbitrary brain-circuits corrupt their souls (if those brain-circuits happen to have the ability to produce feelings of yumminess).
Please be more careful about deploying/rolling your own metaethics.
Maybe that “I” could be described as a learned mesaoptimizer, something that arose “unintentionally” from the perspective of some imaginable/nonexistent Evolution-aligned mind-designer. But so what? Why privilege some imaginary Evolution fairy over an actually existing person/mind?
I think some of the central models/advice in this post [1] are in an uncanny valley of being substantially correct but also deficient, in ways that are liable to lead some users of the models/advice to harm themselves. (In ways distinct from the ones addressed in the post under admonishments to “not be an idiot”.)
In particular, I’m referring to the notion that
The Yumminess You Feel When Imagining Things Measures Your Values
I agree that “yumminess” is an important signal about one’s values. And something like yumminess or built-in reward signals are what shape one’s values to begin with. But there are some further important points to consider. Notably: Some values are more abstract than others[2]; values differ a lot in terms of
How much abstract/S2 reasoning any visceral reward has to route through in order to reinforce that value.
How much abstract/S2 reasoning is required to determine how to satisfy that value, or to determine whether an imagined state-of-affairs satisfies (or violates) that value.
(Or, conversely:) How readily S1 detects the presence (or lack/violation) of that value in any given imagined state-of-affairs, for various ways of imagining that state-of-affairs.
Also, we are computationally limited meat-bags, sorely lacking in the logical omniscience department.
This has some consequences:
It is possible to imagine or even pursue goals that feel yummy but which in fact violate some less-obvious-to-S1 values, without ever realizing that any violation is happening.[3]
Pursuing more abstract values is likely to require more willpower, or even incur undue negative reinforcement, and end up getting done less.[4][5]
More abstract values V are liable to get less strongly reinforced by the brain’s RL than more obviously-to-S1-yummy values W, even if V in fact contributed more to receiving base/visceral reward signals.
Which in turn raises questions like
Should we be very careful about how we imagine possible goals to pursue? How do we ensure that we’re not failing to consider the implications of some abstract values, which, if considered, would imply that the imagined goal is in fact low-or-negative value?
Should we correct for our brains’ stupidity by intentionally seeking more reinforcement for more abstract values, or by avoiding reinforcing viscerally-yummy values too much?
Should we correct for our brain’s past stupidity (failures to appropriately reinforce more abstract values) by assigning higher priority to more abstract values despite their lower yumminess?[6]
Or does “might make right”? Should we just let whatever values/brain-circuits have the biggest yumminess-guns determine what we pursue and how our minds get reinforced/modified over time? (Degenerate into wireheaders in the limit?)
The endeavor of answering the above kinds of questions—determining how to resolve the “shoulds” in them—is itself value-laden, and also self-referential/recursive, since the answer depends on our meta-values, which themselves are values to which the questions apply.
Doing that properly can get pretty complicated pretty fast, not least because doing so may require Tabooing “I/me” and dissecting the various constituent parts of one’s own mind down to a level where introspective access (and/or understanding of how one’s own brain works) becomes a bottleneck.[7]
But in conclusion: I’m pretty sure that simply following the most straightforward interpretation of
The Yumminess You Feel When Imagining Things Measures Your Values
would probably lead to doing some kind of violence to one’s own values, to gradually corrupting[8] oneself, possibly without ever realizing it or feeling bad at any point. The probable default being “might makes right” / letting the more obvious-to-S1 values eat up ever more of one’s soul, at the expense of one’s more abstract values.
Addendum: I’d maybe replace
The Yumminess You Feel When Imagining Things Measures Your Values
with
The Yumminess You Feel When Imagining Things is evidence about how some parts of your brain value the imagined things, to the extent that your imagination adequately captured all relevant aspects of those things.
or, the models/advice many readers might (more or less (in)correctly) construe from this post
Examples of abstract values: “being logically consistent”, “being open-minded/non-parochial”, “bite philosophical bullets”, “take ideas seriously”, “value minds independently of the substrate they’re running on”.
To give one example: Acting without adequately accounting for scope insensitivity.
Because S1 yumminess-detectors don’t grok the S2 reasoning required to understand that a goal scores highly according to the abstract value, pursuing the goal feels unrewarding.
Example: wanting heroin, vs wanting to not want heroin.
Depends on (i.a.) the extent to which we value “being the kind of person I would be if my brain weren’t so computationally limited/stupid”, I guess.
IME. YMMV.
as judged by a more careful, reflective, and less computationally limited extrapolation of one’s current values
So what do you do about the growing aversion to information which is unpleasant to learn? This list is incomplete, and I’d appreciate your help in expanding it.
The underlying problem seems to be something like “System 1 fails to grok that the Map is not the Territory”. So the solution would likely be something that helps S1 grok that.
Possibly helpful things:
Imagine, in as much concrete/experiential detail as possible, the four worlds corresponding to “unpleasant thing is true/false” x “I do/don’t believe the thing”. Or at least the world where “unpleasant thing is true but I don’t believe it”.
In the post and comments, you’ve said that you’re reflectively stable, in the sense of endorsing your current values. In combination with the sadistic kinks/values described above, that raises some questions:
What exactly stops you from inflicting suffering on people, other than the prospect of social or legal repercussions? Do you have some values that countervail against the sadism? If yes, what are they, and how do you reconcile them with the sadism? [1]
Asking partly because: I occasionally run into sadistic parts in myself, but haven’t found a way to reconcile them with my more empathetic parts, so I usually just suppress/avoid the sadistic parts. And I’d like to find a way to reconcile/integrate them instead.
Could it be due to aliefs about attainability of success becoming lower, and that leading to lower motivation? (Cf. “motivation equation”.) (It’s less likely we’ll be able to attain a flourishing post-human future if the world is deeply insane, mostly run by sociopaths, or similarly horrible.)
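A minimal sketch of the dependence I have in mind, assuming the commonly cited Temporal Motivation Theory form of the “motivation equation” (the exact form doesn’t matter much here):

$$\text{Motivation} = \frac{\text{Expectancy} \times \text{Value}}{\text{Impulsiveness} \times \text{Delay}}$$

Lower aliefs about the attainability of success would correspond to lower Expectancy, and thus lower motivation.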
Or maybe: As one learns about horrors, the only thing that feels worth working on is mitigating the horrors; but that endeavour is difficult, has sparse (or zero) rewards, low probability of success, etc., and consequently does not feel very exciting?
(Also: IIUC, you keep updating towards “world is more horrible than I thought”? If so: why not update all the way, to the point that you can no longer predict which way you’ll update in future?)
Suppose you succeed at doing impactful science in AI. What is your plan for ensuring that those impacts are net-positive? (And how would you define “positive” in this context?)
(CTRL+F’ing this post yielded zero safety-relevant matches for “safe”, “beneficial”, or “align”.)
It’s unclear whether there is a tipping point where [...]
Yes. Also unclear whether the 90% could coordinate to take any effective action, or whether any effective action would be available to them. (Might be hard to coordinate when AIs control/influence the information landscape; might be hard to rise up against e.g. robotic law enforcement or bioweapons.)
Don’t use passive voice for this. [...]
Good point! I guess one way to frame that would be as
by what kind of process do the humans in law enforcement, military, and intelligence agencies get replaced by AIs? Who/what is in effective control of those systems (or their successors) at various points in time?
And yeah, that seems very difficult to predict or reliably control. OTOH, if someone were to gain control of the AIs (possibly even copies of a single model?) that are running all the systems, that might make centralized control easier? </wild, probably-useless speculation>
A potentially somewhat important thing which I haven’t seen discussed:
People who have a lot of political power or own a lot of capital are unlikely to be adversely affected if (say) 90% of human labor becomes obsolete and is replaced by AI.
In fact, so long as property rights are enforced, and humans retain a monopoly on decisionmaking/political power, such people are not-unlikely to benefit from the economic boost that such automation would bring.
Decisions about AI policy are mostly determined by people with a lot of capital or political power. (E.g. Andreessen Horowitz, JD Vance, Trump, etc.)
(This looks like a “decisionmaker is not the beneficiary” type of situation.)
Why does that matter?
It has implications for modeling decisionmakers, interpreting their words, and for how to interact with them.[1]
If we are in a gradual-takeoff world[2], then we should perhaps not be too surprised to see the wealthy and powerful push for AI-related policies that make them more wealthy and powerful, while a majority of humans become disempowered and starve to death (or live in destitution, or get put down with viruses or robotic armies, or whatever). (OTOH, I’m not sure if that possibility can be planned/prepared for, so maybe that’s irrelevant, actually?)
For example: we maybe should not expect decisionmakers to take risks from AI seriously until they realize those risks include a high probability of “I, personally, will die”. As another example: when people like JD Vance output rhetoric like “[AI] is not going to replace human beings. It will never replace human beings”, we should perhaps not just infer that “Vance does not believe in AGI”, but instead also assign some probability to hypotheses like “Vance thinks AGI will in fact replace lots of human beings, just not him personally; and he maybe does not believe in ASI, or imagines he will be able to control ASI”.
Here I’ll define “gradual takeoff” very loosely as “a world in which there is a >1 year window during which it is possible to replace >90% of human labor, before the first ASI comes into existence”.
Thank you for (being one of the horrifyingly few people) doing sane reporting on these crucially important topics.
Typo: “And humanity needs all the help we it can get.”
Out of (1)-(3), I think (3)[1] is clearly the most probable:
I think (2) would require Altman to be deeply un-strategic/un-agentic, which seems in stark conflict with all the skillful playing-of-power-games he has displayed.
(3) seems strongly in-character with the kind of manipulative/deceitful maneuvering-into-power he has displayed thus far.
I suppose (1) is plausible; but for that to be his only motive, he would have to be rather deeply un-strategic (which does not seem to be the case).
(Of course one could also come up with other possibilities besides (1)-(3).)[2]
or some combination of (1) and (3)
E.g. maybe he plans to keep ASI to himself, but use it to implement all-of-humanity’s CEV, or something. OTOH, I think the kind of person who would do that, would not exhibit so much lying, manipulation, exacerbating-arms-races, and gambling-with-everyone’s-lives. Or maybe he doesn’t believe ASI will be particularly impactful; but that seems even less plausible.
Note that our light cone with zero value might also eclipse other light cones that might’ve had value if we didn’t let our AGI go rogue to avoid s-risk.
That’s a good thing to consider! However, taking Earth’s situation as a prior for other “cradles of intelligence”, I think that consideration brings us back to the question of “should we expect Earth’s lightcone to be better or worse than zero-value (conditional on corrigibility)?”
To me, those odds each seem optimistic by a factor of about 1000, but ~reasonable relative to each other.
(I don’t see any low-cost way to find out why we disagree so strongly, though. Moving on, I guess.)
But this isn’t any worse to me than being killed [...]
Makes sense (given your low odds for bad outcomes).
Do you also care about minds that are not you, though? Do you expect most future minds/persons that are brought into existence to have nice lives, if (say) Donald “Grab Them By The Pussy” Trump became god-emperor (and was the one deciding what persons/minds get to exist)?
IIUC, your model would (at least tentatively) predict that
if person P has a lot of power over person Q,
and P is not sadistic,
and P is sufficiently secure/well-resourced that P doesn’t “need” to exploit Q,
then P will not intentionally do anything that would be horrible for Q?
If so, how do you reconcile that with e.g. non-sadistic serial killers, rapists, or child abusers? Or non-sadistic narcissists in whose ideal world everyone else would be their worshipful subject/slave?
That last point also raises the question: Would you prefer the existence of lots of (either happily or grudgingly) submissive slaves over oblivion?
To me it seems that terrible outcomes do not require sadism. Seems sufficient that P be low in empathy, and want from Q something Q does not want to provide (like admiration, submission, sex, violent sport, or even just attention).[1] I’m confused as to how/why you disagree.
Also, AFAICT, about 0.5% to 8% of humans are sadistic, and about 8% to 16% have very little or zero empathy. How did you arrive at “99% of humanity [...] are not so sadistic”? Did you account for the fact that most people with sadistic inclinations probably try to hide those inclinations? (Like, if only 0.5% of people appear sadistic, then I’d expect the actual prevalence of sadism to be more like ~4%.)
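To spell out the arithmetic behind that last guess (a rough sketch; the detection rate is just my assumption):

$$p_{\text{actual}} \approx \frac{p_{\text{observed}}}{\Pr(\text{appears sadistic} \mid \text{sadistic})} \approx \frac{0.5\%}{1/8} = 4\%$$

i.e., assuming only around 1 in 8 people with sadistic inclinations fail to hide them.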
It seems like you’re assuming people won’t build AGI if they don’t have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity.
I’m assuming neither. I agree with you that both seem (very) unlikely. [1]
It seems like you’re assuming that any humans succeeding in controlling AGI is (in expectation) preferable to extinction? If so, that seems like a crux: if I agreed with that, then I’d also agree with “publish all corrigibility results”.
I expect that unaligned ASI would lead to extinction, and our share of the lightcone being devoid of value or disvalue. I’m quite uncertain, though.
It’s more important to defuse the bomb than it is to prevent someone you dislike from holding it.
I think there is a key disanalogy to the situation with AGI: The analogy would be stronger if the bomb were likely to kill everyone, but also had some (perhaps very small) probability of conferring godlike power on whoever holds it. I.e., there is a tradeoff: decrease the probability of dying, at the expense of increasing the probability of S-risks from corrupt(ible) humans gaining godlike power.
If you agree that there exists that kind of tradeoff, I’m curious as to why you think it’s better to trade in the direction of decreasing probability-of-death for increased probability-of-suffering.
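To make the shape of that tradeoff explicit, here is a toy expected-value sketch (my framing; the probability and value terms are placeholders, not estimates):

$$E[V] = p_{\text{ext}}\,V_{\text{ext}} + p_{\text{s-risk}}\,V_{\text{s-risk}} + p_{\text{good}}\,V_{\text{good}}$$

Publishing corrigibility results would plausibly lower the extinction probability but raise the s-risk probability; whether that is net-positive depends heavily on how negative one takes the s-risk outcomes to be, relative to extinction (which, on my model, is roughly zero-value).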
So, the question I’m most interested in is the one at the end of the post[1], viz
What (crucial) considerations should one take into account, when deciding whether to publish—or with whom to privately share—various kinds of corrigibility-related results?
Didn’t put it in the title, because I figured that’d be too long of a title.
Taking a stab at answering my own question; an almost-certainly non-exhaustive list:
Would the results be applicable to deep-learning-based AGIs?[1] If I think not, how can I be confident they couldn’t be made applicable?
Do the corrigibility results provide (indirect) insights into other aspects of engineering (rather than SGD’ing) AGIs?
How much weight one gives to avoiding x-risks vs s-risks.[2]
Who actually needs to know of the results? Would sharing the results with the whole Internet lead to better outcomes than (e.g.) sharing the results with a smaller number of safety-conscious researchers? (What does the cost-benefit analysis look like? Did I even do one?)
How optimistic (or pessimistic) one is about the common-good commitment (or corruptibility) of the people who one thinks might end up wielding corrigible AGIs.
Something like the True Name of corrigibility might at first glance seem applicable only to AIs of whose internals we have some meaningful understanding or control.
If corrigibility were easily feasible, then at first glance, that would seem to reduce the probability of extinction (via unaligned AI), but increase the probability of astronomical suffering (under god-emperor Altman/Ratcliffe/Xi/Putin/...).
Given that the basic case for x-risks is so simple/obvious[1], I think most people arguing against any risk are probably doing so due to some kind of myopic/irrational subconscious motive. (It’s entirely reasonable to disagree on probabilities, or what policies would be best, etc.; but “there is practically zero risk” is just absurd.)
So I’m guessing that the deeper problem/bottleneck here is people’s (emotional) unwillingness to believe in x-risks. So long as they have some strong (often subconscious) motive to disbelieve x-risks, any conversation about x-risks is liable to keep getting derailed or be otherwise very unproductive.[2]
I think some common underlying reasons for such motivated disbelief include
Subconscious Map-Territory confusions:
Believing X makes X feel more real. And so, System 1 “solves” x-risks by disbelieving them away.
System 1 makes decisions—including decisions about what to believe—under the assumption that the current Map is the Territory. For example: If the current Map says that “x-risks are not real and that’s good because now I can keep making money developing AGI capabilities”, then according to that Map, it would be bad to update towards believing that x-risks are real (because then maybe you’d needlessly stop making money!).
Cognitive immune response. Lots of humans have a strong subconscious instinct along the lines of “If a clever human is trying, with abstract arguments, to convince me of X, and I anticipate that believing X would change my policy/behavior in ways that feel unpleasant, then I reject/disbelieve X”.[3]
I’m not sure what the best approaches to addressing the above kinds of dynamics are. Trying to directly point them out seems likely to end badly (at least with most neurotypical people). Small mental exercises like Split and Commit or giving oneself a line of retreat might help with (1), if you can somehow get people to (earnestly) do them? For (2), maybe
avoid trying to persuade; instead ask questions that prompt the person to think concretely about the situation themself,
show them something tangible, like some AI model doing something impressive, or that slowed-down video of a train station (along with “this is what humans would look like to an entity that thinks 100 times faster—statues”)?
If you try the above, I’d be curious to see a writeup of the results.
Building a species of superhumanly smart & fast machine aliens without understanding how they work seems very dangerous. And yet, various companies and nations are currently pouring trillions of dollars into making that happen, and appear to be making rapid progress. (Experts disagree on whether there’s a 99% chance we all die, or if there’s only a 10% chance we all die and a 90% chance some corporate leaders become uncontested god-emperors, or if we end up as pets to incomprehensible machine gods, or if the world will be transformed beyond human comprehension and everyone will rely on personal AI assistants to survive. Sounds good, right?)
A bit like trying to convince a deeply religious person via rational debate. It’s not really about the evidence/reasoning.
I wouldn’t be too surprised if this kind of instinct were evolved, rather than just learned. Even neurotypical humans try to hack each other all the time, and clever psychopaths have probably been around for many, many generations.