I believe LW calls it “ethical injunctions”.
quetzal_rainbow
You start to appreciate all the points about “predicting technological progress is very hard actually” from “There is no fire alarm for AGI” when you realize that “Attention Is All You Need” was published several months earlier.
Another thing is narrow self-concept.
In original thread, people often write about things they have and their clone would want, like family. They fail to think about things they don’t have due to having families, like cocaine orgies, or volunteering to war for just cause, or monastic life in search of enlightenment, so they could flip a coin and go pursue alternative life in 50% of cases. I suspect it’s because thinking about desirable things you won’t have on the best available course of your life is very sour-grapes-flavored.
When I say “human values” without reference I mean “type of things that human-like mind can want and their extrapolations”. Like, blind from birth person can want their vision restored, even if they have sufficiently accommodating environment and other ways to orient, like echolocation. Able-bodied human can notice this and extrapolate this into possible new modalities of perception. You can be not vengeful person, but concept of revenge makes sense to almost any human, unlike concept of paperclip-maximization.
It’s nice ideal to strive for, but sometimes you need to make judgement call based on things you can’t explain.
Okay, but yumminess is not values. If we pick ML analogy, yumminess is reward signal or some other training hyperparameter.
My personal operationalization of values is “the thing that helps you to navigate trade-offs”. You can have yummi feelings about saving life of your son or about saving life of ten strangers, but we can’t say what you value until you consider situation where you need to choose between two. And, conversely, if you have good feelings about parties and reading books, your values direct what you choose.
Choice in case of real, value-laden trade-offs is usually defined by significant amount of reflection about values and memetic ambience supplies known summaries of such reflection in the past.
This reason only makes sense if you expect first person to develop AGI to create singleton which takes over the world and locks in pre-installed values, which, again, I find not very compatible with low p(doom). What prevents scenario “AGI developers look around for a year after creation of AGI and decide that they can do better” if not misaligned takeover and not suboptimal value lock-in?
The reason to work on preventing AI takeover now, as opposed to working on already invented AGI in the future, is the first try problem: if you have unaligned takeover-capable AGI, takover just happens and you don’t get to iterate. The same happens with problem of extremely good future only if you believe that the main surviving scenario is “aligned-with-developer-intention singleton takes over the world very quickly, locking in pre-installed values”. People who believe in such scenario usually have very high p(doom), so I assume you are not one of them.
What exactly prevents your strategy here from being “wait for aligned AGI, ask it how to make future extremely good and save some opportunity cost”?
Sure, set of available options is defined in problem setup. It’s “one-box” and “two-box”.
I feel like it’s confusion about type signature of decision theory? Decision theory talks about mappings from observations and probabilistic models to actions. In case of humans, actions are motor outputs. Decision theory asks “what sort of motor output is the best?” and answers “which leads you to leave with one box”. You are allowed to be really indecisive in process and cry “it feels wrong to leave the second box!”, Omega in this scenario doesn’t care.
I think the difference between reality and fiction is that fiction contains heroes—high-agency people in very personal relationships with villains.
Unrealistic part of being hero to designated villain is that in real life enemies are screened off from each other by many layers of social reality. Roosevelt was kinda hero to Hitler’s villainy, but Roosevelt mostly didn’t spend his nights thinking about Hitler could think, because 1) FDR was busy managing his country, 2) anything that Hitler could do was overconstrained by abilities of Germany, so it was more productive to think about what Germany as a whole could do. As a result, there is no interpersonal drama in which supervillain monologue fits.
Stanislaw Lem once wrote (I translated it from Russian, because I can’t find another source in the open):
...sometimes it seems incredible to me that there are so many billions of people, many thousands of whom are gifted with imagination and expertise in the field of hypothesis generation, but no one takes on what I do. How is it that in some remote corner of Slavic land, between the North Pole and the Balkans, one madman suffers in solitude? Even from my favored statistical point of view, it follows that there must be at least a few such Quasi-Lems, Anti-Lems, Para-Lems, or Proto-Lems in the world, but here there are none.
Alternative hypothesis: when you recontextualize, you train model “you will get away with X, given this signal (encouragement to do X in system prompt)”. When model sees no such signal, it decides that it won’t get away with bad behavior and acts good.
I wonder what would happen if you recontextualize with pair of prompts “your actions and thoughts will be recorded and evaluated” vs “you are acting absolutely autonomously, no oversight”.
It depends on properties of bounded search itself.
I.e., if you are properly calibrated domain expert who can make 200 statements on topic with assigned probability 0.5% and be wrong on average 1 time, then, when you arrive at probability 0.5% as a result of your search for examples, we can expect that your search space was adequate and wasn’t oversimplified, such that your result is not meaningless.
If you operate in confusing, novel, adversarial domain, especially when domain is “the future”, when you find yourself assigning probabilities 0.5% for any reason which is not literally theorems and physical laws, your default move should be to say “wait, this probability is ridiculous”.
A video game based around interacting with GenAI-based elements will achieve break-out status.
Nope. This continues to be a big area of disappointment. Not only did nothing break out, there wasn’t even anything halfway decent.
We have at least two problems on a way here:
Artistic community hates GenAI guts.
It’s an absolute copyright hell and most computer game shops do not platform AI games for this reason
if you are first (immensely capable) then you’ll pursue (coherence) as a kind of side effect, because it’s pleasant to pursue.
I’m certain it’s very straw motivation.
Imagine that you are Powerful Person. You find yourself lying in bed all day wallowing in sorrows of this earthly vale. You feel sad and you don’t do anything.
This state is clearly counterproductive for any goal you can have in mind. If you care about sorrows of this earthly vale, you would do better if you earn additional money and donate it, if you don’t, then why suffer? Therefore, you try to mold your mind in shape which doesn’t allow for laying in bed wallowing in sorrows.
From my personal experience, I have ADHD and I’m literally incapable to even write this comment without at least some change of my mindset from default.
it looks like this just kinda sucks as a means
It certainly sucks, because it’s not science and engineering, it’s collection of tricks which may work for you or may not.
On the other hand, we are dealing with selection effects—highly-coherent people don’t need artificial means to increase it and people actively seeking artificial coherence are likely to have executive function deficits or mood disorders.
Also, some methods of increasing coherence are not very dramatic. Writing can plausibly make you more coherent because during writing you will think about your thought process and nobody will notice, because it’s not as sudden as personality change after psychedelics.
I think that in natural environments both kind of actions are actually actions taken by the same kind of people. The most power-seeking cohort on Earth (San-Francisco start up enterpreneurs) is obsessed with mindfulness, meditations, psychedelics, etc. If you squint and look at history of esoterism, you will see tons of powerful people who wanted to become even more powerful through greater personal coherence (alchemical Magnum Opus, this sort of stuff).
IIRC, canonical old-MIRI definition of intelligence is “intelligence is cross-domain optimization” and it captures your definition modulo emotional/easy-to-understand-for-human part?
Relentlessness comes both from “optimization as looking for the ‘best’ solution” and “cross-domaining as ignoring conventional boundaries”, resourcefulness comes from picking more resources from non-standard domains, creativity is just consequence of sufficiently long optimization in sufficiently large search space.
Okay, but it looks like original inner misalignment problem? Either model has wrong representation for “human values”, or we fail to recognize proper representation and make it optimize for something else?
On the other hand, properly optimized for human values world should look very weird. It likely includes a lot of aliens having a lot of weird alien fun, and weird qualia factories and...
I express this by saying “sufficiently advanced probabilistic reasoning is indistinguishable from prophetic intuition”.