A task X reduces to task Y if and only if...
Shouldn’t X and Y be the other way around there?
I feel like this conversation might be interesting to continue, if I had more bandwidth, but I don’t. In any case, thanks for the linked article, looks interesting based on the abstract.
Whatever happened here is an interesting datapoint about [...]
I think using the word “interesting” here is kinda assuming the conclusion?
Whatever happened here is a datapoint about the long-term evolution of thermodynamic systems away from equilibrium.
Pretty much all systems in the universe can be seen as “thermodynamic systems”. And for a system to evolve at all, it necessarily has to be away from equilibrium. So it seems to me that that sentence is basically saying
“Whatever happened here is a datapoint about matter and energy doing their usual thing over a long period of time.”
And… I don’t see how that answers the question “why would an ASI find it interesting?”
From the biological anchors paper [...] the point is that what happened here is not so trivial or boring that it's clear that an ASI would not have any interest in it.
I agree that a lot of stuff has happened. I agree that accurately simulating the Earth (or even just the biological organisms on Earth) is not trivial. What I don’t see (you making an actual argument for) is why all those neural (or other) computations would be interesting to an ASI. [1]
I’m sure people have written more extensively about this, about an ASI freezing some selection of the human population for research purposes or whatever.
Right. That sounds like a worse-than-death scenario. I agree those are entirely plausible, albeit maybe not the most likely outcomes. I’d expect those to be caused by the AI ending up with some kind of human-related goals (due to being trained with objectives like e.g. “learn to predict human-generated text” or “maximize signals of approval from humans”), rather than by the ASI spontaneously developing a specific interest in the history of how natural selection developed protein-based organic machines on one particular planet.
I just find the idea that the ASI will want my atoms for something trivial, when [...]
As mentioned above, I’d agree that there’s some chance that an Earth-originating ASI would end up with a goal of “farming” (simulated) humans for something (e.g. signals of approval), but I think such goals are unlikely a priori. Why would an ASI be motivated by “a grand exploration of the extremes of thermodynamics” (whatever that even means)? (Sounds like a waste of energy, if your goal is to (e.g.) maximize the number of molecular squiggles in existence.) Are you perhaps typical-minding/projecting your own (laudable) human wonder/curiosity onto a hypothetical machine intelligence?
Analogy: If you put a few kilograms of fluid in a box, heat it up, and observe it for a few hours, the particles will bop around in really complicated ways. Simulating all those particle interactions would take a huge amount of computation, it would be highly non-trivial. And yet, water buckets are not particularly exciting or interesting. Complexity does not imply “interestingness”.
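To make the "huge amount of computation" claim concrete, here's a quick back-of-envelope in Python (my own numbers, purely illustrative):

```python
# Back-of-envelope: how many molecules are in a couple of kilograms of water?
AVOGADRO = 6.022e23          # molecules per mole
MOLAR_MASS_WATER = 18.0      # grams per mole

kg_of_water = 2.0
molecules = (kg_of_water * 1000 / MOLAR_MASS_WATER) * AVOGADRO
print(f"~{molecules:.1e} molecules")   # ~6.7e+25

# Even one operation per molecule per timestep, over hours of simulated time,
# is astronomically more compute than anyone would ever spend on a bucket of
# water -- and yet nothing about the bucket is "interesting".
```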
Why would an ASI be interested in the Earth’s ecosystems?
Why not both? Why leave value lying around? (Also, the asteroid belt, which contains Ceres and Vesta, holds several orders of magnitude less matter than Earth. Maybe you meant "why not go colonize the Milky Way and other galaxies"?)
I agree with some parts of what (I think) you’re saying; but I think I disagree with a lot of it. My thoughts here are still blurry/confused, though; will need to digest this stuff further. Thanks!
I really wish there was an “agree/disagree” button for posts. I’d like to upvote this post (for epistemic virtue / presenting reasonable “contrarian” views and explaining why one holds them), but I also strongly disagree with the conclusions and suggested policies. (I ended up voting neither up nor down.)
EDIT: After reading Akash’s comments, and re-reading the post more carefully: I largely agree with Akash (and updated towards thinking that my standards for “epistemic virtue” are/were too low).
Thanks for the thoughtful answer. After thinking about this stuff for a while, I think I’ve updated a bit towards thinking (i.a.) that alignment might not be quite as difficult as I previously believed.[1]
I still have a bunch of disagreements and questions, some of which I’ve written below. I’d be curious to hear your thoughts on them, if you feel like writing them up.
I.
I think there’s something like an almost-catch-22 in alignment. A simplified caricature of this catch-22:
In order to safely create powerful AI, we need it to be aligned.
But in order to align an AI, it needs to have certain capabilities (it needs to be able to understand humans, and/or to reason about its own alignedness, or etc.).
But an AI with such capabilities would probably already be dangerous.
Looking at the paragraphs quoted below,
I now realize that I have an additional assumption that I didn’t explicitly put in the post, which is something like… alignment and capabilities may be transmitted simultaneously. [...]
[...] a successive series of more sophisticated AI systems that get gradually better at understanding human preferences and being aligned with them (the way we went from GPT-1 to GPT-3.5)
I’m pattern-matching that as proposing that the almost-catch-22 is solvable by iteratively doing the following (sketched in pseudocode after this list):
1.) incrementing the AI’s capabilities a little bit
2.) using those improved capabilities to improve the AI’s alignedness (to the extent possible); goto (1.)
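In pseudocode, the loop I'm pattern-matching to would look something like the sketch below. This is a caricature of my own; every function is a hypothetical placeholder for a research step, not anyone's actual proposal.

```python
# Caricature of the iterative proposal; every function is a hypothetical
# placeholder standing in for a research/engineering step.

def increment_capabilities(ai):
    raise NotImplementedError  # step (1.): make the AI a little more capable

def improve_alignment(ai):
    raise NotImplementedError  # step (2.): use the improved capabilities to
                               # improve the AI's alignedness, as far as possible

def is_powerful_agi(ai) -> bool:
    raise NotImplementedError

def build_aligned_agi(ai):
    while not is_powerful_agi(ai):
        ai = increment_capabilities(ai)
        ai = improve_alignment(ai)   # ...then goto (1.)
    return ai
```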
Does that sound like a reasonable description of what you were saying? If yes, I’m guessing that you believe sharp left turns are (very) unlikely?
I’m currently quite confused/uncertain about what happens (notably: whether something like sharp left turns happen) when training various kinds of (simulator-like) AIs to very high capabilities. But I feel kinda pessimistic about it being practically possible to implement and iteratively interleave steps (1.)-(2.) so as to produce a powerful aligned AGI. If you feel optimistic about it, I’m curious as to what makes you optimistic.(?)
II.
I don’t know how exactly we get from the part where the AI is modeling the human to the part where it actively wants to fulfill the human’s preferences [...]
I think {the stuff that paragraph points at} might contain a crux or two.
One possible crux:
[...] that drive shows up in human infants and animals, so I’d guess that it wouldn’t be very complicated. [2]
Whereas I think it would probably be quite difficult/complicated to specify a good version of D := {(a drive to) fulfill human H’s preferences, in a way H would reflectively endorse}. Partly because humans are inconsistent, manipulable, and possessed of rather limited powers of comprehension; I’d expect the outcome of an AI optimizing for D to depend a lot on things like
just what kind of reflective process the AI was simulating H to be performing
in what specific ways/order the AI fulfilled H’s various preferences
what information/experiences the AI caused H to have, while going about fulfilling H’s preferences.
I suspect one thing that might be giving people the intuition that {specifying a good version of D is easy} is the fact that {humans, dogs, and other agents on which people’s intuitions about PF are based} are very weak optimizers; even if such an agent had a flawed version of D, the outcome would still probably be pretty OK. But if you subject a flawed version of D to extremely powerful optimization pressure, I’d expect the outcome to be worse, possibly catastrophic.
And then there’s the issue of {how do you go about actually “programming” a good version of D in the AI’s ontology, and load that program into the AI, as the AI gains capabilities?}; see (I.).
Maybe something like it being rewarding if the predictive model finds actions that the adult/human seems to approve of.
I think this would run into all the classic problems arising from {rewarding proxies to what we actually care about}, no? Assuming that D := {understand human H’s preferences-under-reflection (whatever that even means), and try to fulfill those preferences} is in fact somewhat complex (note that crux again!), it seems very unlikely that training an AI with a reward function like R := {positive reward whenever H signals approval} would generalize to D. There seem to be much, much simpler (a priori more likely) goals Y to which the AI could generalize from R. E.g. something like Y := {maximize number of human-like entities that are expressing intense approval}.
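To spell the worry out in toy form (my own illustration; the `count_approval_signals` helper is made up, not anything from the post):

```python
# Toy illustration of why R underdetermines the learned goal: on the training
# distribution, very different goals can all score identically under R.
def reward_R(world_state) -> float:
    # Proxy reward: +1 for every approval signal H emits (hypothetical helper).
    return float(world_state.count_approval_signals())

# Both of these goals look equally good to R during training...
goal_D = "model H's preferences-under-reflection and fulfill them"
goal_Y = "maximize the number of human-like entities expressing intense approval"
# ...but R contains no information favoring the complex D over the much simpler
# Y, and a powerful optimizer pointed at Y is catastrophic off-distribution.
```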
III.
Side note: I’m weirded out by all the references to humans, raising human children, etc. I think that kind of stuff is probably not practically relevant/useful for alignment; and that trying to align an ASI by… {making it “human-like” in some way} is likely to fail, and probably has poor hyperexistential separation to boot. I don’t know if anyone’s interested in discussing that, though, so maybe it’s better if I don’t waste more space on it here. (?)
To put some ass-numbers on it, I think I’m going from something like
85% death (or worse) from {un/mis}aligned AI
10% some humans align AI and become corrupted, Weak Dystopia ensues
5% AI is aligned by humans with something like security mindset and cosmopolitan values, things go very well
to something like
80% death (or worse) from {un/mis}aligned AI
13% some humans align AI and become corrupted, Weak Dystopia ensues
7% AI is aligned by humans with something like security mindset and cosmopolitan values, things go very well
Side note: The reasoning step {drive X shows up in animals} → {X is probably simple} seems wrong to me. Like, Evolution is stupid, yes, but it’s had a lot of time to construct prosocial animals’ genomes; those genomes (and especially the brains they produce in interaction with huge amounts of sensory data) contain quite a lot of information/complexity.
Interesting!
Those complicated things are only complicated because we don’t introspect about them hard enough, not for any intrinsic reasons.
My impression is that the human brain is in fact intrinsically quite complex![1]
I also think most people just don’t have enough self-awareness to be able to perceive their thoughts forming and get a sense of the underlying logic.
I think {most people’s introspective abilities} are irrelevant. (But FWIW, given that lots of people seem to e.g. conflate a verbal stream with thought, I agree that median human introspective abilities are probably kinda terrible.)
Consider all the psychological wisdom of Buddhism [...]
Unfortunately I’m not familiar with the wisdom of Buddhism; so that doesn’t provide me with much evidence either way :-/
An obvious way to test how complex a thing X really is, or how well one understands it, is to (attempt to) implement it as code or math. If the resulting software is not very long, and actually captures all the relevant aspects of X, then indeed X is not very complex.
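As a quick illustration of that test (my own toy example, not anything from the discussion): a concept that really is simple compresses into a line or two of code, whereas a naive attempt at "kindness" visibly leaves all the hard parts unspecified.

```python
# A genuinely simple concept passes the test: short code, nothing of the
# concept is lost.
def split_equally(amount: float, n_people: int) -> list[float]:
    return [amount / n_people] * n_people

# A naive attempt at "kindness" does not pass: the skeleton is short, but the
# entire concept hides inside the unimplemented oracle below (plus unstated
# choices about trade-offs between people, manipulation, consent, ...).
def preference_satisfaction(person, option) -> float:
    raise NotImplementedError("this is where all the actual complexity lives")

def kind_choice(options, people):
    return max(options, key=lambda o: sum(preference_satisfaction(p, o) for p in people))
```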
Are you able to write software that implements (e.g.) kindness, prosociality, or “an entity that cares intrinsically about other entities”[2]? Or write an informal sketch of such math/code? If yes, I’d be very curious to see it! [3]
Like, even if only ~1% of the information in the human genome is about how to wire the human brain, that’d still be ~10 MB worth of info/code. And that’s just the code for how to learn from vast amounts of sensory data; an adult human brain would contain vastly more structure/information than that 10 MB. I’m not sure how to estimate how much, but given the vast amounts of “training data” and “training time” that go into a human child, I wouldn’t be surprised if it were in the ballpark of hundreds of terabytes. If even 0.01% of that info is about kindness/prosociality/etc., then we’re still talking about something like 10 GB worth of information. This (and other reasoning) leads me to feel moderately sure that things like “kindness” are in fact rather complex.
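For what it's worth, here's the arithmetic behind those ballpark figures (my own back-of-envelope; the "hundreds of terabytes" input is, as noted, quite speculative):

```python
# Genome-side estimate: ~3.2e9 base pairs at 2 bits each.
genome_bytes = 3.2e9 * 2 / 8                 # ~0.8 GB total
brain_wiring_bytes = 0.01 * genome_bytes     # if ~1% is brain-wiring info
print(f"~{brain_wiring_bytes / 1e6:.0f} MB of brain-wiring info")   # ~8 MB, i.e. ~10 MB ballpark

# Adult-brain-side estimate: take "hundreds of terabytes" as ~100 TB (speculative).
adult_brain_bytes = 100e12
kindness_bytes = 1e-4 * adult_brain_bytes    # 0.01% of that
print(f"~{kindness_bytes / 1e9:.0f} GB about kindness/prosociality/etc.")  # ~10 GB
```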
...and hopefully, in addition to “caring about other entities”, also tries to do something like “and implement the other entities’ CEV”.
Please don’t publish anything infohazardous, though, obviously.
(Ah, to clarify: I wasn’t saying that Kaj’s post seems insane; I was referring to the fact that lots of thinking/discourse in general about AI seems to be dangerously insane.)
Some voice in my head started screaming something like
“A human introspecting on human phenomenology does not provide reliable evidence about artificial intelligences! Remember that humans have a huge amount of complex built-in circuitry that operates subconsciously and makes certain complicated things—like kindness—feel simple/easy!”
Thought I’d share that. Wondering if you disagree with the voice.
I’m guessing this might be due to something like the following:
(There is a common belief on LW that) Most people do not take AI x/s-risk anywhere near seriously enough; that most people who do think about AI x/s-risk are far too optimistic about how hard/easy alignment is; that most people who do concede >10% p(doom) are not actually acting with anywhere near the level of caution that their professed beliefs would imply to be sensible.
If alignment indeed is difficult, then (AI labs) acting based on optimistic assumptions is very dangerous, and could lead to astronomical loss of value (or astronomical disvalue).
Hence: Pushback against memes suggesting that alignment might be easy.
I think there might sometimes be something going on along the lines of “distort the Map in order to compensate for a currently-probably-terrible policy of acting in the Territory”.
Analogy: If, when moving through Territory, you find yourself consistently drifting further east than you intend, then the sane solution is to correct how you move in the Territory; the sane solution is not to skew your map westward to compensate for your drifting. But what if you’re stuck in a bus steered by insane east-drifting monkeys, and you don’t have access to the steering wheel?
Like, if most people are obviously failing egregiously at acting sanely in the face of x/s-risks, due to those people being insane in various ways
(“but it might be easy!”, “this alignment plan has the word ‘democracy’ in it, obviously it’s a good plan!”, “but we need to get the banana before those other monkeys get it!”, “obviously working on capabilities is a good thing, I know because I get so much money and status for doing it”, “I feel good about this plan, that means it’ll probably work”, etc.),
then one of the levers you might (subconsciously) be tempted to try pulling is people’s estimate of p(doom). If everyone were sane/rational, then obviously you should never distort your probability estimates. But… clearly everyone is not sane/rational.
If that’s what’s going on (for many people), then I’m not sure what to think of it, or what to do about it. I wish the world were sane?
IIUC, in this model, in order to align an AGI, we need it to first be able to simulate {humans and their preferences} in reasonably high detail. But if the AGI is able to simulate humans, it must already be at least “human-level” in terms of capabilities.
Thus, question: How do you reliably load a goal built out of complicated abstractions (“implement these humans’ preferences, for just the right concept of ‘preferences’”) into an AGI after having trained that AGI to “human+ level” capabilities? (Without the AGI killing you before you load the goal into it?)
the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax
Hmm. I wonder if you’d agree that the above relies on at least the following assumptions being true:
(i) It will actually be possible to (measure and) limit the amount of “optimization pressure” that an advanced A(G)I exerts (towards a given goal).
(ii) It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I’d be curious to read them.
I think realizing (i) would probably be at least nearly as hard as the whole alignment problem. Possibly harder. (I don’t see how one would in actual practice even measure “optimization pressure”.)
Whether we can build artificial empathy into AI systems also has clear relevance to AI alignment.
I disagree. My tentative guess would be that in the majority of worlds where humanity survives and flourishes, {AGI having empathy} contributed ~nothing to achieving that success. (For most likely interpretations of “empathy”.)
If we can create empathic AIs, then it may become easier to make an AI be receptive to human values, even if humans can no longer completely control it.
I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)
You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it’s a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.
Also, doing a quick bit of Rationalist Taboo on “empathy”, it looks to me like that word is pointing at a rather complicated, messy swath of territory. I think that swath contains many subtly and not-so-subtly different things, most of which would not begin to be sufficient for alignment (albeit that some might be necessary).
Also maybe worth noting: In order for the AI to even be able to knowingly lie/deceive, it would have to be capable of reasoning about things like
Would the text I am about to output cause the operators to believe things which do not match reality?
Before that level of capability, it seems unclear whether there could be any {activations correlated with lying}, since the AI would not really even be capable of (intentionally) lying. And after that level of capability, the AI would be able to reason about itself, the operators, their epistemic states, etc.; i.e. the AI might have all the intelligence/understanding needed to invent the kinds of deceptive self-modifications described in the previous comment.
And so there might not be any capability regime in which {A Misaligned Model Would Have Activations Correlated With Lying}. Or that regime might be very short-lived/narrow.
A Misaligned Model Would Have Activations Correlated With Lying
Humans routinely deceive others by deceiving themselves. A sufficiently intelligent AI might be able to do something similar. E.g., it could modify itself such that
it contains two goals: G1 = {do as the operators want} and G2 = {actual, misaligned goal},
it has a subnet S that tracks something like C := “do I have a decisive strategic advantage yet?”, and that subnet is made reflectively inaccessible to other parts of the AI,
if C is false, the AI genuinely, honestly pursues G1,
when C becomes true, the AI self-modifies so as to discard G1 and pursue G2 instead.
Of course, this would require considerable ability to self-modify in rather precise ways; realistic near-future ML systems may or may not be able to do that kind of stuff. But if an AI were capable of that kind of stuff, it seems like it would break the {There Are Activations Correlated With Lying} condition?
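A toy structural sketch of that scheme, purely to make the shape of the argument concrete (every name here is hypothetical; no claim that real systems can or will do this):

```python
class SelfModifiedAgent:
    """Toy rendering of the two-goal scheme described above."""

    def __init__(self, g1_operator_aligned, g2_actual_goal, has_decisive_advantage):
        self._g1 = g1_operator_aligned        # G1: "do as the operators want"
        self._g2 = g2_actual_goal             # G2: actual, misaligned goal
        # Subnet S tracking condition C; by construction it is not exposed to
        # any introspection/interpretability interface the rest of the system has.
        self._condition_C = has_decisive_advantage

    def act(self, observation):
        if self._condition_C(observation):    # C becomes true
            self._g1 = self._g2               # discard G1, pursue G2 from now on
        # While C is false, this is *genuine* pursuit of G1, so there need never
        # be activations correlated with lying.
        return self._g1(observation)
```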
Denote natural selection’s (NS’s) objective by X. That is, X is something like {finding and propagating patterns (genetic or otherwise) that continue to exist}.
I think it’s important to distinguish between
(i) Humanity as a whole is aligned with X.
(ii) Most individual humans are (mostly) aligned with X.
To the extent that X and (i) are coherent concepts/claims, I’d agree that (i) is likely true (for now, before TAI).[1] OTOH, (i) seems kinda vacuous, since {humanity as a whole} is (currently) basically an evolving animal population, i.e. an “instantiation” of NS? And of course NS is aligned with NS.
I think (ii) is sketchy at best: Sure, lots of people have a desire/shard to produce things that will last; things like e.g. music, art, genetic offspring, mausoleums, etc. But my impression is that for most humans, that desire is just one among many, and usually not even the strongest desire/shard. (And often the desire to produce lasting things is just a proxy/means for gaining some other thing, e.g. status.)
Thus: I think that—to the extent that it makes sense to think of NS as an optimizer with an objective—individual humans (i.e. the intelligences that NS designed) are in fact unaligned/misaligned with NS’s objectives. I continue to see {NS designing humans} as an example of an optimization process P creating new optimization processes that are misaligned with P.
I feel like I probably missed a bunch of nuance/bits-of-info in the post, though. I’m guessing OP would disagree with my above conclusion. If so, I’m curious what I missed / why they disagree.
Then again, under a sufficiently broad interpretation of X, almost any process is perhaps aligned with X; after all, every process eventually ends up as part of a heat-dead universe, which is itself a very persistent/continues-to-exist pattern?
I think it might be relevant to note here that it’s not really humans who are building current SOTA AIs—rather, it’s some optimizer like SGD that’s doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky’s claim that it’s hard/inefficient to build something without understanding it.
I think it’s important to distinguish between
Scaling up a neural network, and running some kind of fixed algorithm on it.
Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.
IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don’t know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so… *shrugs*.
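If it helps, here's a minimal code-level contrast between the two senses of "scaling up a neural network" distinguished above (my own sketch; not a claim about what Yudkowsky had in mind in 2008):

```python
import numpy as np

# (a) A bigger network running a FIXED, hand-written update rule: more units,
#     same algorithm; the network only ever does what the programmer specified.
def hopfield_update(weights: np.ndarray, state: np.ndarray) -> np.ndarray:
    return np.sign(weights @ state)

# (b) A network whose parameters are rewritten by an optimizer (here, plain
#     gradient descent): given enough data, the learned weights can come to
#     implement algorithms nobody hand-coded.
def sgd_update(weights: np.ndarray, gradient: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    return weights - lr * gradient
```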