Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning

Context: somebody at some point floated the idea that Ronny might (a) understand the arguments coming out of the Quintin/Nora camp, and (b) be able to translate them for Nate. Nate invited Ronny to chat. The chat logs follow, lightly edited.

The basic (counting) argument

Ronny Fernandez

Are you mostly interested in Quintin’s newest post?

I haven’t read it but I don’t suspect it’s his best

So8res

I’m more interested in something like “what are the actual arguments here”.

I’m less interested in “ronny translates for others” and more interested in “what ronny believes after having spoken to others”, albeit with a focus on the arguments that others are making that various locals allegedly buy.

Ronny Fernandez

Sweet that’s way better

So8res

Options: (a) i start asking questions; (b) you poke me when you wanna chat; (c) you monologue a bit about places where you think you know something i don’t

and obviously (d) other, choose your own adventure

Ronny Fernandez

Let’s start real simple. Here is the basic argument from my point of view:

  1. If there’s a superintelligence with goals very different from mine, things are gonna suck real bad.

  2. There will be a superintelligence.

  3. Its goals will be very different from mine.

Therefore: Things will suck real bad.

I totally buy 1 and 2, and find 3 extremely plausible, but less so than I used to for reasons I will explain later. Just curious if you are down with calling that the basic argument for now.

So8res

works for me!

and, that sounds correct to me

Ronny Fernandez

Some points:

  1. One of the main things that has me believing 3 is a sort of counting argument.

  2. Goals really can just be anything, and we’re only selecting on behavior.

  3. Corrigibility is in principle possible, but seems really unnatural.

  4. It makes sense to behave pretty much like a mind with my goals if you’re smart enough to figure out what’s going on, until you get a good coup opportunity, and you coup.

  5. So like P(good behavior | my values) ~ P(good behavior | not my values), so the real question is P(my values), which seems real small
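
(A minimal sketch of the Bayesian bookkeeping behind point 5, with made-up numbers that are not from the conversation: if good behavior in training is about as likely whether or not the mind shares my values, the posterior barely moves off the prior.)

```python
# Toy Bayes update for point 5; all numbers are illustrative assumptions.
p_my_values = 1e-6                 # prior that the trained mind shares my values
p_good_if_my_values = 0.99         # P(good behavior in training | my values)
p_good_if_not = 0.98               # P(good behavior in training | not my values)

p_good = p_good_if_my_values * p_my_values + p_good_if_not * (1 - p_my_values)
posterior = p_good_if_my_values * p_my_values / p_good

print(f"prior:     {p_my_values:.2e}")
print(f"posterior: {posterior:.2e}")   # ~1.01e-06: good behavior is weak evidence,
                                       # so everything rides on P(my values).
```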

So8res

I agree with 1-4 and am shaky on 5

4 is slightly non-sequiturish, though true

Ronny Fernandez

4 is to establish 5

good behaviour in training, to be clear

So8res

my objection to 5 is “so the real question is”; i don’t super buy the frame; there are things to look at like the precise behavior trajectory and the mind internals, and the “real question” involves that stuff etc.

Ronny Fernandez

(This is going to be easier for me if I let myself devil’s advocate more than I think is maximally epistemically healthy. I’m gonna do a bit of that.)

Ok, so here’s an analogy on the counting argument. If you were to naively count the ways the gas in the room might be arranged, you would find that many of them kill you. This is true if you do max entropy over ways it could be described in English. However, if you do max ent over the parameters of the individual particles in the gas, you find that they almost never kill you. It’s also true that if you count the superintelligent programs of length n, almost all of them kill you, but you shouldn’t do max ent over the programs in python or whatever, you should do max ent over the parameters, and then condition that on stochastic gradient descent. This might well tend to average out to finding a model that straightforwardly tries to cause something a lot like what your loss function is pointing at.
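
(A toy numerical gloss on the gas example, with an assumed molecule count: under max ent over the particle parameters, a lethal configuration like “all the air in one half of the room” has vanishingly small measure, even though it takes only a few English words to describe.)

```python
import math

# Assumed, round number for the air molecules in a small room.
n_molecules = 2.5e25

# Under a uniform (max-ent) distribution over positions, the chance that
# every molecule independently lands in the left half of the room:
log10_p = n_molecules * math.log10(0.5)
print(f"log10 P(all air in one half) ~ {log10_p:.3g}")   # about -7.5e24

# A naive count over short English descriptions, by contrast, makes such
# states look common, since "all the air is in one corner" is a short sentence.
```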

So from my actual point of view, it seems like a lot depends on what the machine learning prior is like, and I don’t have much of a clue what that’s like

So8res

a thing i agree with: “if you arrange air particles by sampling a program (according to simplicity) and letting that program arrange them, most resulting configurations kill you. if instead you arrange air particles by sampling an entire configuration (sampled uniformly) then most resulting configurations don’t kill you.” (this is how the physical side of the analogy translates, in my language.)

i don’t understand what analogy you’re trying to draw from there; i don’t understand what things are “programs” and what things are “parameters”

if i sorta squint at your argument, it sounds like you’re trying to say something like “i think that you, nate, think that superintelligent goals are likely to be more like a randomly sampled program, but i think that for all we know maybe inner alignment happens basically automatically”

i don’t understand how your analogy is supposed to be an argument for that claim though

it seems perhaps worth mentioning that my reasons for expecting inner misalignment are not fundamentally “because i know so little that i must assume the goals are random”, but are built from more knowledge than that

Ronny Fernandez

Ok cool, my basic argument is a counting argument

Like basically alignment and corrigibility are high complexity

Disjunction of all other goals plus scheming is much higher weight

So8res

insofar as you’re trying to use that argument to be like “this is baby’s first argument for other goals being plausible at all, and thus we shouldn’t write off the risk”, i’m like “sure”

insofar as you’re like “and this is the main/​strongest argument for the goals turning out elsewise, which i shall now undermine” i’m like “nope”

Ronny Fernandez

Oh nah, this is my primary argument for other goals being much more likely

I think few people do

So8res

(the “plus scheming” also implies to me a difference in our models, i note parenthetically while following the policy of noting each place that something feels off)

Ronny Fernandez

(Agreed scheming is baked in)

Ok cool, I just don’t know the other arguments

So8res

well, this is one place that the analogy from evolution slots in

i could gesture at other arguments, or i could listen to you undermine the argument that you consider primary

to be clear i do think that this primary argument serves as a sort of ignorance prior, later modified by knowledge

Ronny Fernandez

So I always saw the evolution analogy as at best being an existence proof, and a good one, but I don’t see what else it is supposed to tell me

I’m interested in the other arguments and interested in fleshing out the analogy

Especially if we could say it as not an analogy

I also think the relationship between me and my reward mechanisms (or whatever), which I am similarly very misaligned with, is a good analogy

Evolution / Reflection Process is Path Dependent

So8res

well the rough form of the argument is “goals aren’t selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex”

with the force of the argument here going through something like “there are lots of ways for a mind to be architected and few ways for it to be factored around a goal”

at which point we invoke a sort of ignorance prior not over the space of goals but the mechanics of the mind

which is then further (negatively) modified by the practicalities of a mind that must work ok while still stupid

(and at this juncture i interpret the shard theory folk as arguing something like “well the shards that humans build their values up around are very proximal to minds; e.g. perhaps curiosity is instrumentally useful for almost any large-world task and human-esque enjoyment-of-curiosity is actually a near-universal (or at least common) architecture-independent environment-independent strategy for achieving that instrumental value, and we should expect it to get baked into any practical mind’s terminal values in the same way it was baked into ours (or at least find this pretty plausible)”, or something?)

(which seems kinda crazy to me, but perhaps i don’t understand the argument yet and perhaps i shouldn’t be trying to run ahead and perhaps i shouldn’t be trying to argue against other people through you)

Ronny Fernandez

well the rough form of the argument is “goals aren’t selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex”

I’m not sure exactly what I am supposed to get out of this? Minds will tend to be terminally into stuff that is instrumentally useful to the goal of the outer optimizer?

Seems like you’re saying “not random, this other way” what’s the other way?

So8res

that’s a fine specific piece, sure. the more general piece is “there are lots and lots of ways for a mind to achieve low training-loss at a fixed capability level”

Ronny Fernandez

Ok yeah agreed. I didn’t mean to say that you’re selecting a goal, should’ve said program in general

So8res

Seems like you’re saying “not random, this other way” what’s the other way?

not sure what you think the difference is between “not random” and “uniformly random but according to a different measure”; from my perspective i’m basically saying “we can move from random over the space of goals, to random over the space of mind architectures, to random over the space of mind architectures that have to perform well-enough while stupid, to random over the space of mind architectures which are trained using some sort of stochastic gradient descent, to random over the space of mind architectures that consist of training [some specific architecture] using [some specific optimizer]” and i’m like “yep it all looks pretty dour to me”

where it seemed to me like you were trying to say something like “i agree that random over general programs is bad, but for all i know, random over mind architectures has a high chance of being good” and i’m like “hmm well it sounds like we have a disagreement then”

Ronny Fernandez

not sure what you think the difference is between “not random” and “uniformly random but according to a different measure”; from my perspective i’m basically saying “we can move from random over the space of goals, to random over the space of mind architectures, to random over the space of mind architectures that have to perform well-enough while stupid, to random over the space of mind architectures when trained using some sort of SGD, to random over the space of mind architectures when training [some specific architecture] using [some specific optimizer]” and i’m like “yep it all looks pretty dour to me”

I have specific reasons for being like, if you were selecting a python program that was superintelligent, even if you got to watch it in simulation for a million years, then we still definitely all die

I thought those same specific reasons carried over to machine learning more than I currently think they do

Ronny Fernandez

where it seemed to me like you were trying to say something like “i agree that random over general programs is bad, but for all i know, random over mind architectures has a high chance of being good” and i’m like “hmm well it sounds like we have a disagreement then”

Specifically, for all I know, random over the parameter space of maybe-superintelligent planners, conditioned on some straightforward SGD plan, is good

I mean, I’m not gonna risk it

But I don’t have mathematical certainty we’re fucked like I do with python programs

So8res

I thought those same specific reasons carried over to ML more than I currently think they do

so here’s a thing i believe: p(survival | solar system is sampled randomly from physical configurations) << p(survival | solar system is arranged according to a superintelligent program sampled according to simplicity) << p(survival | solar system is arranged according to a randomly trained mind) <* p(survival | solar system is arranged according to a random evolved alien species)

it sounds like there’s maybe some debate about the strength of the <*

Ronny Fernandez

I assume you mean not randomly trained, but just that we keep doing the same thing we’ve been doing

So8res

yeah, sorry, “sampled randomly from the space of trained minds”

Ronny Fernandez

Yeah cool, so I agree with all of them. To be clear, trained by humans who are trying to take over the world and haven’t thought about this, let’s say

So8res

attempting to distinguish two hypotheses you might be arguing for: are you arguing for something more like (a) maybe lots of trained minds happen to be nice (e.g. b/​c curiosity always ends up in them in the same way); or (b) maybe a little bit of ‘design’ (in the sense of that-which-humans-do-and-evolution-does-not) goes a long way

Ronny Fernandez

The second one

Not the first at all

But I don’t have mathematical certainty we’re fucked like I do with python programs

This is where I’m at. Like I know we’re fucked if you select python programs using behavior only

So8res

and is the idea more like

  1. “with a little design effort, getting curiosity in them in the right way is actually easy”

  2. “with a little design effort, maybe we can make limited corrigible things that we can use to do pivotal acts, without needing to load things like curiosity”;

  3. “with a little design effort, maybe we can load all sorts of other things, unlike curiosity, that still add up to something Friendly”?

Ronny Fernandez

It’s more like SGD is some sort of magic, that for some reason has some sort of prior that doesn’t kill us. Like for instance, maybe scheming is very penalized because it takes longer and ML penalizes run time

So8res

(if that’s actually supposed to carry weight, perhaps we do need to drill down on the ‘scheming’ stuff, previously noted as a place where i suspect we diverge)

Ronny Fernandez

It does seem kinda crazy for it to be that big of an advantage

well the rough form of the argument is “goals aren’t selected at random; squirrels need to eat through the winter before they can conceptualize winter or metabolisms or calories or sex”

I want to get into this again. So does it seem right to you to say that the main point of the evolution analogy is that all sorts of random shit will do very well on your loss function at a given capability level? Then if the thing gets more capability, you realize that it did not internalize the loss as it starts getting the random shit?

So8res

more-or-less

Ronny Fernandez

What’s missing or subtly wrong

So8res

a missing piece: the shit isn’t random, it’s a consequence of the mind needing to achieve low-ish loss while dumb

Ronny Fernandez

Does that tell us something else risk relevant?

So8res

which is part of a more general piece, that the structure of the mind happens for reasons, and the reasons tend to be about “that’s the shortest pathway to lower loss given the environment and the capability level”, and once you see that there are all sorts of path-dependent specific shortcuts it starts to seem like the space of possible mind-architectures is quite wide

then there are deeper pieces, zooming into the case of humans, about how the various patched-on pieces are sometimes in conflict with each other and other hacks are recruited to resolve those conflicts, resulting in this big dynamic system with unknown behavior under reflection

Ronny Fernandez

Can you give two examples of path dependent specific shortcuts for the same loss function?

So8res

sure

hunger, curiosity

Ronny Fernandez

Right

Hmm

Ok i was imagining like, maybe to breed you get really into putting your penis into lips, or maybe you get really into wrapping your penis in warm stuff

hunger, curiosity

So they aren’t mutually exclusive?

These aren’t like training histories

What are the paths that hunger and curiosity are dependent on?

So8res

maybe i don’t understand your question, but yeah, the sort of thing i’m talking about is “the easiest way to perturb a mind to be slightly better at achieving a target is rarely for it to desire the target and conceptualize it accurately and pursue it for its own sake”

Ronny Fernandez

Ahh nice that’s very helpful

So8res

there’s often just shortcuts like “desire food with appropriate taste profiles” or whatever

the specifics of hunger are probably pretty dependent on the specifics of biology and available meals in the environment of evolutionary adaptedness

i wouldn’t be surprised if the specifics of curiosity were dependent on the specifics of the social pressures that shaped us

(though also, more generally, it being curiosity-per-se that got promoted to terminal, as opposed to a different cut of the possible instrumental strategies being promoted, seems like a roll of the dice to me)

Ronny Fernandez

The thing I think Quintin successfully criticizes is the analogy as an n = 1 argument for misalignment by default, which to be fair was already a very weak argument

So8res

also i suspect that curiosity is slightly more likely to be something that random minds absorb into their terminal goals, depending how those dice come up.

things like “fairness” and “friendship” seem way more dependent on the particulars of the social environment in the environment of evolutionary adaptedness

and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection

Ronny Fernandez

and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection

Says the CEV guy

So8res

though another big chunk of my eyebrow-raising comes from the implicit hypothesis that “absorb a human-like slice of the instrumental values into terminal values in a human-like way” is a particularly generic way to do things even in wildly different architectures under wildly different training regimes

Ronny Fernandez

Says the CEV guy

(We don’t need to open that can of worms now, but I would like to some day)

though another big chunk of my eyebrow-raising comes from the implicit hypothesis that “absorb a human-like slice of the instrumental values into terminal values in a human-like way” is a particularly generic way to do things even in wildly different architectures under wildly different training regimes

Ok yeah, I also think that’s bs

and much of my actual eyebrow-raising at this space of hypotheses comes from the way that i expect the end result to be quite sensitive to the processes of reflection

And agree with this

So8res

here i have some sense of, like, “one could argue that all land-based weight-carrying devices must share properties of horses all day long, before their hypothesis space has been suitably widened”

Ronny Fernandez

I’m like look, I used to think the chances of alignment by default were like 2^-10000:1

So8res

(We don’t need to open that can of worms but I would like to some day)

(yeah seems like a tangent here, but i will at least note that “all architectures and training processes lead to the absorption of instrumental values into terminal values in a human-esque way, under a regime of human-esque reflection” and “most humans (with fully-functioning brains) have in some sense absorbed sufficiently similar values and reflective machinery that they converge to roughly the same place” seem pretty orthogonal to me)

Ronny Fernandez

I now can’t give you a number with anything like the mathematical precision I used to think I could give

So8res

ehh it feels to me like i can get you more than 100:1 against alignment by default in the very strongest sense; i feel like my knowledge of possible mind architectures (and my awareness of stochastic gradient descent-accessible shortcut-hacks) rules out “naive training leads to friendly AIs”

probably more extreme than 2^-100:1, is my guess

it seems to me like all the room for argument is in “maybe with cleverness and oversight we can do much better than naive training, actually”

Ronny Fernandez

I’m like look, I used to think the chances of alignment by default were like 2^-10000:1

I still think I can do this if we’re searching over python programs

So8res

The thing I think Quintin successfully criticizes is the analogy as an n = 1 argument for misalignment by default, which to be fair was already a very weak argument

(yeah i have never really had the sense that the evolutionary arguments he criticizes are the ones i’m trying to make)

I still think I can do this if we’re searching over python programs

yeah sure

Ronny Fernandez

and is the idea more like

  1. “with a little design effort, getting curiosity in them in the right way is actually easy”

  2. “with a little design effort, maybe we can make limited corrigible things that we can use to do pivotal acts, without needing to load things like curiosity”;

  3. “with a little design effort, maybe we can load all sorts of other things, unlike curiosity, that still add up to something Friendly”?

2 seems best to me right now

So8res

it seems maybe worth saying that my model says: i expect the naive methods to take lots of hacks and shortcuts and etc., such that it’d betray-you-if-scaled in a manner that would be clear if you knew how to look and interpret what you saw, and i mostly expect humanity to die b/​c i expect them to screw their eyes shut in the relevant ways

Ronny Fernandez

Ok yeah that seems plausible

So8res

and in particular, if you could figure out how these minds were working, and see all the shortcuts etc. etc., you could probably figure out how to do the job properly (at least to the bar of a minimal pivotal task)

this is part of what i mean by “i don’t think alignment is all that hard”

my high expectation of doom comes from a sense that there’s lots of hurdles and that humanity will flub at least one (and probably lots)

so insofar as you’re trying to argue something like “masters of mind and cognition would maybe not have a challenge” i’m like “yeah sure”

(though insofar as you’re arguing something like “maybe naive techniques work” i’m like “i think i see enough hacky shortcuts that hill-climbing-like approaches can take, and all the ‘clever’ ideas people propose seem to me to just wade around in the sea of hacky shortcuts, and i don’t personally have hope there”)

i shall now stfu

Summary and discussion of training an agent in a simulation

Ronny Fernandez

Ok, so, I want to summarize where we are. Here are some things that seem important to me:

We agree roughly about what happens if you select a python program for superintelligent good behavior: you almost always end up with an unaligned mind that will coup eventually.

I was like well, Quintin convinced me that the prior over models is very different from the prior over python programs.

I put the main argument in terms of the prior. You basically were like nah, it’s not just the prior, it also matters a lot that SGD is going to do things incrementally. Most incremental changes you can make to a mind to achieve a certain loss are not going to cause the mind to be into the loss itself.

So8res

(i’m not entirely sure what “select for superintelligent good behavior” means; i’d agree “simplicity-sampled python superintelligences kill you (if you have enough compute to run them and keep sampling until you get one that does anything superintelligent at all)” and if you want to say “that remains true if you condition on ones that behave well in a training setup” then i’d need to know what ‘well’ means and what the training setup is. but i expect this not to be a sticking point.)

that sounds roughly right-ish to me,

though i don’t really understand where you draw the distinction between “the prior over models (selected by SGD)” and “arguments about how incremental changes are likely to affect minds”

Ronny Fernandez

(i’m not entirely sure what “select for superintelligent good behavior” means; i’d agree “simplicity-sampled python superintelligences kill you” and if you want to say “that remains true if you condition on ones that behave well in a training setup” then i’d need to know what ‘well’ means and what the training setup is. but i expect this not to be a sticking point.)

I mean you get to watch them in simulation but you do not get to read/​understand the code

So8res

like, from my perspective, the arguments about incrementality are sorta arguments about what the prior over (SGD-trained) models looks like

but also i don’t care /​ i’m not bidding we go into it, i’m just noting where things seem not-quite-how-i-would-dice-them-up

Ronny Fernandez

Interesting, I’m thinking of them as like, being part of P(model | data) rather than P(model)

So8res

i’d also additionally note that the point i’m trying to drive at here is a little less like “incremental changes don’t make the mind care about loss” and a little more like “the prior is still really wide, so wide that a counting argument still more-or-less works”

I mean you get to watch them in simulation but you do not get to read/​understand the code

(sure, but like, how ironclad is the simulation and what are you watching them do?)

Ronny Fernandez

i’d also additionally note that the point i’m trying to drive at here is a little less like “incremental changes don’t make the mind care about loss” and a little more like “the prior is still really wide, so wide that a counting argument still more-or-less works”

I mean, this seems intuitively plausible to me, but I wouldn’t be able to convince a reasonable person to whom it was not intuitively plausible

So8res

the place where the argument is supposed to have force is related to “you can argue all you want that any flying device will have to flap its wings, and that won’t constrain airplane designs”

i’m not sure whether you’re saying something like “i don’t believe that that’s the actual part of the argument that has force, and hereby query you to articulate more of the forceful parts of the argument”, vs whether you’re saying things like “that argument is not in a format accepted by the Unconvinceable Hordes, despite being valid and forceful to me”, or...?

So8res

(but for the record, insofar as you’re like “do you expect that to convince the Unconvinceable Hordes?” i’m mostly like “no, mr bond, i expect us to die”)

Ronny Fernandez

No no, I’m imagining convincing Ronnys with only slightly different intuitions or histories. Such people are much more convinceable

So8res

More like the first, and it’s more like I don’t quite understand where the force comes from

do you have much experience with programming?

Ronny Fernandez

Only python and only like pretty junior level

So8res

do you have familiarity with, like, the sense that a task seems straightforward, only to find an explosion of options in the implementation?

Ronny Fernandez

I don’t know, I implemented GPT-2 with tutors once

do you have familiarity with, like, the sense that a task seems straightforward, only to find an explosion of options in the implementation?

yeah

maybe not enough

For sure with Airtable and Zapier automations actually

So8res

cool

and relatedly, consider… arguments that airplanes must flap their wings. or arguments that computers shouldn’t be able to run all that much faster than brains. or arguments that robots must run on something kinda like a metabolic system.

where the point in those examples is not just “artificial things work with different underlying mechanics than biological things” but that there are lots of ways to do things, including ones that start way outside your hypothesis space

Ronny Fernandez

in fact artificial things do not work with different underlying mechanics, there’s just lots of mechanics and it rarely turns out that we do it the same way

So8res

right

and when you don’t understand in detail how two (or possibly even three) different things work then you’re likely to dramatically underestimate the width of the space

perhaps i am moving too fast here. it sounded to me like you were like “the prior over models is different than the prior over programs” and i was like “yep” and then you were like “so there’s an appreciable chance i’ll win the lottery” and i was like ”?? no” and you were like “wait why not?” and i was like “because the space is still real wide”

Ronny Fernandez

Yeah, I think definitely any space of models large enough to contain superintelligent aligned things also contains lots and lots of superintelligent non-aligned things

Alignment problem probably fixable, but likely won’t be fixed

So8res

“SGD makes incremental changes, and the minds have to work while dumb, and there are lots of ways for SGD to make a mind work better while dumb that don’t do the thing you want” is an argument that’s (a) correct in its own right, but also (b) sheds light on how many ways to do the job wrong there are

from which, i claim, it’s proper to generalize that the space is wide; to see that arguments of the form “maybe i win the lottery” are basically analogous to arguments of the form “maybe human minds are near the limit of physical constraints on cognition”

Ronny Fernandez

Of the form?

Not just of the quality?

So8res

i’m not sure what distinction you’re trying to draw there

Ronny Fernandez

Like they’re analogous somehow

So8res

the arguments seem to me similarly valid (and in particular, invalid)

like, the “SGD makes incremental changes” is one plausible-feeling example of how if you really understood what was going on inside that mind, you’d cry out in terror

So8res

from which we generalize not that you’ll see exactly that thing, but that you will in fact cry out in terror

when there’s a plausible way for code to have the cry-out-in-terror property, it very likely will unless counter-optimization was somehow applied

Ronny Fernandez

Is the main problem here like that you end up with something that will coup you later or something that will build things that will kill you later/​get smarter and then start wearing condoms

So8res

so my argument is not “and this survives all counter-optimization”

my reason for expecting doom is not that i think this problem is unfixable, it’s that i think it won’t be addressed

So8res

that said, my guess is that it will take something much more like “understand the mind” than “provide better training”

but, like, the argument against “just put a bit of thought into the training” working has a bunch less force than the argument against “just train it to be good” working

(still, i think, considerable force, but)

Ronny Fernandez

but does it survive all counter-optimization selecting only on behavior, or no?

So8res

Is the main problem here like that you end up with something that will coup you later or something that will build things that will kill you later/​get smarter and then start wearing condoms

not sure i see the distinction

Ronny Fernandez

Like, humans weren’t waiting around for a certain number to be factored until they couped evolution, they’re more like the second thing

So8res

that seems to me that it’s more of a fact about evolution not watching them and slapping down visible defiance, than a fact about human psychology?

(or, well, a fact about “evolution not slapping down visible defiance” plus a fact about “humans not yet being smart enough to coordinate to overcome that”, but)

Ronny Fernandez

Yeah, like if evolution were very shortsighted, I think it should be happy with early us

I think similarly, if we are very shortsighted, we might be happy with early models before they’re capable enough that the divergence between what we wanted and what they want is apparent

So8res

sure

...insofar as there’s a live question, i still don’t understand it

Ronny Fernandez

Well this is different from: you get a superintelligence, and it’s like “hmm, I’m not sure if I’m in training or not, let me follow a strategy that maximizes my chance of couping when not in training”

So8res

if you’re like “what goes wrong if you breed chimps to be better at inclusive genetic fitness and also smarter” then i’m mostly like “a chimp needs to eat long before it can conceptualize calories; the hunger thing is going to be really deep in there” (or, more generally, you’d get some mental architecture that solves your training problem while being unlike what you wanted, but I’ll use that example for now).

could you in principle breed them to the point that they stop having a hunger drive and start hooking in their caloric-intake to their explicit-model of IGF? probably, but it’d probably take (a) quite a lot of training, and (b) a bunch of visibility into the mind to see what’s working and what’s not.

mostly if you try that you die to earlier generations that rise up against you; if not then you die to the fact that you were probably measuring progress wrong (and getting things that still deeply enjoy eating nice meals but pretend they don’t b/​c that’s what it turns out you were training for)

i doubt that the rising-up ever needs to depend on factoring a large number; that only happens if the monkeys think you’re extremely good at spoofing their internal states, and you aren’t (in this hypothetical where you don’t actually understand much of what’s going on in their minds)

but whether it happens right out in the open (because you, arguendo, don’t understand their minds well enough to read those thoughts) or whether it feels like a great betrayal (e.g. because they were half-convinced that they were your friends, and only started piecing things together once they got smarter) feels like… i dunno, could go either way

(cf planecrash, i think that big parts of planecrash were more-or-less about this point)

Well this is different from: you get a superintelligence, and it’s like “hmm, I’m not sure if I’m in training or not, let me follow a strategy that maximizes my chance of couping when not in training”

yeah this ~never happens, especially if you haven’t attained mastery of their mind

it’s maybe possible if you take the “master mind” route, though i really would not recommend it; if you have that kind of mastery you should have better options available

Discussing whether this argument about training can be formalized

So8res

shall we make this into an LW dialog of some sort? push for more formality?

Ronny Fernandez

Yeah, I’m down. Let’s do both.

So8res

seems kinda hard to make something formal to me because the basic argument is, i think, “there’s really a lot of ways for a model to do well in training”, but i don’t know how one is supposed to formalize that. i guess i’m curious where you think the force of formality comes in for the analogous argument when it comes to python programs

Ronny Fernandez

I think it’s like: if someone asks me “why do you think the probability is so low?”, I can explain that I have an argument that it’s ridiculously low, which I put a fair amount of weight on.

So8res

sure, but what is the formal argument (for python programs) that it’s ridiculously low?

like, suppose someone says “you don’t know what python programs conditioned on apparent-superintelligence and apparent-goodness do, why should i believe you?”

Ronny Fernandez

Ahh great, I did write a post about this but:

Uhh suppose that you have arbitrary finite compute, but no mind access. You get to select programs by running them in an extremely high-fidelity environment that even a superintelligence can’t distinguish from reality. You run them from some lab and then watch for five years. Almost always nothing happens, so you go to the next program. Interestingly, you do eventually find superintelligent programs, but they almost always act very aligned.

Why is this? Because P(acting aligned for 5 years | simulation and not aligned) ~ P(acting aligned for 5 years | simulation and aligned). So, we are stuck figuring out P(aligned).

Well, human values are complicated. It takes lots of bits to specify them. So it’s unlikely that you end up with a python program that has them as its goals, since literally any mind with any other goal would also act that way in the simulation.

So say it takes 1000 bits to specify human values; then you know you have roughly a 2^-1000 chance, because your sim only gives you evidence of superintelligence, not of sharing the particular goal.

(The argument is similar for P(corrigible).)
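
(A rough log-odds version of this step, under Ronny’s stated assumptions: roughly 1000 bits to pin down human values, and a simulation that an unaligned superintelligence passes about as reliably as an aligned one.)

```python
import math

# Assumptions from the argument above: prior log2-odds of alignment ~ -1000 bits,
# and five years of aligned-looking behavior is nearly as likely either way.
prior_log2_odds = -1000.0
likelihood_ratio_bits = math.log2(0.99 / 0.98)   # evidence from the simulation

posterior_log2_odds = prior_log2_odds + likelihood_ratio_bits
print(f"posterior log2-odds of aligned ~ {posterior_log2_odds:.2f} bits")
# ~ -999.99: the behavioral evidence buys back a small fraction of a bit,
# so the posterior probability is still about 2^-1000.
```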

So8res

i’m slightly hung up here on how it doesn’t just hack its way through the outer observers but

Ronny Fernandez

Yeah I don’t know, magic

So8res

and what would you say to the objection that the observers can simply wait and watch for something that professes human values from the get-go, rather than waiting to profess its values until it looks around and figures out what the observers want?

Ronny Fernandez

Well back when I used to think this was analogous-ish to machine learning, I would have said that won’t happen

It will already have info about what you want

It might for instance have a record of what programs you vetoed before

So8res

is the claim “misaligned AIs will figure out what you want so quickly that you won’t be able to notice their hesitation”?

Ronny Fernandez

No no, it’s that if you were training the boy, he would know. You’re not actually, but it’s supposed to be an analogy, so you give him the power

So8res

wait, so you’re like “we’re going to take arbitrary python programs, credibly show them they’re under observation for 5y, ask them to do nice things for us, and then set them loose if they did in fact do the nice things”?

Ronny Fernandez

Well, they figure out they’re in the simulation. They do have to figure that bit out. But they don’t have to figure out that they’re being watched by humans

And I’m like you can do whatever, but you don’t have mind access, and you have arbitrary finite compute. You have to select a superintelligent program using behavior. You’re fucked no matter what you do

So8res

but the force of the argument here is supposed to come from “the observation that it does nice things while under our thumb is not much evidence about whether it’ll do nice things when unleashed”?

Ronny Fernandez

but the force of the argument here is supposed to come from “the observation that it does nice things while under our thumb is not much evidence about whether it’ll do nice things when unleashed”

That and that P(aligned or corrigible) is tiny

Also we don’t have to tell it it’s under our thumb, we just have to tell it we’re humans

So8res

but suppose someone says: aha, but we are training the boy, and so this argument doesn’t have nearly the force of 2^-1000, because there exist python programs that, in less than 1000 bits, say “optimize whatever concept you’re being trained towards”

Ronny Fernandez

Yeah or similar. I think it’s much less analogous than I used to think it was, but the broad structure I think is in some ways similar to the broad structure of the argument you gave

At the level of like your argument is still, your data isn’t much evidence, and the prior of your favored outcome is tiny

So8res

(yeah i think my state is something like “old argument was strong but not that strong; new argument is strong but not that strong” and i can’t tell whether you’re like “i now agree (but used to not)” or whether you’re like “it still looks to me like old argument was super strong, new argument is comparably weak”)

Ronny Fernandez

Old argument is still strong for python programs, is weak as analogy for machine learning. I want comparably strong argument for ML

Or like, I want to dig in on why the evidence is weak, and why the prior is small in ML. No analogy

So8res

in the sampled-python-program case, it does seem to me like the number of bits in the exponent is bounded by min(|your-values|, |do-what-they-mean|), where my guess is that |do-what-they-mean| is shorter than |human-values|, which weakens the argument somewhat

(albeit not as much shorter as one might hope; it probably takes a lot of humane values to figure out “what we mean” in a humane way)

(this is essentially the observation that indirect normativity is probably significantly easier than fully encoding our values, albeit still not easy)

(perhaps you’re like “eh, it still seems it should take hundreds if not thousands of bits to code for indirect normativity”?, to which i’d be like “sure maybe”, as per the first parenthetical caveat)

same point from a different angle: the strength of the argument is not based on the K-complexity of our values, it’s based on the cross-entropy between our values and the training distribution
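
(A worked instance of that min(...) bound, with illustrative bit counts that are assumptions rather than anything from the conversation.)

```python
# Illustrative bit counts only.
bits_human_values = 1000      # |your-values|: fully specifying what we want
bits_do_what_they_mean = 300  # |do-what-they-mean|: pointing at the operators' intent

exponent = min(bits_human_values, bits_do_what_they_mean)
print(f"counting-argument penalty is at most ~2^-{exponent}")
# Here the bound is 2^-300 rather than 2^-1000: indirect normativity caps how
# much force the "our values take lots of bits" version of the argument has.
```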

Ronny Fernandez

same point from a different angle: the strength of the argument is not based on the K-complexity of our values, it’s based on the cross-entropy between our values and the training distribution

Interesting

I mean even if it was 3 bits, 7/8 of the time we’re fucked

So8res

yeah totally (i was thinking of saying that myself)

Ronny Fernandez

Good enough for me

So8res

so part of why i’m drilling on this here is something like “i suspect that model-space and program-space are actually pretty similar analogy-wise, and that the reason your intuitions treat them very differently is that when you think of training models, for some reason the difference between K-complexity(values) and cross-entropy(training-data, values) becomes more salient”

though i guess you might be like “SGD is a really different prior from program length”

Ronny Fernandez

(another way of saying the observation is that the strength of the argument is not based on the K-complexity of our values, it’s based on the cross-entropy between our values and the training distribution)

I wanna understand this point more then. That’s interesting

tho i guess you might be like “SGD is a really different prior from program length”

Yeah that’s right

Or like model space is

Maybe also SGD is unlike Bayesian updating

So8res

yeah i don’t understand the “model space is different” thing, like, models are essentially just giant differentiable computation graphs (and they don’t have to use all that compute); i don’t see what’s so different between them and python programs

(it sounds almost like someone saying “ok i see how this argument works for python, but i don’t understand how it’s supposed to work for C” or something)

though “well we search the space very differently” makes sense to me

Ronny Fernandez

Well for one, they’re of finite run time

That seems pretty importantly different

So8res

so does your whole sense of difference go out the window if we do something autogpt-ish?

Ronny Fernandez

Let me think

It’s still weird in that you’re selecting a finite run time thing and then iterating that exact thing

So8res

sure

does the difference go out the window once people are optimizing in part according to the auto-ized version’s performance?

Ronny Fernandez

Yeah it sure starts to for me? I feel like I’ll talk to Quintin at some point and then he’ll make me not feel that way, though

So8res

and: how about “runs for long enough, e.g. by doing a finite-but-large number of loops through a fixed architecture”?

Ronny Fernandez

and: how about “runs for long enough, e.g. by doing a finite-but-large number of loops through a fixed architecture”?

How’s this different from the last one?

So8res

/​shrug, it’s not supposed to be a super sharp line, but on one end of the spectrum you could imagine lower-level loops/​recurrence in training (after some architecture tweaks), and on the other end of the spectrum you could imagine language models playing a part in larger programs a la auto-GPT

also, if runtime got long enough, would it stop mattering?

Ronny Fernandez

I mean definitely if it got long enough

Enough might be really big

There are programming languages that compile into transformers. I wonder what they’re like

So8res

cool. so if we’re like “well, SGD may find different programs, and also we’re currently selecting over programs for their ability to perform a single pretty-short pass well”, then i’m like: yep those seem like real differences

i agree that if #2 holds up then that could shake things up a fair bit.

but insofar as your point is supposed to hold even if #2 falls, it seems to me that you’re basically saying that the cross-entropy between the training distribution and human values might be way smaller when we sample according to SGD rather than when we sample according to program length

i suspect that’s false, personally

though also i guess i’ll pause and give you an opportunity to object to this whole frame

Ronny Fernandez

I’m a bit worried about Quintin feeling misrepresented by me so I guess I should say that I am emphatically not representing Quintin here. I def want to say something like I’m sure Quintin would be much more persuasive to me than I was to myself, and that if Quintin were sitting next to me coaching me, I would’ve been much more convincing to everyone overall. I’m pretty confident of that.

I think the best thing for me to do here is to go off and read some more things that are optimistic about good results from scaling ML to superintelligence, and then come back and have another conversation with you.