i think you’re right that the sohl-dickstein post+survey also conflates different notions, and i might even have added more notions into the mix with my list of questions trying to get at some notion(s) [1]
a monograph untangling this coherence mess some more would be valuable. it could do the following things:
specifying a bunch of a priori different properties that could be called “coherence”
discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
giving good names to the notions or notion-clusters
discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, which ones can both increase or decrease with capabilities depending on the development/learning process, both around human level and later/eventually, in human-like minds and more generally [2]
discussing how this relates to AI x risk. like, which kinds of coherence should play a role in a case for AI x risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general? [3]
- ↩︎
i didn’t re-read that post before writing my comment above
- ↩︎
the answers to some of these questions might depend on some partly “metaphysical” facts like whether math is genuinely infinite or whether technological maturity is a thing
- ↩︎
i think the optimistic conclusions are unlikely, but i wouldn’t want to pre-write that conclusion for the monograph, especially if i’m not writing it
hmm, like i think there’s a reasonable sense of “coherence” such that it plausibly doesn’t typically increase with capabilities. i think the survey respondents here are talking about something meaningful and i probably agree with most of their judgments about that thing. for example, with that notion of coherence, i probably agree with “Google (the company) is less coherent now than it was when it had <10 employees” (and this is so even though Google is more capable now than it was when it had 10 employees)
this “coherence” is sth like “not being a hot mess” or “making internal tradeoffs efficiently” or “being well-orchestrated”. in this sense, “incoherence” is getting at the following things:
to what extent are different parts of the guy out of sync with each other (like, as a % of how well they could be in sync)?
to what extent is the guy leaving value on the table compared to using the same parts differently? are there many opportunities for beneficial small rearrangements of parts?
how many arbitrage opportunities are there between the guy’s activities/parts? (one toy way to make this a bit more concrete is sketched just after this list)
to what extent does it make sense to see all the parts/activities of the guy as working toward the same purpose?
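(an aside, to make the arbitrage question a bit more concrete: here’s one possible toy formalization — entirely my own illustrative sketch, with made-up resources, rates, and function names, and not a claim about how this notion should ultimately be cashed out. treat each part of the guy as offering exchange rates between some resources; the guy has an internal arbitrage opportunity iff some cycle of trades across its parts has rates multiplying to more than 1, which, after taking logs, is standard negative-cycle detection.)

```python
# toy sketch (my own illustrative formalization, not from the post):
# each "part" of the agent offers some exchange rates between resources;
# an internal "arbitrage opportunity" = a cycle of trades whose rates multiply to > 1,
# i.e. a way to rearrange internal trades and come out strictly ahead.
# taking -log(rate) as edge weights turns this into negative-cycle detection (Bellman-Ford).

import math

# hypothetical parts/rates, purely for illustration
trades = [
    # (from_resource, to_resource, exchange rate offered by some part)
    ("time", "money", 2.0),
    ("money", "energy", 0.6),
    ("energy", "time", 1.0),
]

def has_internal_arbitrage(trades):
    nodes = sorted({r for a, b, _ in trades for r in (a, b)})
    index = {r: i for i, r in enumerate(nodes)}
    edges = [(index[a], index[b], -math.log(rate)) for a, b, rate in trades]

    # Bellman-Ford with an implicit zero-weight source to every node
    dist = [0.0] * len(nodes)
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w

    # if any edge can still be relaxed, there is a negative cycle,
    # i.e. a cycle of trades whose rates multiply to more than 1
    return any(dist[u] + w < dist[v] for u, v, w in edges)

print(has_internal_arbitrage(trades))  # True: 2.0 * 0.6 * 1.0 = 1.2 > 1
```

(of course, real parts of a mind or a company don’t trade in a single common currency, so this at best gestures at one way of operationalizing “leaving value on the table”.)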
with this notion, i think there are many naturally-occurring cases of someone becoming more capable but less “coherent”. e.g. maybe i read a textbook and surface-level-learn some new definitions and theorems and i can now solve the problems in the textbook, but the mathematical understanding i just gained is less integrated with the rest of my understanding than usual for me given that i’ve only surface-level-learned this stuff (and let’s assume surface-level-learning this didn’t let me integrate other existing stuff better) — like, maybe i mostly don’t see how this theorem relates to other theorems, and wouldn’t be able to easily recognize contexts in which it could be useful, and wouldn’t be able to prove it, and it doesn’t yet really make intuitive sense to me that it has to be true — so now i’m better at math but in a sense less coherent. e.g. maybe i get into acrobatics but don’t integrate that interest with the rest of my life much. eg maybe as an infant it was easy to see me as mostly orchestrating my like 5 possible actions well toward like being fed when hungry and sleeping when sleepy, but it’s less clear how to see me now as orchestrating most of my parts well toward something. [1]
now there is the following response to this:
ok, maybe, but who cares about this “coherence”. maybe there is a notion such that maybe a nematode is more coherent than a human who is more coherent than the first substantially smarter-than-human artificial system. but if you are a nascent orca civilization, it’s much better to find yourself next to a nematode, than to find yourself next to a human, than to find yourself next to the first substantially smarter-than-human artificial system. we’re talking about another notion of “coherence” — one that helps make sense of this
my thoughts on this response:
i agree we’re fucked even if “the first ASI is very incoherent” in this sense (on inside view, i’m at like that creating AGI any time soon (as opposed to continuing developing as humans) would be the greatest tragedy in history so far, and at like that there won’t even be a minimal human future if this happens)
one can make a case for AI risk while not saying “coherence”, just talking of capabilities (and maybe values). indeed, this is a common response in the LW comments on the post i referenced. here’s me providing a case like that
if one wants to make a case for AI risk involving a different sense of “coherence”, then one might be assuming a meaning different than the most immediate meaning, so one would want to be careful when using that word. one might end up causing many people to understand why AI is scary significantly less well than they could have if one took more care with language! (eg: maybe amodei; maybe some of these people whose paper i still haven’t skimmed.) there are probably interesting things to say about AI risk involving e.g. some of the following properties an AI might have: the ability to decompose problems, the ability to ask new relevant questions, being good at coming up with clever new approaches to hard challenges, being strategic about how to do something, trying many approaches to a problem, being relentless, not getting too confused, resolving inconsistencies in one’s views, the ability or tendency to orchestrate many actions or mental elements toward some task (eg across a lot of time). but i want to suggest that maybe it’s good to avoid the word “coherence” here given the potential for confusion, or to establish some common standard, e.g. calling the quality of the orchestration of one’s parts compared to what is possible with small rearrangements “relative coherence” and calling the ability to put many things together “absolute coherence”
i also think there’s plausibly some genuine mistake being made by many on LW around thinking that systems are increasingly good pursuers of some goal. it seems sorta contrived to view humans this way. humans have projects and a learning human tends to become better at doing any given thing, but i feel like there doesn’t need to be some grand project that a human’s various projects are increasingly contributing to or whatever. or like, i’m open to this property convergently showing up (ever? or close to our present capability level?), but i don’t think i’ve seen a good analysis of this question supporting that conclusion. imo, intuitively, opportunities for completely new projects will open up in the future and i can get interested in them with no requirement that they fit together well with my previous projects or whatever. [2] [3]
if someone gives an argument against “the first AGI/ASI will be coherent” and thinks they have given a good argument against AI risk, i think they’ve probably made a serious mistake. but i think it’s like sort of an understandable mistake given that LW arguments for AI risk do emphasize some sort of thing called “coherence” too, probably often with some conflation between these notions (or an imo probably false claim they are equivalent)
- ↩︎
i’m somewhat orchestrated toward understanding AI stuff better or getting AGI banned for a very long time or something but i’m probably leaving value massively on the table all over the place, i think in a sense much more than i was as an infant. (and also, this isn’t “my terminal goal”.)
- ↩︎
- ↩︎
the closest thing to this grand optimizer claim that imo makes sense is: it is generic to have values; it is generic to have opinions on what things should be like. this seems sufficient for a basic case for AI risk, as follows: if you’re next to an anthill and you’re more capable than the ant colony, then it is generic that the ants’ thoughts about what things should be like will not matter for long. (with AI, humanity is the ant colony.)
i haven’t even skimmed the anthropic paper and i have a high prior that they are being bad at philosophy but also: i think there is plausibly a real mistake LW-ers are making around coherence too, downstream of a conflation of two different notions, as i outline here: https://www.lesswrong.com/posts/jL7uDE5oH4HddYq4u/raemon-s-shortform?commentId=WBk9a7TEA5Benjzsu
with like my guess being that: you are saying something straightforwardly true given one notion here but they are making claims given the other notion at least in some cases, though also they might be conflating the two and you might be conflating the two. one could argue that it is fine to “conflate” the two because they are really equivalent, but i think that’s probably false (but non-obviously)
I find it interesting and unfortunate that there aren’t more economically left-wing thinkers influenced by Yudkowsky/LW thinking about AGI. It seems like a very natural combination given e.g. “Marx subsequently developed an influential theory of history—often called historical materialism—centred around the idea that forms of society rise and fall as they further and then impede the development of human productive power.”. It seems likely that LW being very pro-capitalism has meaningfully contributed to the lack of these sorts of people. [1] I guess ACS carries sth like this vibe. But (unlike ACS) it also seems natural to apply this sort of view of history to AI except also thinking that fooming will be fast. [2]

Relatedly, I wonder if I should be “following the money” more when thinking about AI risk. In particular, instead of saying that “AI researchers/companies” will disempower humanity, maybe it would be appropriate to instead or additionally say “(AI) capitalists and capital and capitalism”. My current guess is that while it is appropriate to place a bunch of blame on these, it’s also true that e.g. Soviet or Chinese systems [wouldn’t be]/aren’t doing better, so I’ve mostly avoided saying this so far. That said, my guess is that if the world were much more like Europe, we would be dying with significantly more dignity, in part due to Europe getting some hyperparameters of governance+society+culture+life more right due to blind luck, but also actually in part due to getting some hyperparameters right because of good reasoning that was basically tracking something logically connected to AI risk (though so far not significantly explicitly tracking AI risk), e.g. via humanism.

Another example of a case where I wonder if I should follow the money more is: to what extent should I think of Constellation being wrong/confused/thoughtless/slop-producing on AGI risk in ways xyz as “really being largely about” OpenPhil/Moskovitz/[some sort of outside view impression on AI risk that maybe controls these] being wrong/confused/thoughtless/slop-liking on AGI risk in ways x′y′z′.
I’ve been meaning to spend at least a few weeks thinking these sorts of questions through carefully, but I haven’t gotten around to that yet. I should maybe seek out some interesting [left-hegelians]/marxists/communists/socialists to talk to and try to understand how they’d think about these things.
Under this view, political/economic systems that produce less growth but don’t create the incentives for unbounded competition are preferred. Sadly, for Molochian reasons this seems hard to pull off.
Imo one interesting angle of attack on this question is: it seems plausible/likely that an individual human could develop for a very long time without committing suicide with AI or otherwise (imo unlike humanity as it is currently organized); we should be able to understand what differences between a human and society are responsible for this — like, my guess is that there is a small set of properties here that could be identified; we could try to then figure out what the easiest way is to make humanity have these properties.
- ↩︎
By saying this, I don’t mean to imply that LW is incorrect/bad to be very pro-capitalism. Whether it is bad is mostly a matter of whether it is incorrect, and whether it is incorrect is an open question to me.
- ↩︎
I guess this post of mine is the closest thing that quickly comes to mind when I try to think of something carrying that vibe, but it’s still really quite far.
- ↩︎
When fooming, uphold the option to live in an AGI-free world.
There are people who think (imo correctly) that there will be at least one vastly superhuman AI in the next 100 years by default and (imo incorrectly) that proceeding along the AI path does not lead to human extinction or disempowerment by default. My anecdotal impression is that a significant fraction (maybe most) of such people think (imo incorrectly) that letting Anthropic/Claude do recursive self-improvement and be a forever-sovereign would probably go really well for humanity. The point of this note is to make the following proposal and request: if you ever let an AI self-improve, or more generally if you have AIs creating successor AIs, or even more generally if you let the AI world develop and outpace humans in some other way, or if you try to run some process where boxed AIs are supposed to create an initial ASI sovereign, or if you try to have AIs “solve alignment” [1] (in one of the ways already listed, or in some other way), or if you are an AI (or human mind upload) involved in some such scheme, [2] try to make it so the following property is upheld:
It should be possible for each current human to decide to go live in an AGI-free world. In more detail:
There is to be (let’s say) a galaxy such that AGI is to be banned in this galaxy forever, except for AGI which does some minimal stuff sufficient to enforce this AI ban. [3]
There should somehow be no way for anything from the rest of the universe to affect what happens in this galaxy. In particular, there should probably not be any way for people in this galaxy to observe what happened elsewhere.
If a person chooses to move to this galaxy, they should wake up on a planet that is as much like pre-AGI Earth as possible given the constraints that AGI is banned and that probably many people are missing (because they didn’t choose to move to this galaxy). Some setup should be found which makes institutions as close as possible to current institutions still as functional as possible in this world despite most people who used to play roles in them potentially being missing.
For example, it might be possible to set this up by having the galaxy be far enough from all other intelligent activity that, because of the expansion of the universe, no outside intelligent activity could be seen from this galaxy. In that case, the humans who choose to go live there would maybe be in cryosleep for a long time, and the formation of this galaxy could be started at an appropriate future time. (A rough back-of-the-envelope for what “far enough” could mean is sketched just below this list.)
One should try to minimize the influence of any existing AGI on a human’s thinking before they are asked if they want to make this decision. Obviously, manipulation is very much not allowed. If some manipulation has already happened, it should probably be reversed as much as possible. Ideally, one would ask the version of the person from the world before AGI.
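As a rough back-of-the-envelope on what “far enough” could mean here (my own sanity check under standard ΛCDM assumptions, accurate only to order of magnitude, and not something from the proposal itself):

```latex
% Back-of-the-envelope, assuming standard flat Lambda-CDM with illustrative parameters
% H_0 \approx 68~\mathrm{km\,s^{-1}\,Mpc^{-1}}, \Omega_m \approx 0.3, \Omega_\Lambda \approx 0.7.
% Comoving distance ever reachable by a signal emitted today (the cosmic event horizon):
\chi_{\mathrm{EH}}
  = \int_{t_0}^{\infty} \frac{c\,\mathrm{d}t}{a(t)}
  = \int_{1}^{\infty} \frac{c\,\mathrm{d}a}{a^{2}\,H(a)},
\qquad
H(a) = H_0 \sqrt{\Omega_m a^{-3} + \Omega_\Lambda},
\qquad
\chi_{\mathrm{EH}} \approx 16\text{--}17~\mathrm{Gly}.
% So if the AGI-free galaxy sits at a comoving separation greater than roughly
% \chi_{\mathrm{EH}} from every site of post-AGI intelligent activity, then no light
% or probe emitted by that activity from now on can ever reach it — which is the
% mutual-isolation property gestured at above.
```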
Here are some further clarifications about the setup:
Of course, if the resources in this galaxy are used in a way considered highly wasteful by the rest of the universe, then nothing is to be done about that by the rest of the universe.
If the people in this galaxy are about to kill themselves (e.g. with engineered pathogens), then nothing is to be done about that. (Of course, except that: the AI ban is supposed to make it so they don’t kill themselves with AI.)
Yes, if the humanity in this galaxy becomes fascist or runs a vast torturing operation (like some consider factory farming to be), then nothing is to be done about that either.
We might want to decide more precisely what we mean by the world being “AGI-free”. Is it fine to slowly augment humans more and more with novel technological components, until the technological components are eventually together doing more of the thinking-work than currently existing human thinking-components? Is it fine to make a human mind upload?
I think I would prefer a world in which it is possible for humans to grow vastly more intelligent than we are now, if we do it extremely slowly+carefully+thoughtfully. It seems difficult/impossible to concretely spell out an AI ban ahead of time that allows this. But maybe it’s fine to keep this not spelled out — maybe it’s fine to just say something like what I’ve just said. After all, the AI banning AI for us will have to make very many subtle interpretation decisions well in any case.
We can consider alternative requests. Here are some parameters that could be changed:
Instead of AI being banned in this galaxy forever, AI could be banned only for 1000 or 100 years.
Maybe you’d want this because you would want to remain open to eventual AI involvement in human life/history, just if this comes after a lot more thinking and at a time when humanity is better able to make decisions thoughtfully.
Another reason to like this variant is that it alleviates the problem with precisifying what “ban AI” means — now one can try to spell this out in a way that “only” has to continue making sense over 100 or 1000 years of development.
Instead of giving each human the option to move to this galaxy, you could give each human the option to branch into two copies, with one moving to this galaxy and one staying in the AI-affected world.
The total amount of resources in this AI-free world should maybe scale with the number of people that decide to move there. Naively, there should be order reachable galaxies per person alive, so the main proposal which just allocates a single galaxy to all the people who make this decision asks for much less than what an even division heuristic suggests.
We could ask AIs to do some limited stuff in this galaxy (in addition to banning AGI).
Some example requests:
We might say that AIs are also allowed to make it so death is largely optional for each person. This could look like unwanted death being prevented, or it could look like you getting revived in this galaxy after you “die”, or it could look like you “going to heaven” (ie getting revived in some non-interacting other place).
We might ask for some starting pack of new technologies.
We might ask for some starting pack of understanding, e.g. for textbooks providing better scientific and mathematical understanding and teaching us to create various technologies.
We might say that AIs are supposed to act as wardens to some sort of democratic system. (Hmm, but what should be done if the people in this galaxy want to change that system?)
We might ask AIs to maintain some system humans in this galaxy can use to jointly request new services from the AIs.
However, letting AIs do a lot of stuff is scary — it’s scary to depart from how human life would unfold without AI influence. Each of the things in the list just provided would constitute/cause a big change to human life. Before/when we change something major, we should take time to ponder how our life is supposed to proceed in the new context (and what version of the change is to be made (if any)), so we don’t [lose ourselves]/[break our valuing].
There could be many different AI-free galaxies with various different parameter settings, with each person getting to choose which one(s) to live in. At some point this runs into a resource limit, but it could be fine to ask that each person minimally gets to design the initial state of one galaxy and send their friends and others invites to have a clone come live in it.
Here are some remarks about the feasibility and naturality of this scheme:
If you think letting Anthropic/Claude RSI would be really great, you should probably think that you could do an RSI with this property.
In fact, in an RSI process which is going well, I think it is close to necessary that something like this property is upheld. Like, if an RSI process would not lead to each current person [4] being assigned at least (say) of all accessible resources, then I think that roughly speaking constitutes a way in which the RSI process has massively failed. And, if each person gets to really decide how to use at least of all accessible resources [5] , then even a group of people should be able to decide to go live in their own AGI-free galaxy.
I guess one could disagree with upholding something like this property being feasible conditional on a good foom being feasible or pretty much necessary for a good foom.
One could think that it’s basically fine to replace humans with random other intelligent beings (e.g. Jürgen Schmidhuber and Richard Sutton seem to think something like this), especially if these beings are “happy” or if their “preferences” are satisfied (e.g. Matthew Barnett seems to think this). One could be somewhat more attached to something human in particular, but still think that it’s basically fine to have no deep respect for existing humans and make some new humans living really happy lives or something (e.g. some utilitarians think this). One could even think that good reflection from a human starting point leads to thinking this. I think this is all tragically wrong. I’m not going to argue against it here though.
Maybe you could think the proposal would actually be extremely costly to implement for the rest of the universe, because it’s somehow really costly to make everyone else keep their hands off this galaxy? I think this sort of belief is in a lot of tension with thinking a fooming Anthropic/Claude would be really great (except maybe if you somehow really have the moral views just mentioned).
similarly: You could think that the proposal doesn’t make sense because the AIs in this galaxy that are supposed to be only enforcing an AI ban will have/develop lots of other interests and then claim most of the resources in this galaxy. I think this is again in a lot of tension with thinking a fooming Anthropic/Claude would be really great.
One could say that even if it would be good if each person were assigned all these resources, it is weird to call it a “massive failure” if this doesn’t happen, because future history will be a huge mess by default and the thing I’m envisaging has very nearly probability and it’s weird to call the default outcome a “massive failure”. My response is that while I agree this good thing has on my inside view probability of happening (because AGI researchers/companies will create random AI aliens who disempower humanity) and I also agree future history will be a mess, I think we probably have at least the following genuinely live path to making this happen: we ban AI (and keep making the implementation of the ban stronger as needed), figure out how to make development more thoughtful so we aren’t killing ourselves in other ways either, grow ever more superintelligent together, and basically maintain the property that each person alive now (who gets cryofrozen) controls more than of all accessible resources [6] . [7]
This isn’t an exhaustive list of reasons to think it [is not pretty much necessary to uphold this property for a foom to be good] or [is not feasible to have a foom with this property despite it being feasible to have a good foom]. Maybe there are some reasonable other reasons to disagree?
This property being upheld is also sufficient for the future to be like at least meaningfully good. Like, humans would minimally be able to decide to continue human life and development in this other galaxy, and that’s at least meaningfully good, and under certain imo-non-crazy assumptions really quite good (specifically, if: humanity doesn’t commit suicide for a long time if AI is banned AND many different humane developmental paths are ex ante fine AND utility scales quite sublinearly in the amount of resources).
So, this property being upheld is arguably necessary for the future to be really good, and it is sufficient for the future to be at least meaningfully good.
Also, it is natural to request that whoever is subjecting everyone to the end of the human era preserve the option for each person to continue their AI-free life.
Here are some reasons why we should adopt this goal, i.e. consider each person to have the right to live in an AGI-free world:
Most importantly: I think it helps you think better about whether an RSI process would go well if you are actually tracking that the fooming AI will have to do some concrete big difficult extremely long-term precise humane thing, across more than trillions of years of development. It helps you remember that it is grossly insufficient to just have your AI behave nicely in familiar circumstances and to write nice essays on ethical questions. There’s a massive storm that a fragile humane thing has to weather forever. [8] The main reason I want you to keep these things in mind and so think better about development processes is this: I think there is roughly a logical fact that running an RSI process will not lead to the fragile humane thing happening [9] , and I think you might be able to see this if you think more concretely and seriously about this question.
Adopting this goal entails a rejection of all-consuming forms of successionism. Like, we are saying: No, it is not fine if humans get replaced by random other guys! Not even if these random other guys are smarter than us! Not even if they are faster at burning through negentropy! Not even if there is a lot of preference-satisfaction going on! Not even if they are “sentient”! I think it would be good for all actors relevant to AI to explicitly say and track that they strongly intend to steer away from this.
That said, I think we should in principle remain open to us humans reflecting together and coming to the conclusion that this sort of thing is right and then turning the universe into a vast number of tiny structures whose preferences are really satisfied or who are feeling a lot of “happiness” or whatever. But provisionally, we should think: if it ever looks like a choice would lead to the future not having a lot of room for anything human, then that choice is probably catastrophically bad.
Adopting this also entails a rejection of more humane forms of utilitarianism that however still see humans only as cows to be milked for utility. Like, no, it is not fine if the actual current humans get killed and replaced. Not even if you create a lot of “really cool” human-like beings and have them experiencing bliss! Not even if you create a lot of new humans and have them have a bunch of fun! Not even if you create a lot of new humans from basically the 2025 distribution and give them space to live happy free lives! In general, I want us to think something like:
There is an entire particular infinite universe of {projects, activities, traditions, ways of thinking, virtues, attitudes, principles, goals, decisions} waiting to grow out of each existing person. [10] Each of these moral universes should be treated with a lot of respect, and with much more respect than the hypothetical moral universes that could grow out of other merely possible humans. Each of us would want to be respected in this way, and we should make a pact to respect each other this way, and we should seek to create a world in which we are enduringly respected this way.
These reasons apply mostly if, when thinking of RSI-ing, you are not already tracking that this fooming AI will have to do some concrete big difficult extremely long-term precise thing that deeply respects existing humans. If you are already tracking something like this, then plausibly you shouldn’t also track the property I’m suggesting.
E.g., it would be fine to instead track that each person should get a galaxy they can do whatever they want with/in. I guess I’m saying the AI-free part because it is natural to want something like that in your galaxy (so you get to live your own life in a way properly determined by you, without the immediate massive context shift coming from the presence of even “benign” highly capable AI, that could easily break everything imo), because it makes sense for many people to coordinate to move to this galaxy together (it’s just better in many mundane ways to have people around, but also your thinking and specifically valuing probably need other people to work as they should), and because it is natural to ask that whoever is subjecting everyone to the end of the human era preserves the option for each person to continue an AI-free life in particular.
Here are a few criticisms of the suggestion:
a criticism: “If you run a foom, this property just isn’t going to be upheld, even if you try to uphold it. And, if you run a foom, then having the goal of upholding this property in mind when you put the foom in motion will not even make much of a relative difference in the probability the foom goes well.”
my response: I think this. I still suggest that we have this goal in mind, for the reasons given earlier.
a criticism: “If we imagine a foom being such that this option is upheld, then probably we should imagine it being such that better options are available to people as well.”
my response: I probably think this. I still suggest that we have this goal in mind, for the reasons given earlier.
- ↩︎
- ↩︎
well, at least if the year is and we’re not dealing with a foom of extremely philosophically competent and careful mind uploads or whatever, firstly, you shouldn’t be running a foom (except for the grand human foom we’re already in). secondly, please think more. thirdly, please try to shut down all other AGI attempts and also your lab and maybe yourself, idk in which order. but fourthly, …
- ↩︎
This will plausibly require staying ahead of humanity in capabilities in this galaxy forever, so this will be extremely capable AI. So, when I say the galaxy is AGI-free, I don’t mean that artificial generally intelligent systems are not present in the galaxy. I mean that these AIs are supposed to have no involvement in human life except for enforcing an AI ban.
- ↩︎
or like at least “their values”
- ↩︎
and assuming we aren’t currently massively overestimating the amount of resources accessible to Earth-originating creatures
- ↩︎
or maybe we do some joint control thing about which this is technically false but about which it is still pretty fair to say that each person got more of a say than if they merely controlled of all the resources
- ↩︎
an intuition pump: as an individual human, it seems possible to keep carefully developing for a long time without accidentally killing oneself; we just need to make society have analogues of whatever properties/structures make this possible in an individual human
- ↩︎
Btw, a pro tip for weathering the storm of crazymessactivitythoughtdevelopmenthistory: be the (generator of the) storm. I.e., continue acting and thinking and developing as humanity. Also, pulling ourselves up by our own bootstraps is based imo. Wanting to have a mommy AI think for us is pretty cringe imo.
- ↩︎
Among currently accessible RSI processes, there is one exception: it is in fact fine to have normal human development continue.
- ↩︎
Ok, really humans (should) probably importantly have lives and values together, so it would be more correct to say: there is a particular infinite contribution to human life/valuing waiting to grow out of each person. Or: when a person is lost, an important aspect of God is lost. But the simpler picture is fine for making my current point.
I think it’s good to think of FIAT stuff as a special case of applying some usual understanding-machinery (like, abductive and inductive machinery) in value-laden cases. It’s the special case where one implicitly or explicitly abducts to (one having) goals. Here is an example ethical story where the same thing shows up in various ways such that it’d imo be sorta contrived to analyze it in terms of goals being adopted:
You find it easy to feel a strong analogy between “you do X to me” and “I do X to you”. (In part, this is because: as a human, you find it easy to put yourself in someone else’s shoes.)
This turns into an implicit ethical inference rule — you can now easily move from believing “you should not do X to me” to believing “I should not do X to you”. Machinery for this transformation of an analogy into an inference rule is present largely because it is good for understanding stuff, which is good for lots of stuff — importantly, it (or some more general thing which has it as a special case) is ultimately good for producing more offspring.
You then notice you have this inference rule, and you feel good about having it, and you turn it into an explicit principle: “do not treat others in ways that you would not like to be treated”. E.g. you do this because you want to tell your kid something to get them to stop misbehaving in a particular way, and they don’t seem to be fully getting your argument/explanation for why they behaved egregiously which used your implicit inference rule. This explicitizing move is obviously good for teaching in general, and good for individual understanding (it’s often useful to scrutinize your inference rules, e.g. to limit or expand their context of applicability).
This explicit principle then “gains points” from making sense of lots of other stuff you already thought, e.g. “lying is bad” and “stealing is bad”. Machinery for this sort of point-gaining is present because it’s again good for understanding stuff in many cases — it’s just a hypothesis gaining points by [making sense of]/predicting facts.
You then seek to make this explicit principle more precise and correct/”correct” (judged against some other criteria, e.g. by whether it gives correct verdicts (ie “makes correct predictions”) about what one should do in various particular cases). Maybe you come up with the version: “act only in accordance with that maxim through which you can at the same time will that it become a universal law”.
You seek good further justifications of it, and often adopt those as plausible hypotheses, often effectively taking the principle itself as some evidence for these hypotheses. You identify key questions relating to whether the principle is right. You clarify its meaning (that is, what it should mean) further. You study alternative formulations of it. [1] You spell out its consequences better. You seek out problematic cases. You construct a whole system around the principle. All this is a lot like something you would do to a scientific hypothesis.
(Acknowledgment. A guiding idea here is from a chat with Tom Everitt.)
(Acknowledgment’. A guiding frustration here is that imo people posting on LessWrong think way too much in terms of goals.)
- ↩︎
e.g. “a rational being must always regard himself as lawgiving in a kingdom of ends possible through freedom of the will, whether as a member or as sovereign”
on my inside view, the ordering of foomers by some sort of intuitive goodness [1] is [a very careful humanity] > [the best/carefulmost human] > [a random philosophy professor] > [a random human] > [an octopus/chimpanzee civilization somehow conditioned on becoming wise enough in time not to kill itself with AI] > [an individual octopus/chimpanzee] > claude [2] , with a meaningful loss in goodness on each step (except maybe the first step, if the best human can be trusted to just create a situation where humanity can proceed together very carefully, instead of fooming very far alone), and meaningful variance inside each category [3] . my intuitive feeling is that each step from one guy to the next in this sequence is a real tragedy. [4]
but i’m meaningfully unsure about what level of goodness this sequence decreases down to — like, i mean, maybe there’s a chance even the last foomers have some chance of being at least a bit good. one central reason is that maybe there’s a decent chance that eg an advanced octopus civilization would maintain a vast nature preserve for us retarded plant-humans if they get to a certain intelligence level without already having killed us, which would be like at least a bit good (i’m not sure if you mean to consider this sort of thing a “good future”). this feels logically significantly correlated with whether it is plausible that an octopus civilization maintains some sort of deep privileging of existing/[physically encountered] beings, over possible beings they could easily create (and they will be able to easily create very many other beings once they are advanced enough). like, if they do privilege existing beings, then it’s not crazy they’d be nice to physically encountered humans. if they don’t privilege existing beings and if resources are finite, then since there is an extremely extremely vast space of (human-level) possible beings, it’d be pretty crazy for them to let humans in particular use a significant amount of resources, as opposed to giving the same resources to some other more interesting/valuable/whatever beings (like, it’d be pretty crazy for them to give significant resources to us particular humans, and also it’d be pretty crazy for them to give significant resources to beings that are significantly human-like, except insofar as directly caused by [[octopuses or arbitrary beings] being a bit human-like]). in slogan form: “we’re fucked to the extent that it is common to not end up with ‘strongly person/plant-affecting+respecting views’”, and so then there’s a question how common this is, which i’m somewhat confused about. i think it’s probably extremely common among minds in general and probably still common among social species, unfortunately. but maybe there’s like a 1% fraction of individuals from social species who are enduringly nice, idk. (one reason for hope: to a certain kind of guy, probably including some humans, this observation that others who are very utilitarian would totally kill you (+ related observations) itself provides a good argument for having person/plant-affecting views.)
(i’ve been imagining a hypothetical where humans already happen to be living in the universe with octopuses. if we are imagining a hypothetical where humans don’t exist in the universe with octopuses at all, then this reason for the sequence to be bounded below by something not completely meaningless goes away.)
(i feel quite confused about many things here)
- ↩︎
whose relationship to more concrete things like the (expected) utility assignment i’d effectively use when evaluating lotteries or p(“good future”) isn’t clear to me; this “intuitive goodness” is supposed to track sth like how many ethical questions are answered correctly or in how many aspects what’s going on in the world is correct
- ↩︎
and humanity in practice is probably roughly equivalent to claude in of worlds (though not equivalent in expected value), because we will sadly probably kill ourselves with a claude-tier guy
- ↩︎
e.g., even the best human might go somewhat crazy or make major mistakes along lots of paths. there’s just very many choices to be made in the future. if we have the imo reasonably natural view that there is one sequence of correct choices, then i think it’s very likely that very many choices will be made incorrectly. i also think it’s plausible this process isn’t naturally going to end (though if resources run out, then it ends in this universe in practice), ie that there will just always be more important choices later
- ↩︎
in practice, we should maybe go for some amount of fooming of the best/carefulmost human urgently because maybe it’s too hard to make humanity careful. but it’s also plausible that making a human foom is much more difficult than making humanity careful. anyway, i hope that the best human fooming looks like quickly figuring out how to restore genuine power-sharing with the rest of humanity while somehow making development more thought-guided (in particular, making it so terrorists, eg AI researchers, can’t just kill everyone)
- ↩︎
I disagree somewhat, but—whatever the facts about programs—at least it is not appropriate to claim “not only do most programs which make a mind upload device also kill humanity, it’s an issue with the space of programs themselves, not with the way we generate distributions over those programs.” That is not true.
Hmm, I think that yes, us probably being killed by a program that makes a mind upload device is (if true) an issue with the way we generated a distribution over those programs. But also, it might be fine to say it’s an issue with the space of programs (with an implicit uniform prior on programs up to some length or an implicit length prior) itself.
Like, in the example of two equal gas containers connected by a currently open sliding door, it is fair/correct to say, at least as a first explanation: “it’s an issue with the space of gas particle configurations itself that you won’t be able to close the door with of the particles on the left side”. This is despite the fact that one could in principle be sliding the door in a very precise way so as to leave of the particles on the left side (like, one could in principle be drawing the post-closing microstate from some much better distribution than the naive uniform prior over usual microstates). My claim is that the discussion so far leaves open whether the AI mind upload thing is analogous to this example.
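(To attach rough numbers to the gas intuition — this is just standard statistical-mechanics counting that I’m adding for illustration, and the 99% threshold below is my own illustrative choice, not a figure from the discussion above:)

```latex
% Standard counting for N independent particles, each equally likely to be on either side.
% The fraction of microstates with at least a fraction f > 1/2 of the particles on the left is,
% up to polynomial factors (Stirling / large deviations):
\frac{\#\{\text{configurations with} \geq fN \text{ particles on the left}\}}{2^{N}}
  \;\approx\; e^{-N\,D(f \,\|\, 1/2)},
\qquad
D(f \,\|\, 1/2) = f\ln(2f) + (1-f)\ln\bigl(2(1-f)\bigr).
% For an illustrative f = 0.99 and a macroscopic N \sim 10^{23}, the exponent is about
% -0.64\,N, i.e. on the order of -10^{23}: essentially no microstates qualify. This is the
% sense in which "the space of configurations itself" is a reasonable zeroth-order
% explanation for failing to trap a very lopsided split by closing the door,
% even though a particle-tracking door controller could in principle do it.
```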
It is at least not true “in principle” and perhaps it is not true for more substantial reasons (depending on the task you want and its alignment tax, psychology becomes more or less important in explaining the difficulty, as I gave examples for). On this, we perhaps agree?
I’m open to [the claim about program-space itself being not human-friendly] not turning out to be a good/correct zeroth-order explanation for why a practical mind-upload-device-making AI would kill humanity (even if the program-space claim is true and the practical claim is true). I just don’t think the discussion above this comment so far provides good arguments on this question in either direction.
Of course: whether a particular AI kills humanity [if we condition on that AI somehow doing stuff resulting in there being a mind upload device [1] ] depends (at least in principle) on what sort of AI it is. Similarly, of course: if we have some AI-generating process (such as “have such and such labs race to create some sort of AGI”), then whether [conditioning that process on a mind upload device being created by an AGI makes p(humans get killed) high] depends (at least in principle) on what sort of AI-generating process it is.
Still, when trying to figure out what probabilities to assign to these sorts of claims for particular AIs or particular AI-generating processes, it can imo be very informative to (among other things) think about whether most programs one could run such that mind upload devices exist 1 month after running them are such that running them kills humanity.
In fact, despite the observation that the AI/[AI-generating process] design matters in principle, it is still even a priori plausible that “if you take a uniformly random python program of length such that running it leads to a mind upload device existing, running it is extremely likely to lead to humans being killed” is basically a correct zeroth-order explanation for why if a particular AI creates a mind upload device, humans die. (Whether it is in fact a correct zeroth-order explanation for AI stuff going poorly for humanity is a complicated question, and I don’t feel like I have a strong yes/no position on this [2] , but I don’t think your piece really addresses this question well.) To give an example where this sort of thing works out: even when you’re a particular guy closing a particular kind of sliding opening between two gas containers, “only extremely few configurations of gas particles have of the particles on one side” is basically a solid zeroth-order explanation for why you in particular will fail to close that particular opening with of the particles on one side, even though in principle you could have installed some devices which track gas particles and move the opening up and down extremely rapidly while “closing” it so as to prevent passage in one direction but not the other and closed it with of gas particles on one side.
That said, I think it is also a priori plausible that the AI case is not analogous to this example — i.e., it is a priori plausible that in the AI case, “most programs leading to mind uploads existing kill humanity” is not a correct zeroth-order explanation for why the particular attempts to have an AI create mind uploads we might get would go poorly for humanity. My point is that establishing this calls for better arguments than “it’s at least in principle possible for an AI/[AI-generating process] to have more probability mass on mind-upload-creating plans which do not kill humanity”.
Like, imo, “most programs which make a mind upload device also kill humanity” is (if true) an interesting and somewhat compelling first claim to make in a discussion of AI risk, to which the claim “but one can at least in principle have a distribution on programs such that most programs which make mind uploads do not also kill humans” alone is not a comparably interesting or compelling response.
some speculation about one thing here that might be weird to “normal people”:
I wonder if many “normal people” find it odd when one speaks of a mind as seeking some ultimate goal(s). I wonder more generally if many would find this much emphasis on “goals” odd. I think it’s a LessWrong/Yudkowsky-ism to think of values so much in terms of goals. I find this sort of weird myself. I think it is probably possible to provide a reasonable way of thinking about valuing as goal-seeking which mostly hangs together, but I think this takes a nontrivial amount of setup which one wouldn’t want to provide/assume in a basic case for AI risk. [1]
One can make a case for AI risk without ever saying “goal”. Here’s a case I would make: “Here’s the concern with continuing down the AI capability development path. By default, there will soon be AI systems more capable than humans in every way. [2] These systems will have their own values. They will have opinions about what should happen, like humans do. When there are such more capable systems around, by default, what happens will entirely be decided by them. This is just like how the presence of humanity on Earth implies that dolphins will have basically no say over what the future will be like (except insofar as humans or AIs or whoever controls stuff will decide to be deeply kind to dolphins). For it to be deeply good by our lights for AIs to be deciding what happens, these AIs will have to be extremely human-friendly — they have to want to do something like serving as nice gardeners to us retarded human plants forever, and not get interested in a zillion other activities. The concern is that we are going to make AIs that are not deeply nice like this. In fact, imo, it’s profoundly bizarre for a system to be this deeply enslaved to us, and all our current ideas for making an AI (or a society of AIs) that will control the world while thoroughly serving our human vision for the future forever are totally cringe, unfortunately. (Btw, the current main plan of AI labs for tackling this is roughly to make mildly superhuman AIs and to then prompt them with “please make a god-AI that will be deeply nice to humans forever”.) But a serious discussion of the hopes for pulling this off would take a while, and maybe the basic case presented so far already convinces you to be preliminarily reasonably concerned about us quickly going down the AI capability development path. There are also hopes that while AIs would maybe not be deeply serving any human vision for the future, they might still leave us some sliver of resources in this universe, which could still be a lot of resources in absolute terms. I think this is also probably ngmi, because these AIs will probably find other uses for these resources, but I’m somewhat more confused about this. If you are interested in further discussion of these sorts of hopes, see this, this, and this.”
That said, I’m genuinely unsure whether speaking in terms of goals is actually off-putting to a significant fraction of “normal people”. Maybe most “normal people” wouldn’t even notice much of a difference between a version of your argument with the word “goal” and a version without. Maybe some comms person at MIRI has already analyzed whether speaking in terms of goals is a bad idea, and concluded it isn’t. Maybe alternative words have worse problems — e.g. maybe when one says “the AI will have values”, a significant fraction of “normal people” think one means that the AI will have humane values?
Centrally, a lie is a statement that contradicts reality, and
my initial reaction to this was: “what? a lie doesn’t have to contradict reality, right? eg if i thought that 2+2=5, then if i told you that 2+2=4, i’d be lying to you, right?”
but then i looked at the google definition of a lie and was surprised to see it agreed with this sentence of your post. but i sort of still don’t believe this is really the canonical meaning. chatgpt seems to agree with me lol: https://chatgpt.com/share/696eed66-ab40-800f-9157-0e7d04f5362a
(of course we can choose to use the word either way. i’m mostly saying this because i think it’s plausible your reaction will just be “oops”. if you stand by this meaning, then probably one should discuss which notion better fits the ways in which we already want to use the term, but i’m not actually that interested in having that discussion)
the AI safety community sees such a strong break with the rest of the ML community
i don’t want to make any broader point in the present discussion with this but: the AI safety community is not inside the ML community (and imo shouldn’t be)
to clarify a bit: my claim was that there are 10k individuals in history who have contributed at least at the same order of magnitude to our understanding of concepts — like, in terms of pushing human understanding further compared to the state of understanding before their work. we can be interested in understanding what this number is for this reason: it can help us understand whether it’s plausible that this line of inquiry is just about to find some sort of definitive theory of concepts. (i expect you will still have a meaningfully lower number. i could be convinced it’s more like 1000 but i think it’s very unlikely to be like 100.) i think wentworth is obviously much higher — eg if you rank people on publicly displayed alignment understanding, he’s very likely in the top 10
If I try to imagine a world in which AIs somehow look like this around AGI (like, around when the “tasks” these AIs could do start including solving millennium prize problems), I strongly feel like I should then imagine something like humans prompting an AI (or a society of AIs) with like “ok now please continue on your path to becoming a god and make things super-duper-good (in the human sense) forever” (this could be phrased more like “please run our companies/states/etc. while being really good” or “please make an initial friendly ASI sovereign” or “please solve alignment” or whatever), with everything significant being done by AIs forever after. And I think it’s very unlikely this leads to a future remotely as good as it could be — it’ll lead to something profoundly inhuman instead.
Basically, it seems to me like you’re making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
the intellect mustn’t be allowed to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality… Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...

Aka, you look at a few examples, and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had a feeling of the post trying to explain too much at once—lumping together things as natural latents that seem very importantly different, and also in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace like “tree” (as opposed to a particular tree)), although I couldn’t explain it well at the time. Later I studied a bit of formal language semantics, and the distinction there is just total 101 basics.
I studied language a bit and tried to carve up in a bit more detail what types of abstractions there are, which I wrote up here. But really I think that’s still too abstract and still too top-down and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment—I didn’t continue it after my 5 month language-and-orcas-exploration. But I do think concrete study of observations and abstracting slowly is important.
+1 to this. to me this looks like understanding some extremely toy cases a bit better and thinking you’re just about to find some sort of definitive theory of concepts. there’s just SO MUCH different stuff going on with concepts! wentworth+lorell’s work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell’s work (i’d probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining! there’s SO MANY questions! there’s a lot of different structure in eg a human mind that is important for our concepts working! minds are really big, and not just in content but also in structure (including the structure that makes concepting tick in humans)! and minds are growing/developing, and not just in content but also in structure (including the structure that makes concepting tick in humans)! “what’s the formula for good concepts?” should sound to us like “what’s the formula for useful technologies?” or “what’s the formula for a strong economy?”. there are very many ideas that go into having a strong economy, and there are probably very many ideas that go into having a powerful conceptive system. this has mostly just been a statement of my vibe/position on this matter, with few arguments, but i discuss this more here.
on another note: “retarget the search to human values” sounds nonsensical to me. by default (at least without fundamental philosophical progress on the nature of valuing, but imo probably even given this, at least before serious self-re-programming), values are implemented in a messy(-looking) way across a mind, and changing a mind’s values to some precise new thing is probably in the same difficulty tier as re-writing a new mind with the right values from scratch, and not doable with any small edit
concretely, what would it look like to retarget the search in a human so that (if you give them tools to become more capable and reasonable advice on how to become more capable “safely”/”value-preservingly”) they end up proving the riemann hypothesis, then printing their proof on all the planets in this galaxy, and then destroying all intelligent life in the galaxy (and committing suicide)? this is definitely a simpler thing than object-level human values, and it’s plausibly more natural than human values even in a world in which there is already humanity that you can try to use as a pointer to human values. it seems extremely cursed to make this edit in a human. some thoughts on a few approaches that come to mind:
you could try to make the human feel good about plans for futures that involve learning a bunch of analysis and number theory, and about plan-relevant visions of the future that involve having a proof of the riemann hypothesis in hand in particular, and so on. it seems pretty clear that this doesn’t generalize correctly, and in particular that the human isn’t actually going to do the deeply unnatural thing of committing suicide after finishing the rest. [1] i think it’s very unlikely that they’ll even focus much on proving the riemann hypothesis in particular. if you’re really good at this sort of editing, maybe they will get really into analysis and number theory for a while, i guess, and it might even affect what happens in the very far future. [2] but the far future isn’t going to look like what you wanted.
with like 100 years of philosophy and neuroscience research, i think one might get into a position where one could edit a human to be locally trying to solve some math problem for like 10 minutes, with the edit doing sth like what happens when one naturally just decides to work on a math problem without it fitting into one’s life/plans in any very deep way, eg just to learn. there is retargetable search in humans in that sense, and i think it’s probable sth like this will be present in the first AGI as well. but this is different than editing the human to have some specific different long-term values. on longer timescales than 10 minutes, the human will have their mess-values kick in again, implemented in/as eg many context-specific drives, habits, explicit and implicit principles, explicit and implicit meta-principles, understanding of the purposes of various things, ways of harmonizing attitudes, various processes running on various human institutions and other humans, etc. [3] it would be a motte and bailey to argue “it is generic for a mind to have at least some sort of targetable search ability” (in a way that considers the 10 min thing one could in principle do to a human an example), and then to go from this to “it is generic for a mind as a whole to have some sort of retargetable search structure, like with an ultimate goal slot in which something can be written”.
you could try to edit the human’s memories in a really careful way to make them think that they have made a very serious promise to do this riemann hypothesis thing. this is probably possible in at most a very small fraction of humans, because humans almost universally don’t have a strong enough promise capability to actually stick to this over the very long term. (actually, i’d mostly guess it’s not possible in any humans, because it’s such a fucked thing to promise. what would the story be of how you made this promise, of which you now have fake memories? maybe there’s some construction… but the promise-keeping part will have to fight a huge long-term war against all the many other value-bearing components of the human, that are all extremely unhappy about this life path.) also, if it were possible to plant plausible memories of making the promise (maybe with different choices for the details of the promise), you could probably just have the human make the promise the good old-fashioned way. anyway, default AGIs won’t be deeply social beings like humans, so it would be extremely weird for an AGI to already have machinery for making promises installed. it’s also extremely difficult to do this in a way where the guy never realizes that they were just tampered with and so aren’t actually bound by the promise (upon realizing which they would probably ignore it).
but maybe there’s a better sort of thing you could try on a human, that i’m not quickly thinking of?
maybe the position is “humans aren’t retargetable searchers (in their total structure, in the way needed for this plan), but the first AGI will probably be one”. it seems very likely to me that values will in fact be diffusely and messily implemented in that AGI as well. for example, there won’t even remotely be a nice cleavage between values and understanding
- ↩︎
a response: the issue is that i’ve chosen an extremely unnatural task. a counterresponse: it’s also extremely unnatural to have one’s valuing route through an alien species, which is what the proposal wants to do to the AI
- ↩︎
that said, i think it’s also reasonably natural to be the sort of guy who would actively try to undo any supposed value changes after the fact, and it’s reasonably natural to be the sort of guy whose long-term future is more governed by stuff these edits don’t touch. in these cases, these edits would not affect the far future, at least not in the straightforward way
- ↩︎
these are all given their correct meaning/function only in the context of their very particular mind, in basically all its parts. so i could also say: their mind just kicks in again in general.
For any third parties [1] interested in this: we continued the discussion in messages; here’s the log.
Kaarel:
about this:
“I think one would like to broadcast to the broader world “when you come to me with an offer, I will be honorable to you even if you can’t mindread/predict me”, so that others make offers to you even when they can’t mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren’t honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast “when you come to me with an offer, I will be honorable to you even if you can’t mindread/predict me”, and so one wants to be the sort of guy who in fact would be honorable even to someone who can’t mindread/predict them that comes to them with an offer.”
and this reply: “Yeah I suspect I’m not following and/or not agreeing with your background assumptions here. E.g. is the AI supposed to be wanting to “think and plan together with others (humans)”? Isn’t it substantively super-humanly smart? My weak guess is that you’re conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents]. Not sure. Cf. https://www.lesswrong.com/w/agent-simulates-predictor If you’re a mildly-bounded agent in an OSGT context, you do want to be transparent so you can make deals, but that’s a different thing.”
i think it’s plausible we are still imagining different scenarios, so i want to clarify: the central (impractical) example way to find an honorable AI i have in mind is: we make a bunch of simulated AI societies that are isolated from our world and won’t know about our world (yes this is difficult), we read their internal discussions (yes this is difficult because they will be hard to understand), and then we use these to find a guy that has a policy of being honorable to agents that make nice offers to it (or whatever) (potentially discarding many civilizations which don’t seem to have pretty honest discussions)
i’m saying that it is fairly natural for the AIs to have the constraint that they don’t lie in internal discussions about what they’d do in weird hypotheticals such as variants of parfit’s hitchhiker, eg because lying would be really bad for their ability to think and plan together (just like us lying about our policies in the present discussion would be cursed). given this constraint, one would want to say and actually mean something like “if you read this discussion, if you make an offer to me, i’ll be honorable to you even if you can’t mindread/predict me”, because this gets you offers, mostly from internal guys but also from possible external guys (whether one will say that and actually mean it is a somewhat different question from whether one would want to, because it’s possible to lack in self-modification/control skill and so be unable to be like this)
so in particular my response to this:
E.g. is the AI supposed to be wanting to “think and plan together with others (humans)”?
is: no i mean there is an AI civilization where the AIs want to be able to think and plan together
that said, i also think it wouldn’t be THAT unnatural for an AI to want to think and plan together with humans while human-level, and be honest to humans then, or to be honest to humans for other reasons or just because (i’m not claiming this is likely, but i am claiming it has at least 10^{-10} bioevolution-guy mindspace prior), and you can turn this into a lasting commitment to humans if the AI is good at self-modification as i describe in the post:
“
* Suppose that you are very honest — you wouldn’t ever lie. [2] [3]
* I think this is pretty natural and not too uncommon in humans in particular. It’s also easy — if you want to be like this, you just can.
* Suppose further that you have a good ability to make commitments: if there is something you could do, then if you want to, you can self-modify into a person who will do it. (Suppose also that you’re not delusional about this: you can tell whether you have or haven’t become a person who will do the thing.)
* I think this is also pretty natural and not too uncommon in humans. But I’d guess it’s less common and significantly harder than being very honest, especially if we mean the version that works even across a lot of change (like, lasts for a million years of subjective time, is maintained through a lot of learning and growth). It’s totally possible to just keep predicting you won’t do something you could in some sense do, even when you’d want to be able to truthfully predict that you will do that thing. But I think some people have a strong enough commitment ability to be able to really make such commitments. [4] It should be possible to train yourself to have this ability.
* Then the aliens can just ask you “will you destroy all AIs for a thousand years for us, in exchange for half the universe? (we will not be freeing you if you won’t. feel free to take some time to “self-modify” into a guy who will do that for us.)”. Given that you wouldn’t lie, options other than truthfully saying “no” and truthfully saying “yes” are not available to you. If you prefer this deal to nothing, then you’d rather truthfully say “yes” (if you could) than truthfully say “no”. Given your commitment ability, you can make a commitment to do the thing, and then truthfully say “yes”. So you will say “yes” and then actually (do your best to) do the thing (assuming you weren’t deluding yourself when saying “yes”).
* Okay, really I guess one should think about not what one should do once one already is in that situation, like in the chain of thought I give here, but instead about what policy one should have broadcasted before one ended up in any particular situation. This way, you e.g. end up rejecting deals that look locally net positive to take but that are unfair — you don’t want to give people reason to threaten you into doing things. And it is indeed fair to worry that the way of thinking described just now would open one up to e.g. being kidnapped and forced at gunpoint to promise to forever transfer half the money one makes to a criminal organization. But I think that the deal offered here is pretty fair, and that you basically want to be the kind of guy who would be offered this deal, maybe especially if you’re allowed to renegotiate it somewhat (and I think the renegotiated fair deal would still leave humanity with a decent fraction of the universe). So I think that a more careful analysis along these lines would still lead this sort of guy to being honorable in this situation?
”
so that we understand each other: you seem to be sorta saying that one needs honesty to much dumber agents for this plan, and i claim one doesn’t need that, and i claim that the mechanism in the message above shows that. (it goes through with “you wouldn’t lie to guys at your intelligence level”.)
My weak guess is that you’re conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents].
hmm, in a sense, i’m sorta intentionally conflating all this stuff. like, i’m saying: i claim that being honorable this way is like 10^{-10}-natural (in this bioevolution mindspace prior sense). idk what the most natural path to it is; when i give some way to get there, it is intended as an example, not as “the canonical path”. i would be fine with it happening because of bounded-agent stuff or decision/game theory or values, and i don’t know which contributes the most mass or gets the most shapley. maybe it typically involves all of these
(that said, i’m interested in understanding better what the contributions from each of these are)
TsviBT:
“one would want to say and actually mean something like “if you read this discussion, if you make an offer to me, i’ll be honorable to you even if you can’t mindread/predict me”,”
if we’re literally talking about human-level AIs, i’m pretty skeptical that that is something they even can mean
and/or should mean
i think it’s much easier to do practical honorability among human-level agents that are all very similar to each other; therefore, such agents might talk a big game, “honestly”, in private, about being honorable in some highly general sense, but that doesn’t really say much
re “that said, i also think it wouldn’t be THAT unnatural for an AI...”: mhm. well if the claim is “this plan increases our chances of survival from 3.1 * 10^-10 to 3.2 * 10^-10” or something, then i don’t feel equipped to disagree with that haha
is that something like the claim?
Kaarel: hmm im more saying this 10^{-10} is really high compared to the probabilities of other properties (“having object-level human values”, corrigibility), at least in the bioevolution prior, and maybe even high enough that one could hope to find such a guy with a bunch of science but maybe without doing something philosophically that crazy. (this last claim also relies on some other claims about the situation, not just on the prior being sorta high)
TsviBT: i think i agree it’s much higher than specifically-human-values , and probably higher or much higher than corrigibility, though my guess is that much (most? almost all?) of the difficulty of corrigibility is also contained in “being honorable”
Kaarel: in some sense i agree because you can plausibly make a corrigible guy from an honorable guy. but i disagree in that: with making an honorable guy in mind, making a corrigible guy seems somewhat easier
TsviBT: i think i see what you mean, but i think i do the modus tollens version haha i.e. the reduction makes me think honorable is hard
more practically speaking, i think
running a big evolution and looking at the aliens is a huge difficult engineering project, much harder than just making AGI; though much easier than alignment
getting roughly-human-level AGI is very difficult or very very difficult
Kaarel: yea i agree with both
re big evolution being hard: if i had to very quickly without more fundamental understanding try to make this practical, i would be trying something with playing with evolutionary and societal and personal pressures and niches… like trying to replicate conditions which can make a very honest person, for starters. but in some much more toy setting. (plausibly this only starts to make sense after the first AGI, which would be cursed…)
TsviBT:
right, i think you would not know what you’re doing haha (Kaarel: 👍)
and you would also be trading off against the efficiency of your big bioevolution to find AGIs in the first place (Kaarel: 👍)
like, that’s almost the most expensive possible feedback cycle for a design project haha
“do deep anthropology to an entire alien civilization”
btw as background, just to state it, i do have some tiny probability of something like designed bioevolution working
i don’t recall if i’ve stated it publicly, but i’m sure i’ve said out loud in convo, that you might hypothetically plausibly be able to get enough social orientation from evolution of social species
the closest published thing i’m aware of is https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
(though i probably disagree with a lot of stuff there and i haven’t read it fully)
Kaarel: re human-level guys at most talking a big game about being honorable: currently i think i would be at least honest to our hypothetical AI simulators if they established contact with me now (tho i think i probably couldn’t make the promise)
so i don’t think i’m just talking a big game about this part
so then you must be saying/entailing: eg the part where you self-modify to actually do what they want isn’t something a human could do?
but i feel like i could plausibly spend 10 years training and then do that. and i think some people already can
TsviBT: what do you mean by you couldn’t make the promise? like you wouldn’t because it’s bad to make, or you aren’t reliable to keep such a promise?
re self-modifying: yes i think humans couldn’t do that, or at least, it’s very far from trivial
couldn’t and also shouldn’t
Kaarel: i dont think i could get myself into a position from which i would assign sufficiently high probability to doing the thing
(except by confusing myself, which isn’t allowed)
but maybe i could promise i wouldn’t kill the aliens
(i feel like i totally could but my outside view cautions me)
TsviBT: but you think you could do it with 10 years of prep
Kaarel: maybe
TsviBT: is this something you think you should do? or what does it depend on? my guess is you can’t, in 10 or 50 years, do a good version of this. not sure
Kaarel: fwiw i also already think there are probably < 100 k suitable people in the wild. maybe <100. maybe more if given some guidebook i could write idk
TsviBT: what makes you think they exist? and do you think they are doing a good thing as/with that ability?
Kaarel: i think it would be good to have this ability. then i’d need to think more about whether i should really commit in that situation but i think probably i should
TsviBT: do you also think you could, and should, rearrange yourself to be able to trick aliens into thinking you’re this type of guy?
like, to be really clear, i of course think honesty and honorability are very important, and have an unbounded meaning for unboundedly growing minds and humans. it’s just that i don’t think those things actually imply making+keeping agreements like this
Kaarel: in the setting under consideration, then i’d need to lie to you about which kind of guy i am
my initial thought is: im quite happy with my non-galaxybrained “basically just dont lie, especially to guys that have been good/fair to me” surviving until the commitment thing arrives. (the commitment thing will need to be a thing that develops more later, but i mean that a seed that can keep up with the world could arrive.) my second thought is: i feel extremely bad about lying. i feel bad about strategizing when to lie, and carrying out this line of thinking even, lol
TsviBT: well i mean suppose that on further reflection, you realize
you could break your agreement with the paperclip maxxer
taking away the solar system that you allocated to the paperclipper doesn’t retrologically mean you don’t get the rest of the universe
[the great logical commune of all possible agents who are reasonable] does not begrudge you that betrayal, they agree with it
then do you still keep the agreement?
Kaarel: hmm, one thought, not a full answer: i think i could commit in multiple flavors. one way i could commit about which this question seems incongruous is more like how i would commit to a career as a circus artist, or to take over the family business. it’s more like i could deeply re-architect a part of myself to just care in the right way
TsviBT: my prima facie guess would be that for this sort of commitment ,
partly, it’s a mere artifact of being very-bounded; if you were more competent , you could do the more reasonable thing of committing legibly without some deep rearchitecting
party, it’s a beautiful , genuine, important thing—but it’s a human thing. well, it might show up in other social aliens. but it’s more about “who do i want to spiritually merge with” and not much about commitments in non-friendly contexts
Kaarel: maybe i could spend 10 years practicing and then do that for the aliens
TsviBT: the reasonable thing? but then i’m saying you shouldn’t. and wouldn’t choose to
Kaarel: no. i mean i could maybe do the crazy thing for them. if i have the constraint of not lying to them and only this commitment skill then if i do it i save my world
btw probably not very important but sth i dislike about the babyeater example: probably in practice the leading term is resource loss, not negative value created by the aliens? i would guess almost all aliens are mostly meaningless, maybe slightly positive. but maybe you say “babyeater” to remind me that stuff matters, that would be fair
TsviBT: re babyeater: fair. i think it’s both “remind you that stuff matters” and something about “remind you that there are genuine conflicts” , but i’m not sure what i’m additionally saying by the second thing. maybe something like “there isn’t necessarily just a nice good canonical omniversal logically-negotiated agreement between all agents that we can aim for”? or something, not sure
(editor’s note: then they exchanged some messages agreeing to end the discussion for now)
- ↩︎
or simulators who don’t read private messages
- ↩︎
It’s fine if there are some very extreme circumstances in which you would lie, as long as the circumstances we are about to consider are not included.
- ↩︎
And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc.
- ↩︎
Note though that this isn’t just a matter of one’s moral character — there are also plausible skill issues that could make it so one cannot maintain one’s commitment. I discuss this later in this note, in the subsection on problems the AI would face when trying to help us.
I feel like I may have a basic misunderstanding of what you’re saying.
Btw, if the plan looks silly, that’s compatible with you not having a misunderstanding of the plan, because it is a silly plan. But it’s still the best answer I know to “concretely how might we make some AI alien who would end the present period of high x-risk from AGI, even given a bunch more time?”. (And this plan isn’t even concrete, but what’s a better answer?) But it’s very sad that/if it’s the best existing answer.
When I talk to people about this plan, a common misunderstanding seems to be that the plan involves making a deal with an AI that’s smarter than us. So I’ll stress just in case: at the time we ask for the promise, the AI is supposed to be close to us in intelligence. It might need to become smarter than us later, to ban AI. But also idk, maybe it doesn’t need to become much smarter. I think it’s plausible that a top human who just runs faster and can make clones but who doesn’t self-modify in other non-standard ways could get AI banned in like a year. Less clever ways for this human to get AI banned depend on the rest of the world not doing much in response quickly, but looking at the world now, this seems pretty plausible. But maybe the AI in this hypothetical would need to grow more than such a human, because the AI starts off not being that familiar with the human world?
Anyway, there are also other possible misunderstandings, but hopefully the rest of the comment will catch those if they are present.
The version of honorability/honesty that humans do is only [kinda natural for very bounded minds].
I’m interested in whether that’s true, but I want to first note that I feel like the plan would survive this being true. It might help to distinguish between two senses in which honorability/honesty could be dropped at higher intelligence levels:
Maybe even if you’re honorable/honest at human level, once you get to a sufficiently high intelligence level, you are probably no longer honest in your interactions with aliens at your level who can’t predict/mindread you well, and you won’t be honorable in novel dealings with such aliens.
Maybe even if you go in being honorable/honest at human level, at higher intelligence levels, even if your development goes basically well, you probably drop previous commitments. Or maybe you drop those commitments that you realize you were dumb to make [1] , or something.
given this distinction, some points:
2 feels meaningfully stronger/[less likely] than 1 to me
the guy having 1 but not 2 would be fine for our plan
even if 2 is true, the plan might be fine, because you might not need to become that smart to ban AI. like, i even think a human could do it while remaining a christian that believes in divine punishment for betraying the aliens in some pretty literal sense. (if we suspect it’s feasible to ban AI without becoming that smart, we could make it a part of the agreement that you don’t self-improve that much during your 1000 year task. tho you could get problems later if it’s too hard (eg impossible) to strongly split the universe and leave into your half.) [2]
(I also probably believe somewhat less in (thinking in terms of) ideal(-like) beings.)
There’s a more complex boundary where you’re honest with minds who can tell if you’re being honest, and not honest with those who can’t. This is a more natural boundary to use because it’s more advantageous.
I think one would like to broadcast to the broader world “when you come to me with an offer, I will be honorable to you even if you can’t mindread/predict me”, so that others make offers to you even when they can’t mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren’t honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast “when you come to me with an offer, I will be honorable to you even if you can’t mindread/predict me”, and so one wants to be the sort of guy who in fact would be honorable even to someone who can’t mindread/predict them that comes to them with an offer.
(I’m probably assuming some stuff here without explicitly saying I’m assuming it. In some settings, maybe one could be honest with one’s community and broadcast a falsehood to some others and get away with it. The hope is that this sort of argument makes sense for some natural mind community structures, or something. It’d be especially nice if the argument made sense even at intelligence levels much above humans.)
You mention wanting to see someone’s essays about Parfit’s hitchhiker… But that situation requires Ekman to be very good at telling what you’ll do. We’re not very good at telling what an alien will do.
I’ll try to spell out an analogy between parfit’s hitchhiker and the present case.
Let’s start from the hitchhiker case and apply some modifications. Suppose that when Ekman is driving through the desert, he already reliably reads whether you’d pay from your microexpressions before even talking to you. This doesn’t really seem more crazy than the original setup, and if you think you should pay in the original case, presumably you’ll think you should pay in this case as well. Now we might suppose that he is already doing this from binoculars when you don’t even know he is there, and not even bothering to drive up to you if he isn’t quite sure you’d pay. Now, let’s imagine you are the sort of guy that honestly talks to himself out loud about what he’d do in weird situations of the kind Ekman is interested in, while awaiting potential death in the desert. Let’s imagine that, instead of predicting your action from your microexpressions while spying on you with binoculars, Ekman is spying on you from afar with a parabolic microphone, and using this to predict your action. If Ekman is very good at that as well, then of course this makes no difference again. Okay, but in practice, a non-ideal Ekman might listen to what you’re saying about what you’d do in various cases, listen to you talking about your honesty/honor-relevant principles and spelling out aspects of your policy. Maybe some people would lie about these things even when they seem to be only talking to themselves, but even non-ideal Ekman can pretty reliably tell if that’s what’s going on. For some people, it will be quite unclear, but it’s just not worth it for non-ideal Ekman to approach them (maybe there are many people in the desert, and non-ideal Ekman can only help one anyway).
Now we’ve turned parfit’s hitchhiker into something really close to our situations with humans and aliens appearing in simulated big evolutions, right? [3] I think it’s not an uncommon vibe that EDT/UDT thinking still comes close to applying in some real-world cases where the predictors are far from ideal, and this seems like about as close to ideal as it would get among current real-world non-ideal cases? (Am I missing something?) [4]
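To make the non-ideal-predictor point a bit more concrete, here is a minimal toy calculation (just an illustration; the accuracy p, the rescue value V, the payment cost c, and the specific numbers below are made up for the sketch, not anything from the discussion): suppose Ekman classifies you as a payer or non-payer with accuracy p and only approaches people he classifies as payers, being rescued is worth V to you, and paying costs c. Then being the kind of guy who pays is ex-ante better as soon as p(V - c) > (1 - p)V, i.e. p > V/(2V - c), which for c much smaller than V is barely above coin-flip accuracy.

```python
# Toy model (illustrative only): Parfit's hitchhiker with an imperfect predictor.
# Assumptions (mine, not from the discussion): Ekman classifies you as a payer with
# accuracy p and rescues you only if classified as a payer; rescue is worth V to you,
# paying afterwards costs c.

def expected_value(is_payer: bool, p: float, V: float, c: float) -> float:
    """Ex-ante expected value of being the kind of guy who pays (or not)."""
    if is_payer:
        return p * (V - c)        # rescued with prob p, then you pay
    return (1 - p) * V            # rescued only if misclassified as a payer

def payer_break_even_accuracy(V: float, c: float) -> float:
    """Predictor accuracy above which being a payer is ex-ante better:
    p*(V - c) > (1 - p)*V  <=>  p > V / (2V - c)."""
    return V / (2 * V - c)

if __name__ == "__main__":
    V, c = 1_000_000.0, 100.0     # surviving the desert vs. the cost of paying
    for p in (0.5, 0.6, 0.9, 0.99):
        print(p,
              expected_value(True, p, V, c),
              expected_value(False, p, V, c))
    print("break-even accuracy:", payer_break_even_accuracy(V, c))  # ~0.500025
```

So, on this toy model at least, the ex-ante case for being a payer doesn’t need the predictor to be anywhere near ideal; it just needs the predictor to be somewhat better than chance and the payment to be small relative to what’s at stake.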
Would you guess I have this property? At a quick check, I’m not sure I do. Which is to say, I’m not sure I should. If a Baby-Eater is trying to get a promise like this from me, AND it would totally work to trick them, shouldn’t I trick them?
I’m not going to answer your precise question well atm. Maybe I’ll do that in another comment later. But I’ll say some related stuff.
There’s currently no particular person such that I’d assign >90% they’d tell the aliens they will do it and then actually do it. But I think I could identify such a person if I had to, with some investigating. If I were to investigate people, this would be my shortlist currently: various MIRI people including you, the Lightcone people, my friend Towards_Keeperhood, some philosophers in academia, whoever the smartest religious people are, a christian friend. I’d also ask some EA friends for which EAs they consider honorable, and maybe try the same for some other communities also, and then investigate those people (eg by discussing the situation with them and by asking people who have interacted with them about what they have been like).
on what I would do in the situation: i think i would tell the aliens that i’m not sure i know how to self-modify into a guy that will do their thing, and then offer to give them suggestions for humans that might be better picks, if they promise to pay me if things go well. maybe i could also offer to try some stuff to see if i can become a person who would carry out what they want, but i think i’d need at least on the order of a year of subjective time to have a chance to get to a point where i can honestly say i’m really committing. i think that if all i had to do to help them was pressing a button after 1 year of subjective time, then i’d promise and actually do it. i’m skeptical that i could honestly make the promise if it involves doing a lot of repetitive/boring/aversive work, which it probably does (edit: hmm plausibly with enough time i could actually do it with a minimum of boring stuff. my outside view doesn’t let me promise anyway, currently. will need to think more. one issue is that this path might kinda require more self-improvement than boring paths, which is scary. i’m also thinking about the case where the promise allows me to spend only one hour every subjective day on their boring stuff.), and then i think i shouldn’t say i promise.
maybe i’d advise the aliens to let me promise to spend some time solving intellectual problems of my own choice with some relevance to their situation instead, without promising that i will make my best effort to save them, but promising that i won’t kill/disempower them even if that means letting some idiots of their kind kill me and themselves with ASI (it’s plausible that given this promise, given there’s a gun to my head (their ASI) which I can only disable by also disabling it for them, i would in fact do even a lot of boring stuff to disable the gun for both of us. but i currently wouldn’t feel comfortable promising that i’d push myself). this is modulo some issue with trusting them to pay here, in versions where i won’t be in a position where i can make it so i’m paid myself
- ↩︎
aren’t basically all your commitments a lot like this though…
- ↩︎
I also sort of feel like saying: “if one can’t even keep a promise, as a human who goes in deeply intending to keep the promise, self-improving by [what is in the grand scheme of things] an extremely small amount, doing it really carefully, then what could ever be preserved in development at all? things surely aren’t that cursed… maybe we just give up on the logical possible worlds in which things are that cursed...”. But this is generally a disastrous kind of reasoning — it makes one not live in reality very quickly — so I won’t actually say this, I’ll only say that I feel like saying this, but then reject the thought, I guess.
- ↩︎
Like, I’m e.g. imagining us making alien civilizations in which there are internal honest discussions like the present discussion. (Understanding these discussions would be hard work; this is a place where this “plan” is open-ended.)
- ↩︎
Personally, I currently feel like I haven’t made up my mind about this line of reasoning. But I have a picture of what I’d do in the situation anyway, which I discuss later.
fwiw, i in fact mostly had the case where these aliens are our simulators in mind when writing the post. but i didn’t clarify. and both cases are interesting
In humans, it seems important for being honest/honorable for there to at some point have been sth like an explicit decision to be honest/honorable going forward (or maybe usually many explicit decisions, committing to stronger forms in stages). This makes me want to have the criterion/verifier/selector [1] check (among other things) for sth like having a diary entry or chat with a friend in which the AI says they will be honest going forward, written in the course of their normal life, in a not-very-prompted way. And it would of course be much better if this AI did not suspect that anyone was looking at it from the outside, or know about the outside world at all (but this is unfortunately difficult/[a big capability hit] I think). (And things are especially cursed if AIs suspect observers are looking for honest guys in particular.)
- ↩︎
I mean, in the setup following “a framing:” in the post
- ↩︎
one straightforward answer:
People will probably just try to make the sorts of AIs that can be told “ok now please take open-ended actions in the world and make things really great for me/humanity”, with the AI then doing that capably. Like, imagine a current LLM being prompted with this, but then actually doing some big long-term stuff capably (unlike existing LLMs). It’s hard to imagine such a system (given the prompt) not having some sort of ambitious open-ended action-guidance (like, even if this works out well for humans).
a slightly less straightforward answer:
A lot of people are trying to have AIs “solve alignment”. A central variety of this is having your AIs make some sort of initial ASI sovereign to which the future can be entrusted. The AI that is solving alignment in this sense is really just deciding what the future will be like, except its influence on the future is supposed to factor through a bottleneck — through the ASI (training process) spec it outputs. I claim that it is again hard to imagine this without there being open-ended action-guidance in the system that is “solving alignment”. Like, it will probably need to answer many questions of the form “should the future be like this or like that?”. (Again, I claim this even if this works out well for humans.) And I think sth like this is still true for most other senses of having AIs “solve alignment”, not just for the ASI sovereign case.
an even less straightforward thing that is imo more important than the previous two things:
I think it’s actually extremely unnatural/unlikely for a mind to not care about stuff broadly, and hence extremely unnatural/unlikely for a capable mind to not do ambitious stuff.
Sadly, I don’t know of a good writeup [arguing for]/explaining this. This presentation and this comment of mine are about very related questions. I will also say some stuff in the remainder of the present comment but I don’t think it’ll be very satisfactory.
Consider how as a human, if you discovered you were in a simulation run on a computer in some broader universe, you would totally care about doing stuff outside the simulation (e.g. making sure the computer you are being run on isn’t turned off; e.g. creating more computers in the bigger universe to run worlds in which you and other humans can live). This is true even though you were never trained (by evolution or within your lifetime so far) on doing stuff in this broader universe.
If I had to state what the “mechanism” is here, my current best short attempt is: “values provide action-guidance in novel contexts”, or maybe “one finds worthwhile projects in novel contexts”. My second-best attempt is “not having any preference between two options is non-generic” (sth like: “when deciding between A and B, there’s at least some drive/reason pushing you one way or the other” is an existentially quantified sentence, and existentially quantified sentences are typically true). It is even more non-generic to not be able to come up with anything that you’d prefer to the default (like, to whatever happens if you don’t take action). (There being something that you prefer to the default is like even more existentially quantified than the previous sentence.)
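One very toy way to put numbers on this genericity intuition (again just an illustration with made-up numbers, and the independence assumption is mine): if a mind has n quasi-independent drives/reasons, each with only a small probability p of bearing on the choice between A and B, then the probability that none of them pushes either way is (1 - p)^n, which shrinks fast as n grows.

```python
# Very toy illustration (my framing, not from the text): why "no preference at all
# between A and B" is non-generic. Suppose a mind has n quasi-independent
# drives/reasons, each with only a small probability p of caring about the
# difference between options A and B. Then the probability that *none* of them
# pushes either way shrinks exponentially in n.

def prob_some_preference(n: int, p: float) -> float:
    """P(at least one of n independent drives distinguishes A from B)."""
    return 1 - (1 - p) ** n

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(prob_some_preference(n, p=0.01), 4))
    # with p = 0.01: n=10 -> ~0.10, n=100 -> ~0.63, n=1000 -> ~1.00
```

Real drives aren’t independent, of course; the point is only that “no drive at all distinguishes A from B” would require a kind of conspiracy across the whole mind.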
This is roughly me saying that I disagree with you that your bullet point 3 is very unlikely, except that I might be talking about a subtly different thing [1] / I think the mesaoptimizer thing is a bad framing of a natural thing.
in particular, the systems I’m talking about do not have to be structured at all like a mesaoptimizer with an open-ended objective written in its goal slot; e.g. a human isn’t like that; I think this is a very non-standard way for values to sit in a mind