see also the literature on the problem of evil :P
My favourite theodicy is pre-incarnate consent: before we are born, we consent to our existence, both in heaven and on earth, with the afterlife offered to us as compensation for any harms suffered on earth.[1]
How this features in your plan:
Create some guys, who may or may not be honourable, selecting for property X (explained below).
Explain to the guys our general plan, i.e. we will try to find out whether they are honourable and, if we think they are, we will offer them the deal where they stop x-risk and compensate us with some fraction of the lightcone.
Explain the harms they are likely to suffer during this process.
Explain that we (or the post-foom honourable AIs) will try to compensate them for those harms, and what that compensation is likely to be (we won't be sure at this stage).
If they consent to this gamble, then we (re-)create them and do the general plan.
Unfortunately, some guys might be upset that we pre-created them for this initial deal, so property X is the property of not being upset by this.
- ^
The Pre-Existence Theodicy (Amos Wollen, Feb 21, 2025)
Potentially, we will be creating and destroying many minds and civilizations that matter (at minimum, perhaps, the ones that didn't have honorable beings).
I’m hopeful we could also select for honourable guys that are happy about their existence and being simulated like this.
For instance, if you’re quite sure you’ve figured out how to make and identify honorable guys, maybe you could try to make many different honorable guys, get bids from all of them, and give the contract to the best bid?
Alternatively: the AI promises that “I will fairly compensate you” where “fair” is to be decided by the AI when it has a better understanding of the situation we were in.
If x-risk is actually like 99% then it might offer us 1% of the lightcone; if x-risk is actually like 1% then it might offer us 99% of the lightcone.[1]
If the honourable AI would've accepted a deal only with >80% of the lightcone then the AI offers us 10%; if the honourable AI would've accepted a deal only with >20% of the lightcone then the AI offers us 90%.
In general, you have some bargaining solution (e.g. Nash, or KS), and the AI promises to follow that solution once it has a better understanding of the inputs to the solution, i.e. each side's BATNA and the set of feasible outcomes.
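To make this concrete, here is a minimal sketch (in Python, with made-up numbers) of the Nash case for a one-dimensional split of the lightcone, assuming each side's utility is linear in its share; the names human_batna and ai_batna are hypothetical stand-ins for each side's BATNA, not anything from the thread:

```python
# Toy Nash bargaining split of the lightcone between humans and the AI.
# Illustrative sketch only; assumes utilities linear in one's share.

def nash_bargaining_split(human_batna: float, ai_batna: float) -> float:
    """Return the human share x in [0, 1] maximising the Nash product
    (x - human_batna) * ((1 - x) - ai_batna).

    BATNAs are each side's expected payoff with no deal, measured in
    fractions of the lightcone. Requires human_batna + ai_batna < 1,
    i.e. some deal beats both BATNAs.
    """
    assert human_batna + ai_batna < 1, "no feasible deal improves on both BATNAs"
    # First-order condition of the Nash product gives a closed-form answer.
    return (1 + human_batna - ai_batna) / 2

# Example: humans' BATNA 0.01 (x-risk ~99%), AI's BATNA 0.2
# -> humans get ~40.5% of the lightcone under the Nash solution.
print(nash_bargaining_split(0.01, 0.2))  # 0.405
```

The point is only that, once the BATNAs and feasible set are pinned down later, the solution is mechanical; the hard part the AI defers is estimating those inputs.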
- ^
Maybe this explains why you are in an ancestor simulation of the AI safety community.
Here are some specific examples anyway:
I'd also add animal and AI welfare
to not disrupt human life; in particular, it should always remain possible for a community to choose to live some baseline-good life while not interacting with anything downstream of the AI or while only interacting with some chosen subset of things downstream of the AI more generally
one could start by becoming familiar with existing literature on these questions — on the biological, intellectual, and sociocultural evolution/development of trustworthiness, and on the (developmental) psychology of trustworthiness
I've been reading some of the behavioural economics of trust games. One interesting article here is “Bare promises: An experiment” (Charness and Dufwenberg, May 2010), which finds that humans aren't more likely to be nice after making a “bare promise” to be nice (where a “bare promise” is something like ticking a box saying you'll be nice); people only become more likely to be nice when the promise is made to the truster in open free-form communication.
Other findings from the literature:
video communication is better than text
knowing more facts about the truster is better
building rapport / becoming friends with the truster is better
Wanting to be the kind of guy who pays back for good acts (such as creating you and unleashing you), even if they were done by someone without the ability to track whether you are that kind of guy?
The AI should have some decent prob on the simulators having the ability to track whether they are that kind of guy, even if everything they know about the simulators suggests they lack that ability.
p(the creature is honorable enough for this plan) like, idk, i feel like saying
I’d put this much higher. My 90% confidence interval on the proportion of honourable organisms is 10^-3 to 10^-7. This is because many of these smart creatures will have evolved with much greater extrospective access to each other, so they follow open-source-ish game theory rather than the closed-source-ish game theory which humans evolved in. (Open to closed is a bit of a spectrum.)
Why might creatures have greater extrospective access to each other?
Maybe they are much better at reasoning about each other, i.e. they have simpler internals relative to their social reasoning capabilities.
Maybe they are parallelised and/or stateless, in ways that promote honour. For example, imagine if ants had become human-level intelligent: then ant-from-colony-A would cooperate in a one-shot prisoner's dilemma with ant-from-colony-B, because the first ant wants the second ant to have a good impression of the other A-colony ants. (The sketch below illustrates the open-source setup in a one-shot prisoner's dilemma.)
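A minimal sketch of what "open-source-ish game theory" looks like in code: each player in a one-shot prisoner's dilemma is a program that gets to read its opponent's source before moving, so a program that cooperates only with exact copies of itself sustains cooperation without repetition or reputation. This is the standard program-equilibrium toy example, not anything claimed in the thread; the bot names are hypothetical.

```python
# Toy "open-source" one-shot prisoner's dilemma: each player is a function
# that receives its opponent's source code and returns "C" or "D".
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent runs exactly this source code."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, whatever the opponent's source says."""
    return "D"

def play(bot_a, bot_b):
    """One open-source round: each bot sees the other's source, then moves."""
    src_a, src_b = inspect.getsource(bot_a), inspect.getsource(bot_b)
    return bot_a(src_b), bot_b(src_a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): cooperation in a one-shot game
print(play(clique_bot, defect_bot))  # ('D', 'D'): no exploitation by defectors
```

Creatures with more extrospective access to each other are playing something closer to this game than to the closed-source game humans evolved in.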
deal offered here is pretty fair
Another favourable disanalogy between (aliens, humans) and (humans, AIs): the AIs owe the humans their existence, so they are glad that we [created them and offered them this deal]. But humans presumably don't owe our existence to the aliens.
self-modify
NB: One worry is that, although honourable humans have this ability to self-modify, they do so via affordances which we won’t be able to grant to the AI.
However, I think the opposite is probably true: we can grant the AI affordances for self-modification which are much greater than those available to humans. (Because they are digital, etc.)
It’s also easy — if you want to be like this, you just can.
I think you can easily choose to follow a policy of never saying things you know to be false. (Easy in the sense of “considering only the internal costs of determining and executing the action consistent with this policy, ignoring the external costs, e.g. losing your job and friends.”) But I'm not sure it's easy to do the extra thing of “And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc.”
I’d really want to read essays you wrote about Parfit’s hitchhiker or one-shot prisoner’s dilemmas or something
My method would look something like the following (a schematic code sketch follows the list):
1. This person acts honourably on the normal distribution of situations X.
2. This person claims they would act honourably on a broader distribution of situations Y, which includes our weird situation. And (assuming we have extrospective access to their beliefs) this person believes they would act honourably on distribution Y.
3. This person endorses a decision-theoretic claim that they should act honourably on distribution Y.
We have some inductive evidence that (1-3) ensures they will act honourably on distribution Y, in the following sense:
For any pair (X', Y'), where X' is a normal distribution of situations and Y' is a weird distribution of situations, such that:
1. This person acts honourably on the normal distribution of situations X',
2. This person claims they would act honourably on a broader distribution of situations Y', and (assuming we have extrospective access to their beliefs) this person believes they would act honourably on distribution Y',
3. This person endorses a decision-theoretic claim that they should act honourably on distribution Y',
they do, in fact, behave honourably on distribution Y'.
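Here is the schematic sketch of that check, as promised above. Everything in it is a hypothetical stand-in (the HonourEvidence fields and distribution labels are not a real API); the point is just to make the logical structure of the induction explicit.

```python
# Schematic sketch of the inductive-evidence check described above.
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class HonourEvidence:
    acts_honourably_on: Callable[[str], bool]    # observed behaviour on a distribution
    claims_honour_on: Callable[[str], bool]      # what they say about a distribution
    believes_honour_on: Callable[[str], bool]    # what extrospection says they believe
    endorses_honour_on: Callable[[str], bool]    # decision-theoretic endorsement

def premises_hold(p: HonourEvidence, normal: str, weird: str) -> bool:
    """Conditions 1-3 for a (normal, weird) pair of situation distributions."""
    return (p.acts_honourably_on(normal)
            and p.claims_honour_on(weird)
            and p.believes_honour_on(weird)
            and p.endorses_honour_on(weird))

def inductive_support(p: HonourEvidence, past_pairs: Iterable[Tuple[str, str]]) -> bool:
    """For every past (X', Y') pair where conditions 1-3 held, check that the
    person did in fact behave honourably on Y'."""
    return all(p.acts_honourably_on(weird)
               for normal, weird in past_pairs
               if premises_hold(p, normal, weird))
```

If inductive_support holds across many and varied (X', Y') pairs, and conditions 1-3 hold for our actual (X, Y), that is the evidence that they will act honourably on Y.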
NB: I think that, perhaps, it will be easier to make/find/identify an honourable AI than an honourable human, because:
AIs are stateless and humans aren’t—so we can run experiments on them without worrying about the effects of earlier experiments.
We can run the same AI many times in parallel, but we can't do that with humans, so we can run experiments more serial-time-efficiently.
The AI is very honorable/honest/trustworthy — in particular, the AI would keep its promises even in extreme situations.
NB: It seems like we need a (possibly much weaker, but maybe in practice no weaker) assumption that we can detect whether the AI is lying about deals of the form in Step 2.
Humans might go extinct very soon after AI because AI accelerates technological progress and there was an extinction-causing technology ahead of us on the tech tree. That is, we would've gone extinct from the same technology in 2500 even without AI, but with AI the year-2500 tech level is “reached” a decade after AI arrives, so it looks like the AI was the cause.
(This point is from Paul Christiano iirc.)
Note that AI needn’t take over (voluntarily or forcefully) to accelerate technological progress like this.
what’s the principle here? if an agent would have the same observations in world W and W’ then their preferences must be indifferent between W and W’ ? this seems clearly false.
If you want a mundane existence you can simulate that until you’re bored
My mundane values care about real physical stuff not simulated stuff.
Yes, I support something like uplifting, as described in other comments in this post.
yeah I think we should allow christian homeschools to exist in the year 3000.
But this cuts against some other moral intuitions, like “people shouldn’t be made worse off as a means to an end” (e.g. I don’t think we should have wars as a means to inspire poets). And presumably the people in the christian homeschools are worse off.
Maybe the compromise is something like:
On every day you are in the homeschool, we will “uplift” you if we think you would “ideally” want that.
Something like pre-existence theodicy, i.e. people born to Christian homeschooling parents consent to that life before they are incarnate, possibly in return for compensation (supposing something like souls or personal identity exists).
I’m hopeful the details can be fleshed out in late crunch-time.
My best bet for what we should do with the North Sentinelese—and with everyone post-singularity—is that we uplift them if we think they would “ideally” want that. And “ideally” is in scare quotes because no one knows what that means.
Do adults keep promises to children, if they are otherwise trustworthy? Why, or why not?