This is a little nitpicky, but i feel compelled to point out that the brain in the ‘human safety’ example doesn’t have to run for a billion years consecutively. If the goal is to provide consistent moral guidance, the brain can set things up so that it stores a canonical copy of itself in long-term storage, runs for 30 days, then hands off control to another version of itself, loaded from the canonical copy. Every 30 days control is handed to an instance of the canonical version of this person. The same scheme is possible for a group of people.
But this is a nitpick, because i agree that there are probably weird situations in the universe where even the wisest human groups would choose bad outcomes given absolute power for a short time.
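A minimal sketch of the 30-day handoff scheme described above, assuming a toy `Mind` class as a stand-in for the stored brain (the class, its fields, and `run_with_resets` are all hypothetical names invented for illustration). It only shows the bookkeeping: each term starts from a fresh copy of the canonical version, so no running instance ever accumulates more than one term of drift.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Mind:
    """Hypothetical stand-in for the stored person; not a real emulation."""
    values: str
    memories: list = field(default_factory=list)


def run_with_resets(canonical, total_days, term_days=30):
    """Run fresh copies of `canonical` back to back, one per term.

    Returns (number_of_terms, longest_continuous_run_in_days).
    """
    terms = 0
    longest_run = 0
    day = 0
    while day < total_days:
        # Load a new instance from long-term storage; the canonical copy
        # itself is never run and never modified.
        instance = copy.deepcopy(canonical)
        days_this_term = min(term_days, total_days - day)
        for offset in range(days_this_term):
            instance.memories.append(f"day {day + offset}")
        longest_run = max(longest_run, len(instance.memories))
        day += days_this_term
        terms += 1
        # The instance is discarded here; control passes to the next copy.
    return terms, longest_run


canonical = Mind(values="careful")
terms, longest_run = run_with_resets(canonical, total_days=365)
```

In this toy run, a year is covered by 13 instances, none of which runs longer than 30 days, and the canonical copy ends the year with no added memories.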
I appreciate this disentangling of perspectives. I had been conflating them before, but i like this paradigm.
I found this uncomfortable and unpleasant to read, but i’m nevertheless glad i read it. Thanks for posting.
I think the abridgement sounds nice but don’t anticipate it affecting me much either way.
I think the ability to turn this on/off in user preferences is a particularly good idea (as mentioned in Raemon’s comment).
I can follow most of this, but i’m confused about one part of the premise.
What if the agent created a low-resolution simulation of its behavior, called it Approximate Self, and used that in its predictions? Is the idea that this is doable, but represents an unacceptably large loss of accuracy? Are we in a ‘no approximation’ context where any loss of accuracy is to be avoided?
My perspective: It seems to me that humans also suffer from the problem of embedded self-reference. I suspect that humans deal with this by thinking about a highly approximate representation of their own behavior. For example, when i try to predict how a future conversation will go, i imagine myself saying things that a ‘reasonable person’ might say. Could a machine use an analogous form of non-self-referential approximation?
Great piece, thanks for posting.
It’s relevant to some forms of utilitarian ethics.
I think this is a clever new way of phrasing the problem.
When you said ‘friend that is more powerful than you’, that also made me think of a parenting relationship. We can look at whether this well-intentioned personification of AGI would be a good parent to a human child. They might be able to give the child a lot of attention, an expensive education, and a lot of material resources, but they might take unorthodox actions in the course of pursuing human goals.
(I’m not zhukeepa; i’m just bringing up my own thoughts.)
This isn’t quite the same as an improvement, but one thing that is more appealing about normal-world metaphilosophical progress than empowered-person metaphilosophical progress is that the former has a track record of working*, while the latter is untried and might not work.
*Slowly and not without reversals.
It implies that the Occamian prior should work well in any universe where the laws of probability hold. Is that really true?
Just to clarify, are you referring to the differences between classical probability and quantum amplitudes? Or do you mean something else?
Why do you think so? It’s a thought experiment about punitive acausal trade from before people realized that benevolent acausal trade was equally possible. I don’t think it’s the most interesting idea to come out of the Less Wrong community anymore.
Sorry, i couldn’t find the previous link here when i searched for it.
Just to be clear, i’m imagining counterfactual cooperation to mean the FAI building vaults full of paperclips in every region where there is a surplus of aluminium (or a similar metal). In the other possibility branch, the paperclip maximizer (which thinks identically) reciprocates by preserving semi-autonomous cities of humans among the mountains of paperclips.
If my understanding above is correct, then yes, i think these two would cooperate IF this type of software agent shares my perspective on acausal game theory and branching timelines.
In the last 48 hours i’ve felt the need for more than one of the abilities above. These would be very useful conversational tools.
I think some of these would be harder than others. This one sounds hard: ‘Letting them now that what they said set off alarms bells somewhere in your head, but you aren’t sure why.’ Maybe we could look for both scripts that work between two people who already trust each other, and scripts that work with semi-strangers. Or scripts that do and don’t require both participants to have already read a specific blog post, etc.
Something like a death risk calibration agency? Could be very interesting. Do any orgs like this exist? I guess the CDC (in the US govt) probably quantitatively compares risks within the context of disease.
One quote in your post seems more ambitious than the rest: ‘helping retrain people if a thing that society was worried about seems to not be such a problem’. I think that tons of people evaluate risks based on how scary they seem, not based on numerical research.
Note on 3D printing: Yeah, that one might take a while. It’s actually been around for decades, but still hasn’t become cheap enough to make a big impact. I think it’ll be one of those techs that takes 50+ years to go big.
Source: I used to work in the 3D printer industry.
I first see the stems, then i see the leaves.
I think humans spend a lot of time looking at our models of the world (maps) and not that much time looking at our actual sensory input.
A similar algorithm appears in Age of Em by Robin Hanson (‘spur safes’ in Chapter 14). Basically, a trusted third party allows copies of A and B to analyze each other’s source code in a sealed environment, then deletes almost everything that is learned.
A and B both copy their source code into a trusted computing environment (‘safe’), such as an isolated server or some variety of encrypted VM. The trusted environment instantiates a copy of A (A_fork) and gives it B_source to inspect. Similarly, B_fork is instantiated and allowed to examine A_source. There can be other inputs, such as some contextual information and a contract to discuss. They examine the code for several hours or so, but this is not risky to A or B because all information inside the trusted environment will mandatorily be deleted afterwards. The only outputs from the trusted environment are a secure channel from A_fork to A and one from B_fork to B. These may only ever output an extremely low-resolution one-time report. This can be one of the following 3 values: ‘Enter into the contract with the other’, ‘Do not enter into the contract with the other’, or ‘Maybe enter the contract’.
This does require a trusted execution environment, of course.
I don’t know if this idea is original to Hanson.
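The exchange described above can be sketched roughly as follows. This is a toy illustration, not Hanson’s specification: the `spur_safe` function, the `Verdict` enum, and the string-matching inspectors are all hypothetical, and a plain function call stands in for the trusted environment, which a real system would have to provide through actual isolation.

```python
from enum import Enum
from typing import Callable, Tuple


class Verdict(Enum):
    """The only three values allowed to leave the safe."""
    ENTER = "Enter into the contract with the other"
    DECLINE = "Do not enter into the contract with the other"
    MAYBE = "Maybe enter the contract"


# An inspector plays the role of a fork: it reads the other party's
# source plus the contract and forms an opinion.
Inspector = Callable[[str, str], Verdict]


def spur_safe(a_source: str, a_inspect: Inspector,
              b_source: str, b_inspect: Inspector,
              contract: str) -> Tuple[Verdict, Verdict]:
    """Return (report to A, report to B); nothing else leaves the safe."""
    # Inside the safe: A_fork examines B_source, B_fork examines A_source.
    a_fork_opinion = a_inspect(b_source, contract)
    b_fork_opinion = b_inspect(a_source, contract)
    # Coerce each output to a Verdict; any richer output raises ValueError
    # instead of leaking details about the inspected source.
    report_to_a = Verdict(a_fork_opinion)
    report_to_b = Verdict(b_fork_opinion)
    # The forks and everything they learned go out of scope here, which
    # stands in for the mandatory deletion step.
    return report_to_a, report_to_b


def cautious_inspector(other_source: str, contract: str) -> Verdict:
    """Toy heuristic (an assumption): look for telltale strings."""
    if "defect" in other_source:
        return Verdict.DECLINE
    if "honor(contract)" in other_source:
        return Verdict.ENTER
    return Verdict.MAYBE


reports = spur_safe(
    a_source="def act(contract): return honor(contract)",
    a_inspect=cautious_inspector,
    b_source="def act(contract): return defect()",
    b_inspect=cautious_inspector,
    contract="trade 10 widgets",
)
```

Here A is told to decline and B is told to enter, and neither learns anything about the other’s code beyond that one value.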