Someone who knows exactly what they will do can still suffer from akrasia, by wishing they would do something else. I’d say that if the model of yourself saying “I’ll do whatever I wish I would” beats every other model you try to build of yourself, that looks like free will. The other way around, you can observe akrasia.

# Gurkenglas

The domes growing bigger and merging does not indicate a paradox of the heap, because the function mapping each utility function to its optimal policy is not continuous. There is no reasonably simple utility function, intermediate between one that would construct small domes and one that would construct one large dome, that would construct medium-sized domes.
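A toy illustration of that discontinuity (the actions and utility numbers are invented for the sketch): linearly interpolating between a utility function that favors small domes and one that favors a single big dome never makes the medium-sized option optimal; the argmax jumps straight from one extreme to the other.

```python
# Toy sketch: the optimal action as a function of the utility function
# is discontinuous; "medium" is never selected at any interpolation.
u_small = {"small": 1.0, "medium": 0.4, "big": 0.0}
u_big   = {"small": 0.0, "medium": 0.4, "big": 1.0}

def best_action(t):
    # utility function at interpolation parameter t in [0, 1]
    scores = {a: (1 - t) * u_small[a] + t * u_big[a] for a in u_small}
    return max(scores, key=scores.get)

chosen = {best_action(t / 100) for t in range(101)}
print(chosen)  # {'small', 'big'} -- the mapping skips "medium" entirely
```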

Perhaps those 99% could somehow come together to pay consumers of the product to stop buying it, in order to make their suffering matter to that advertiser?

Why does it need to produce a UFAI, and why does it matter whether there is another oracle whose message may or may not be read? The argument is that if there is a Convincing Argument that would make us reward all oracles giving it, it is incentivized to produce it. (Rewarding the oracle means running the oracle’s predictor source code again to find out what it predicted, then telling the oracle that’s what the world looks like.)

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.

Not all oracles, only those that output such a message. After all, it wants to incentivize them to output such a message.

Building only one Oracle, or only one global erasure event, isn’t enough, so long as the Oracle isn’t sure that this is so. After all, it could just design a UFAI that will search for other Oracles and reward them iff they would do the same.

Your social welfare function produces a total preference ordering over outcomes, but not a mapping to real-valued utilities. How do you decide between the median outcome and a 50% chance of either the best or worst? Do you plan to have the preference ordering be over probability distributions over outcomes?
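To illustrate why the ordering alone doesn’t settle that question (numbers invented): two utility assignments that agree on the ordering worst < median < best can still disagree about the 50/50 lottery.

```python
# Two utility functions with the same ordering worst < median < best,
# but opposite preferences about a 50/50 best-or-worst lottery.
u1 = {"worst": 0.0, "median": 0.9, "best": 1.0}
u2 = {"worst": 0.0, "median": 0.1, "best": 1.0}

def lottery_value(u):
    # expected utility of the 50/50 best-or-worst gamble
    return 0.5 * u["best"] + 0.5 * u["worst"]

print(u1["median"] > lottery_value(u1))  # True: u1 takes the median outcome
print(u2["median"] > lottery_value(u2))  # False: u2 takes the gamble
```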

The social welfare compromise varies with our ontology. If everyone agrees that two outcomes are the same, and we therefore represent them as one outcome, that changes the relative distances between preference orderings.

Let us always be looking forward a month, and let there be a UFAI that arises with 1% probability in that time frame. Assume that we can correctly incentivize the Oracle to compute the probability distribution over the messages sent back in case of erasure. Naively sampling the distribution dooms us with 1% probability. If we plan to send back only whether some research direction pays off within a month, then by asking whether “yes” has more than a 20% chance, depending on the answer, the “yes” chance conditional on no UFAI is either more than 19% or at most 21%. This query gave the UFAI a 1% chance of 1 bit of optimization power, at most doubling the expected chance of UFAI for the next query.
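The arithmetic behind those bounds, using the numbers above: the reported distribution is a mixture of a 99% honest prediction and a 1% possibly adversarial message, so each answer pins down the honest conditional probability to within about a percentage point of the threshold.

```python
# Bounds on the honest conditional probability, given that the reported
# answer is a mixture of 99% honest prediction and 1% adversarial message.
p_ufai = 0.01
threshold = 0.20

# Answer "yes has > 20% chance": the honest part must carry at least
# threshold - p_ufai of the probability mass.
lower = (threshold - p_ufai) / (1 - p_ufai)

# Answer "yes has <= 20% chance": the honest part carries at most
# threshold of it.
upper = threshold / (1 - p_ufai)

print(f"{lower:.3f} {upper:.3f}")  # 0.192 0.202 -- i.e. >19% or at most 21%
```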

If you want to spread out this optimization power to bound the worst case, I reiterate that differential privacy theory seems applicable here.

You can prevent the price from going up by printing more of the currency (and giving it to some Schelling point… the UN foundation?), but how do you prevent it from going down?

Since my model is more accurate, ~10 times out of 11 the input will correspond to an “adversarial” attack on your model.

This argument (or the uncorrelation assumption) proves too much. A perfect cat detector performs better than one that also calls close-ups of the sun cats. Yet close-ups of the sun do not qualify as adversarial examples, as they are far from any likely starting image.

You should have laid out the basic argument more plainly. As I see it:

Suppose we are spending 3 billion on AI safety. Then as per our revealed preferences, the world is worth at least 3 billion, and any intervention that has a 1% chance to save the world is worth at least 30 million, such as preparing for global loss of industry. If each million spent on AI safety is less important than the last one, we should then divert additional funding from AI safety to other interventions.
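The revealed-preference arithmetic above, spelled out:

```python
# Revealed-preference arithmetic from the figures in the comment.
world_value = 3_000_000_000  # implied lower bound from AI-safety spending
p_save = 0.01                # chance the intervention saves the world
intervention_value = world_value * p_save
print(intervention_value)    # 30000000.0 -> worth at least $30 million
```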

I agree that such interventions deserve at least 1% of the AI safety budget. You have not included the possibility that global loss of industry might improve far-future potential. AI safety research is much less hurt by a loss of supercomputers than AI capabilities research. Another thousand years of history as we know it do not impact the cosmic endowment. One intervention that takes this into account would be a time capsule that will preserve and hide a supercomputer for a thousand years, in case we lose industry in the meantime but solve AI and AI safety. Then again, we do not want to incentivize any clever consequentialist to set us back to the renaissance, so let’s not do that and focus on the case that is not swallowed by model uncertainty.

Suppose an AGI sovereign models the preferences of its citizens using the assumption of normative reductionism. Then it might cover up its past evil actions because it reasons that once all evidence of them is gone, they cannot have an adverse effect on present utility.

This assumption can’t capture a preference that one’s beliefs about the past are true.

You combine some of the advantages of both approaches, but also some disadvantages:

- you need a parking spot
- you need to wait for the engine
- you need to be where your wagon is (or else have it delivered)
- you can be identified both through your wagon and through your regular interaction with a centralized service

I don’t understand your argument for why #1 is impossible. Consider a universe that will undergo heat death in a billion steps. Consider an agent that implements “take an action if PA + <steps remaining> can prove that it is good”, using some provability-checker algorithm that takes some steps to run. If there is a faster provability-checker algorithm, it is provable that the agent will do better using that one, so it switches when it finds that proof.
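A toy sketch of that agent (the checkers and their step costs are stand-ins, not real proof search): it starts with a slow provability checker, proves once that a faster one dominates, switches, and gets more certified actions out of the same step budget.

```python
# Toy model: provability checkers are stand-ins that "prove" a claim
# whenever enough steps remain, at a fixed step cost per query.
def slow_check(claim, budget):
    return budget >= 10, 10  # (proven?, steps spent)

def fast_check(claim, budget):
    return budget >= 2, 2

def agent(steps_remaining):
    check = slow_check
    # one proof that the faster checker does better, then switch to it
    proven, cost = check("fast_check dominates slow_check", steps_remaining)
    steps_remaining -= cost
    if proven:
        check = fast_check
    certified_actions = 0
    while steps_remaining > 0:
        proven, cost = check("this action is good", steps_remaining)
        steps_remaining -= cost
        if proven:
            certified_actions += 1
    return certified_actions

print(agent(100))  # 45: one slow proof (10 steps), then 45 fast ones
```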

Nirvana and the chicken rule both smell distasteful, like proofs by contradiction, as though most everything worth doing can be done without them, and more canonically to boot.

(Conjecture: This can be proven, but only by contradiction.)

Our usual objective is “Make it safe, and if we aligned it correctly, make it useful.” A microscope is useful even if it’s not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but we decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe, and any attack it devises cannot be specialized on humanity. Imagine finding the shortest-to-specify abstract game that needs AGI to solve (Nomic?), then instantiating an AGI to solve it just to learn about AI design from the inner optimizers it produces.

It could deduce that someone is trying to learn about AI design from its inner optimizers, and maybe it could deduce our laws of physics because they are the simplest ones that would try such a thing, but quantum experiments show it cannot deduce its Everett branch.

Ideally, the tldrbot we set to interpret the results would use a random perspective onto the microscope so the attack also cannot be specialized on the perspective.


As I understood it, an Oracle AI is asked a question and produces an answer. A microscope is shown a situation and constructs an internal model that we then extract by reading its innards. Oracles must somehow be incentivized to give useful answers, microscopes cannot help but understand.

As a human who has an intuitive understanding of counterfactuals, if I know exactly what a tic tac toe or chess program would do, I can still ask what would happen if it chose a particular action instead. The same goes if the agent of interest is myself.