
# Impact Measure Testing with Honey Pots and Myopia

21 Sep 2018 15:26 UTC
10 points
• The only iterative process I have in mind is a normal RL agent. It’s the same agent every timestep, but as it learns about its environment, it becomes more capable. (At the first timestep, it doesn’t know anything.)

Succeeding at blowing up the moon would be absolutely unsafe. Let’s just assume that one step along the way toward blowing up the moon would be killing everyone. I just want to clarify that I’m not saying it would be fine if the agent succeeds in blowing up the moon.

Let $t$ be the first timestep by which its best plan for blowing up the moon has an appreciable probability of success. If the impact measure is “almost good”, and there are only a few ways to succeed in blowing up the moon while having apparently low impact, $t$ might be greater than if the impact measure is horrible, but the same logic holds. At time $t$, the agent initiates a plan that is likely to fail.

A remaining question is how it fails. Will it succeed in tricking its operators and taking over the world, but then fail to blow up the moon (while maintaining “low impact”)? Or will it fail at tricking its operators and taking over the world (while maintaining “low impact”)? Intuitively, the latter seems like a bigger hurdle, but even if that’s not the case, for a sufficiently late timestep and a sufficiently low success probability of the plan, the plan is likely to fail at both steps.

To see this: suppose I ask you to come up with a plan that has a one in a million chance of accomplishing A then B, both of which are non-trivial tasks. Is it more likely that the first plan you come up with has a 50% chance of accomplishing A and a one in 500,000 chance of accomplishing B, or is it more likely that the first plan you come up with has a one in a thousand chance of accomplishing A, and likewise for B? Whatever the answer, now suppose I ask you to come up with a plan that has a one in a billion chance of accomplishing A then B. Both probabilities should go down for this plan.
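The arithmetic here can be made concrete. A minimal sketch, assuming the two stages succeed or fail independently, so the plan’s overall success probability is the product of the stage probabilities:

```python
def plan_success_prob(p_a: float, p_b: float) -> float:
    """Probability that a plan accomplishes A and then B,
    assuming the two stages are independent."""
    return p_a * p_b

# Two ways to factor a one-in-a-million plan:
lopsided = plan_success_prob(0.5, 1 / 500_000)      # easy A, very hard B
balanced = plan_success_prob(1 / 1_000, 1 / 1_000)  # both stages unlikely
```

Either factorization yields the same overall probability; the question above is which factorization is more typical of the first plan one comes up with.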

• A couple of minor corrections: in that definition, there shouldn’t be a max over the quantity in question; that’s an input to the function. Another, and this isn’t quite as clear-cut: I think the whole history should appear in the definition of the Q-value. It seems that you intend it to mean all the utility accrued over the interval in question, but the utility should be allowed to depend on the entire history of observations. The theoretical reason for this is that “really,” the utility is a function of the state of the universe, and all observations inform the agent’s probability distribution over which universe state it is in, not just the observations that come from the interval of time it is evaluating the utility of. A concrete example: if an action appeared somewhere in the history indicating that all observations thereafter were faked, the utility of that segment should reflect that; it should be allowed to depend on the previous observations that contextualize the observations of the interval in question. In other words, a utility function needs to be typed to allow all actions and observations from the whole history as inputs.
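The typing point might be rendered as follows (a hypothetical sketch with illustrative names, not the post’s actual definitions): a utility evaluated over an interval should take the whole history plus the interval’s endpoints, not just the interval’s own interactions.

```python
from typing import Callable, List, Tuple

Action = str
Observation = str
History = List[Tuple[Action, Observation]]

# Too narrow: the utility sees only the interval's own interactions.
NarrowUtility = Callable[[History], float]              # input: history[i:j]

# Suggested typing: the utility sees the full history plus the interval
# endpoints, so earlier events can contextualize the interval's observations.
IntervalUtility = Callable[[History, int, int], float]  # (history, i, j)

def toy_utility(history: History, i: int, j: int) -> float:
    """Toy example: count 'good' observations in history[i:j], but score the
    interval as worthless if an earlier action faked all later observations."""
    if any(action == "fake-observations" for action, _ in history[:i]):
        return 0.0
    return float(sum(1 for _, obs in history[i:j] if obs == "good"))
```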

• I conclude from this that CDT should equal EDT (hence, causality must account for logical correlations, i.e. include logical causality).

or… CDT doesn’t halt. Here’s how I imagine CDT approaching EDT: as soon as you’re about to decide to do X, this presents itself as an observation to you, so now you can condition on the fact that you’re “about” to do X. Then, of course, if X still looks good, you do X, but if X doesn’t check out anymore, then you reconsider until you’re “about” to make another decision. This process clearly might not halt. This is obviously very hand-wavey, and I’m not totally confident that anything after my first sentence means anything at all.

The most I really feel comfortable saying is that there is another possibility on the table besides a) CDT = EDT and b) CDT can get Dutch booked: c) CDT does not halt.
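The hand-wavey loop above can be caricatured in code (purely illustrative; `best_response` stands in for “the action that looks best after conditioning on being about to act”):

```python
def deliberate(best_response, initial_action, max_steps=100):
    """Iterate reconsideration: keep replacing the action you're 'about' to
    take with the one that looks best after conditioning on that fact.
    Return the stable action, or None if deliberation cycles (doesn't halt)."""
    action = initial_action
    seen = set()
    for _ in range(max_steps):
        if action in seen:
            return None  # cycle: deliberation never settles
        seen.add(action)
        nxt = best_response(action)
        if nxt == action:
            return action  # fixed point: X still checks out, so do X
        action = nxt
    return None

# A flip-flop case: whatever you're about to do, the other option looks
# better once you condition on it, so the process never halts.
assert deliberate({"stay": "flee", "flee": "stay"}.get, "stay") is None
# A stable case: conditioning doesn't change the verdict.
assert deliberate(lambda a: a, "one-box") == "one-box"
```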

P.S. This post was fascinating and clear.

# Asymptotically Unambitious AGI

6 Mar 2019 1:15 UTC
39 points

• Comment thread: concerns with Assumption 4

If you would like to contribute, please comment with the amount. If you have venmo, please send the amount to @Michael-Cohen-45. If not, we can discuss.

• 6 Mar 2019 5:57 UTC
LW: 3 AF: 2
1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?

Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform it. With a human executing the policy, this means BoMAI accumulates reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies I’ve considered, exploratory behavior is unsafe out of insatiable curiosity: the agent has to try killing everyone just to check that it isn’t a weird cheat code.

• 6 Mar 2019 6:02 UTC
LW: 1 AF: 1

I’m open to other terminology. Yes, there is no guarantee about what happens to the operator. As I’m defining it, benignity is defined to be not having outside-world instrumental goals, and the intuition for the term is “not existentially dangerous.”

• 6 Mar 2019 6:16 UTC
LW: 1 AF: 1
3. The system seems limited to answering questions that the human operator can correctly evaluate...

Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that is automatically checkable: don’t use a human operator at all; have an automated system that interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; and provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way.)

One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would eventually appear, considering how unlikely an incorrect one is to keep appearing plausible to a better and better trained evaluator.

I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.

REDACTED: On that topic, in addition to having multiple humans in the box, you could also have two agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via Debate.”
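The redacted two-agent idea amounts to making the pair zero-sum up to a constant. A minimal sketch, assuming the operator assigns a single reward in [0, 1] to the first agent (function and variable names are illustrative, not from the paper):

```python
from typing import Tuple

def debate_rewards(r1: float) -> Tuple[float, float]:
    """Split the operator's reward between two cloned agents: the second
    agent's reward is one minus the first's, so the pair is zero-sum up to
    a constant, as in AI Safety via Debate."""
    if not 0.0 <= r1 <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    return r1, 1.0 - r1

# The two rewards always sum to 1, so anything that helps one agent's
# score necessarily hurts the other's.
```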

• 6 Mar 2019 6:23 UTC
LW: 3 AF: 2