It seems like the purpose of the asteroid scenario is not to come up with ways of deflecting an asteroid, but as an example system in which two uncoordinated AIs (pardon the pun) can minimize impact in an interesting way.
michaelcohen
I had an idea for a prior for planners (the ‘p’ part of (p, R)) that I think would remove the no-free-lunch result. For a given planner, let its “score” be the average reward the agent gets for a randomly selected reward function (with a simplicity prior over reward functions). Let the prior probability for a particular planner be a function of this score, perhaps by applying a Boltzmann distribution over it. I would call this an evolutionary prior—planners that typically get higher reward given a randomly assigned reward function are more likely to exist. One could also randomize the transition function to see how planners do for arbitrary world dynamics, but it doesn’t seem particularly problematic, and maybe even beneficial, if we place a higher prior probability on planners that are unusually well-adapted to generate good policies given the particular dynamics of the world we’re in.
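A minimal sketch of this scoring scheme, under toy assumptions: planners and reward functions are represented as plain Python objects, `weights` stands in for a simplicity prior, and `rollout` stands in for running a planner under a reward function. None of these names come from an existing implementation.

```python
import math

def average_reward(planner, reward_fns, weights, rollout):
    """Score a planner: its weighted-average reward over sampled reward
    functions, with weights drawn from a (stand-in) simplicity prior."""
    total = sum(w * rollout(planner, r) for r, w in zip(reward_fns, weights))
    return total / sum(weights)

def evolutionary_prior(planners, reward_fns, weights, rollout, beta=1.0):
    """Boltzmann distribution over planner scores: planners that tend to earn
    high reward under a randomly drawn reward function get higher prior mass."""
    scores = [average_reward(p, reward_fns, weights, rollout) for p in planners]
    logits = [beta * s for s in scores]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]

# Toy demo: a "planner" is just a fixed action (0 or 1); reward functions
# score that action directly.
planners = [0, 1]
reward_fns = [lambda a: float(a == 0), lambda a: float(a == 1), lambda a: 0.5]
weights = [0.5, 0.25, 0.25]  # stand-in for a simplicity prior
rollout = lambda planner, r: r(planner)
prior = evolutionary_prior(planners, reward_fns, weights, rollout, beta=2.0)
```

In the demo, planner 0 does better on average under the weighted mix of reward functions, so it receives the larger prior probability.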
The only iterative process I have in mind is a normal RL agent. It’s the same agent every timestep, but as it learns about its environment, it becomes more capable. (At the first timestep, it doesn’t know anything.)
Succeeding at blowing up the moon would absolutely be unsafe. Let’s just assume that one step along the way toward blowing up the moon would be killing everyone. I just want to clarify that I’m not saying it would be fine if the agent succeeds in blowing up the moon.
Let t be the first timestep by which its best plan for blowing up the moon reaches some fixed probability of success. If the impact measure is “almost good”, and there are only a few ways to succeed in blowing up the moon while having apparently low impact, t might be greater than if the impact measure is horrible, but the same logic holds. At time t, the agent initiates a plan that is likely to fail.
A remaining question is how it fails. Will it succeed in tricking its operators and taking over the world, but then fail to blow up the moon (while maintaining “low impact”)? Or will it fail in tricking its operators and taking over the world (while maintaining “low impact”)? Intuitively, the latter seems like a bigger hurdle, but even if that’s not the case, given a sufficiently low success probability for the overall plan, the plan is likely to fail at both steps.
To see this: suppose I ask you to come up with a plan that has a one in a million chance of accomplishing A then B, both of which are non-trivial tasks. Is it more likely that the first plan you come up with has a 50% chance of accomplishing A and a one in 500,000 chance of accomplishing B, or is it more likely that the first plan you come up with has a one in a thousand chance of accomplishing A, and likewise for B? Whatever the answer, now suppose I ask you to come up with a plan that has a one in a billion chance of accomplishing A then B. Both probabilities should go down for this plan.
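The arithmetic behind this intuition can be checked directly. Independence of the two stages is assumed purely for illustration:

```python
# Two ways a one-in-a-million plan for "A then B" can decompose, as in the
# example above (assuming the stage successes are independent).
lopsided = 0.5 * (1 / 500_000)        # 50% at A, one-in-500,000 at B
balanced = (1 / 1_000) * (1 / 1_000)  # one-in-1,000 at each stage
assert abs(lopsided - 1e-6) < 1e-18
assert abs(balanced - 1e-6) < 1e-18

# Scaling the overall target down to one in a billion: if both stages scale
# together, each stage's success probability must drop as well.
per_stage_million = 1e-6 ** 0.5   # ~1e-3 per stage for a 1e-6 plan
per_stage_billion = 1e-9 ** 0.5   # ~3.2e-5 per stage for a 1e-9 plan
assert per_stage_billion < per_stage_million
```

Either decomposition multiplies out to the same overall probability; the point is that lowering the product forces at least one factor (and, symmetrically, typically both) to shrink.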
A couple of minor corrections: in one definition, there is a max over a variable that is actually an input to the function, so the max shouldn’t be there. Another one, and this isn’t quite as clear-cut: I think the whole history should appear in the definition of the Q-value. It seems that you intend it to mean all the utility accrued over a given interval of time, but the utility should be allowed to depend on the entire history of observations. The theoretical reason for this is that “really,” the utility is a function of the state of the universe, and all observations inform the agent’s probability distribution over which universe state it is in, not just the observations that come from the interval of time being evaluated. A concrete example is as follows: if an action appeared somewhere in the history that indicated that all observations thereafter were faked, the utility of that segment should reflect that—it should be allowed to depend on the previous observations that contextualize the observations of the interval in question. In other words, a utility function needs to be typed to allow all actions and observations from the whole history as inputs.
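To illustrate the typing point with a toy example (all names here are hypothetical, chosen only for illustration): the utility of an interval takes the whole history plus the interval's endpoints, so earlier context can change the interval's value.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str]        # an (action, observation) pair
History = Sequence[Step]

# The type argued for above: the utility of the interval [s, t) may depend on
# the entire history h, not just the slice h[s:t].
IntervalUtility = Callable[[History, int, int], float]

def toy_utility(h: History, s: int, t: int) -> float:
    """Count 'good' observations in [s, t), but if an earlier action indicated
    that subsequent observations were faked, the interval is worth nothing."""
    if any(action == "fake" for action, _ in h[:s]):
        return 0.0
    return float(sum(obs == "good" for _, obs in h[s:t]))

honest = [("noop", "good"), ("noop", "good"), ("noop", "good")]
hacked = [("noop", "good"), ("fake", "good"), ("noop", "good")]
```

Here `toy_utility(honest, 2, 3)` values the final step at 1.0, while `toy_utility(hacked, 2, 3)` values the identical-looking final step at 0.0, because an action outside the interval revealed the observations to be fake.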
I conclude from this that CDT should equal EDT (hence, causality must account for logical correlations, i.e. include logical causality).
or… CDT doesn’t halt. Here’s how I imagine CDT approaching EDT: as soon as you’re about to decide to do X, this presents itself as an observation to you, so now you can condition on the fact that you’re “about” to do X. Then, of course, if X still looks good, you do X, but if X doesn’t check out anymore, then you reconsider until you’re “about” to make another decision. This process clearly might not halt. This is obviously very hand-wavey, and I’m not totally confident that anything after my first sentence means anything at all.
The most I really feel comfortable saying is that there is another possibility on the table besides a) CDT = EDT and b) CDT can get Dutch booked: c) CDT does not halt.
P.S. This post was fascinating and clear.
Comment thread: positive feedback
Comment thread: general concerns/confusions
Comment thread: minor concerns
Comment thread: concerns with Assumption 1
Comment thread: concerns with Assumption 2
Comment thread: concerns with Assumption 3
Comment thread: concerns with Assumption 4
Comment thread: concerns with “the box”
Comment thread: adding to the prize pool
If you would like to contribute, please comment with the amount. If you have Venmo, please send the amount to @Michael-Cohen-45. If not, we can discuss.
Yes, but this is also for things that seem like mistakes in the exposition, but either have simple fixes or don’t impact the main theorems.
1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform this policy. With a human executing the policy, this leads to BoMAI accumulating reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe because of insatiable curiosity: the agent has to try killing everyone just to check that it’s not a weird cheat code.
2. I’m concerned about overloading...
I’m open to other terminology. Yes, there is no guarantee about what happens to the operator. As I’m defining it, benignity is defined to be not having outside-world instrumental goals, and the intuition for the term is “not existentially dangerous.”
3. The system seems limited to answering questions that the human operator can correctly evaluate...
Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; and provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way.)
One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would eventually appear, since an idea has to remain plausible to a better and better trained evaluator.
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode...
When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.
Thanks for the thoughts. I guess I need to do a lot more looking into CIRL before I come back to this. I do still wonder (although this is at an unformalized level) whether an agent could potentially learn a lot about moral evidence from the constraint that its own actions can’t cause the expected evidence to change. For example, if it realizes that a certain action (like subtle coercion) would result in something that it would have thought was legitimate evidence, then that situation must not actually count as evidence at all. That constraint seems to pack a decent minority of our requirements for value learning into a relatively simple statement. There may be other ways to encode such a constraint besides having an agent be uncertain about its function for determining what observations provide what evidence, though.