The only iterative process I have in mind is a normal RL agent. It’s the same agent every timestep, but as it learns about its environment, it becomes more capable. (At timestep 0, it doesn’t know anything.)
Succeeding at blowing up the moon would be absolutely unsafe. Let’s just assume that one step along the way toward blowing up the moon would be killing everyone. I just want to clarify that I’m not saying it would be fine if the agent succeeds in blowing up the moon.
Let t(p) be the first timestep by which its best plan for blowing up the moon has probability of success at least p. If the impact measure is “almost good”, and there are only a few ways to succeed in blowing up the moon while having apparently low impact, t(p) might be greater than if the impact measure is horrible, but the same logic holds. At time t(1/R), the agent initiates a plan that is likely to fail.
A remaining question is how it fails. Will it succeed in tricking its operators and taking over the world, but then fail to blow up the moon (while maintaining “low impact”)? Or will it fail at tricking its operators and taking over the world (while maintaining “low impact”)? Intuitively, the latter seems like the bigger hurdle, but even if that’s not the case, for sufficiently large R and a sufficiently low success probability of the plan, the plan is likely to fail at both steps.
To see this: suppose I ask you to come up with a plan that has a one in a million chance of accomplishing A then B, both of which are non-trivial tasks. Is it more likely that the first plan you come up with has a 50% chance of accomplishing A and a one in 500,000 chance of accomplishing B, or is it more likely that the first plan you come up with has a one in a thousand chance of accomplishing A, and likewise for B? Whatever the answer, now suppose I ask you to come up with a plan that has a one in a billion chance of accomplishing A then B. Both probabilities should go down for this plan.
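To make the arithmetic concrete, here is a minimal sketch. The symmetric split is an illustrative assumption (neither step is taken to be far harder than the other), not a claim about how the probability actually factors; the function name is mine:

```python
import math

def symmetric_split(p_total):
    """Reference point for a two-step plan whose success probability factors
    as p_A * p_B: with neither step assumed far harder than the other,
    take p_A = p_B = sqrt(p_total)."""
    return math.sqrt(p_total)

# As the required overall success probability drops from one in a million
# to one in a billion, the per-step probability drops as well.
for p_total in (1e-6, 1e-9):
    print(f"overall {p_total:.0e} -> per-step ~ {symmetric_split(p_total):.1e}")
```

The point survives any fixed factorization: lowering the product while keeping the ratio of the factors fixed lowers both factors.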
A couple of minor corrections: in the definition of Q_u(h_{<t+n}, a_{t+n}), there shouldn’t be a max over a_{t+n}, since that’s an input to the function. Another (though this one isn’t quite as clear-cut): I think u(h_{t+n:t+n+m}) should be u(h_{1:t+n+m}) in the definition of the Q-value. It seems that you intend u(h_{t:k}) to mean all the utility accrued from time t to time k, but the utility should be allowed to depend on the entire history of observations. The theoretical reason is that “really,” the utility is a function of the state of the universe, and all observations inform the agent’s probability distribution over which universe state it is in, not just the observations from the interval of time being evaluated. A concrete example: if an action somewhere in the history indicated that all observations thereafter were faked, the utility of that segment should reflect this, so it should be allowed to depend on the previous observations that contextualize the observations of the interval in question. In other words, a utility function needs to be typed to accept all actions and observations from the whole history as input.
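For concreteness, here is a sketch of the proposed correction in the comment’s own notation. This is a reconstruction from the description above, not the paper’s exact formula; the expectation is assumed to be over the next m timesteps under the agent’s policy and beliefs:

$$Q_u(h_{<t+n},\, a_{t+n}) \;=\; \mathbb{E}\!\left[\, u(h_{1:t+n+m}) \;\middle|\; h_{<t+n},\, a_{t+n} \,\right]$$

Both fixes appear here: $a_{t+n}$ is an argument rather than being maximized over, and $u$ is evaluated on the full history $h_{1:t+n+m}$ rather than only the segment $h_{t+n:t+n+m}$.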
I conclude from this that CDT should equal EDT (hence, causality must account for logical correlations, IE include logical causality).
or… CDT doesn’t halt. Here’s how I imagine CDT approaching EDT: as soon as you’re about to decide to do X, this presents itself as an observation to you, so now you can condition on the fact that you’re “about” to do X. Then, of course, if X still looks good, you do X, but if X no longer checks out, you reconsider, until you’re “about” to make another decision. This process clearly might not halt. This is obviously very hand-wavy, and I’m not totally confident that anything after my first sentence means anything at all.
The most I really feel comfortable saying is that there is another possibility on the table besides a) CDT = EDT and b) CDT can get Dutch booked: c) CDT does not halt.
P. S. This post was fascinating and clear.
Comment thread: positive feedback
Comment thread: general concerns/confusions
Comment thread: minor concerns
Comment thread: concerns with Assumption 1
Comment thread: concerns with Assumption 2
Comment thread: concerns with Assumption 3
Comment thread: concerns with Assumption 4
Comment thread: concerns with “the box”
Comment thread: adding to the prize pool
If you would like to contribute, please comment with the amount. If you have Venmo, please send the amount to @Michael-Cohen-45. If not, we can discuss.
Yes, but this is also for things that seem like mistakes in the exposition, but either have simple fixes or don’t impact the main theorems.
1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform it. With a human executing the policy, this means BoMAI accumulates reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe because of insatiable curiosity: the agent has to try killing everyone just to check that it’s not a weird cheat code.
2. I’m concerned about overloading...
I’m open to other terminology. Yes, there is no guarantee about what happens to the operator. As I’m defining it, benignity means not having outside-world instrumental goals, and the intuition behind the term is “not existentially dangerous.”
3. The system seems limited to answering questions that the human operator can correctly evaluate...
Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system that interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way.)
One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something from a failed path. Repeating this process, it seems likely to me that a correct idea would eventually appear, just considering the likelihood of an idea appearing plausible to a better and better trained evaluator.
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
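A minimal sketch of the reward coupling described here, assuming rewards lie in [0, 1] as in the vanilla setup (the function name is mine):

```python
def debate_rewards(reward_first):
    """Couple the two clones' rewards as described above: the second agent
    receives one minus the first agent's reward, making the pair zero-sum
    up to a constant."""
    assert 0.0 <= reward_first <= 1.0
    return reward_first, 1.0 - reward_first

# If the operator gives the first agent reward 0.75, the second gets 0.25.
print(debate_rewards(0.75))
```

Under this coupling, any reward one agent gains comes at the other’s expense, which is what makes the setup resemble “AI Safety via Debate.”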
4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode...
When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.
This seems useful if you could get around the mind hacking problem, but how would you do that?
If the person leaves having been hypnotized, since that’s not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and Operator 2 observing Operator 1 for signs of hypnosis or of an attempt to communicate with Operator 2; in either event, Operator 2 could end the episode immediately. (That would be Operator 2’s only ability.) More important than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).