Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then learning becomes extremely slow and the oracle might output the escape action at random.
Randomly calling the escape action should not be a problem if it is a text string that is only read when r=1, and is ineffectual otherwise...
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
I don’t see why this has to be true, given that we get to choose the AI’s value function. Why can’t we just make the agent act-based?
My main concern about the counterfactual oracle is that it doesn’t prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then.
If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (since random actions will practically never produce the bad prediction) and a random decision of whether to read the response. Then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator and a null prediction as the default action (this equates to asking what the world would be like if a null action had been output; I'll release my current work on direct effects in AI safety soon, and feel free to ask for it in the meantime). I don't see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...
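For concreteness, here's a minimal sketch of that scheme under my own assumptions: `random_prediction`, `sample_world_state`, and `accuracy` are placeholder, environment-specific functions (not anything from the post), and NULL_PREDICTION is the default "output nothing" action.

```python
import random

NULL_PREDICTION = None

def collect_dataset(n_episodes, random_prediction, sample_world_state, accuracy):
    """Gather (prediction, world_state, accuracy, was_read) tuples using random
    predictions and a random decision of whether the prediction is read."""
    data = []
    for _ in range(n_episodes):
        pred = random_prediction()
        was_read = random.random() < 0.5               # coin flip: does anyone read it?
        shown = pred if was_read else NULL_PREDICTION
        world = sample_world_state(shown)              # world evolves given what was shown
        data.append((pred, world, accuracy(pred, world), was_read))
    return data

def natural_direct_effect(candidate, data, accuracy):
    """Score a candidate prediction by its accuracy averaged over world states
    from episodes where nothing was shown, i.e. the mediator (world_state) is
    held at the value it takes under the null/default prediction."""
    null_worlds = [world for (_, world, _, was_read) in data if not was_read]
    return sum(accuracy(candidate, w) for w in null_worlds) / len(null_worlds)

def choose_prediction(candidates, data, accuracy):
    """Output the candidate prediction with the largest estimated direct effect."""
    return max(candidates, key=lambda c: natural_direct_effect(c, data, accuracy))
```

Since the baseline term (the accuracy of the null prediction in those same worlds) is a constant, maximizing this score is the same as maximizing the natural direct effect.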
The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.
In particular, shutting down the system is just a way of saying “only maximize reward in the current timestep”, i.e. be an act-based agent. This can just be incorporated into the reward function.
Indeed, when reading the predictions of the counterfactual oracle we’re no longer in the counterfactual world (= the training distribution), so the predictions can be arbitrarily wrong (depending on how manipulative the predictions are and how many people peek at them).
The hope is that since the agent is not trying to find self-confirming prophecies, then hopefully the accidental effects of self-confirmation are sufficiently small...
Should now be fixed
Hey! Thanks for sharing your experience with RAISE.
I’m sorry to say it, but I’m not convinced by this plan overall. Also, on the meta-level, I think you got insufficient feedback on the idea before sharing it. Personally, my preferred format for giving inline feedback on a project idea is Google Docs, and so I’ve copied this post into a GDoc HERE and added a bunch of my thoughts there.
I don’t mean to discourage you guys, but I think a bunch of aspects of this proposal are pretty ill-considered and need a good deal of revision. I’d be happy to provide further input.
There is now, and it’s this thread! I’ll also go if a couple of other researchers do ;)
Ok! That’s very useful to know.
It seems pretty related to the Inverse Reward Design paper. I guess it’s a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.
As others have commented, it’s difficult to understand what this math is supposed to say.
My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator’s distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.
But this may be inaccurate, or there may be other material ideas here that I’ve missed.
At least typically, we’re talking about a strategy in the following sense. Q: Suppose you want to pick a teacher for a new classroom; how should you pick one? A: You randomly sample from teachers above some performance threshold, in some base distribution. This works best given some fixed, finite amount of “counterfeit performance” in that distribution.
If we treat the teachers as a bunch of agents, we don’t yet have a game-theoretic argument that we should actually expect the amount of counterfeit performance (I) to be bounded. It might be that all of the teachers exploit the metric as far as they can, and counterfeit performance is unbounded...
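To make the bounded-vs-unbounded point concrete, here's a toy Monte Carlo sketch (my own illustration, not from the post; the teacher model and numbers are made up):

```python
import random

def pick(measured, rule, top_frac=0.1):
    """Select an index either by argmaxing the measured metric or by sampling
    uniformly among candidates above a measured-performance threshold."""
    n = len(measured)
    if rule == "argmax":
        return max(range(n), key=lambda i: measured[i])
    threshold = sorted(measured)[int((1 - top_frac) * n)]
    pool = [i for i in range(n) if measured[i] >= threshold]
    return random.choice(pool)

def simulate(n_teachers=1000, n_gamers=10, unbounded=False, trials=500):
    """Average true performance of the selected teacher under each rule.
    'Gamers' have true performance 0 but inflate their measured score; the
    bounded case caps how many can do this, the unbounded case lets everyone."""
    results = {"argmax": 0.0, "threshold-sample": 0.0}
    for _ in range(trials):
        true_perf = [random.random() for _ in range(n_teachers)]
        measured = list(true_perf)
        gamers = range(n_teachers) if unbounded else random.sample(range(n_teachers), n_gamers)
        for i in gamers:
            true_perf[i] = 0.0   # gaming the metric instead of doing the job
            measured[i] = 2.0    # looks better than any honest teacher
        for rule in results:
            results[rule] += true_perf[pick(measured, rule)] / trials
    return results

print(simulate(unbounded=False))  # argmax is ruined; threshold-sampling degrades only a little
print(simulate(unbounded=True))   # with unbounded gaming, both rules collapse
```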
I don’t fully understand the rest of the comment.
This is a rough draft, so pointing out any errors by email or PM is greatly appreciated.
As another anecdata point, I considered writing more to pursue the prize pool but ultimately didn’t do any more (counterfactual) work!
Note: This is bound to contain a bunch of errors and sources of confusion so please let me know about them here.
Maybe the new-conversation place is the bar or snack bar. (Plausible deniability!)
[Note: This comment is three years later than the post]
The “obvious idea” here unfortunately seems not to work, because it is vulnerable to so-called “infinite improbability drives”. Suppose B is a shutdown button, and P(b|e) gives some weight to B=pressed and B=unpressed. Then the AI will benefit from selecting a Q such that it always chooses an action a in which it enters a lottery, and if it does not win, the button B is pushed. In this circumstance, P(b|e) is unchanged, while both P(c|b=pressed,a,e) and P(c|b=unpressed,a,e) allocate almost all of the probability to great C outcomes. So the approach will create an AI that wants to exploit its ability to determine B.
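To spell the objection out, here is the scoring rule I'm assuming the “obvious idea” uses (my reconstruction, not a quote from the post):

score(a) = P(b=pressed|e) · E[U(C) | b=pressed, a, e] + P(b=unpressed|e) · E[U(C) | b=unpressed, a, e]

The lottery action leaves both weights P(b|e) untouched while pushing both conditional expectations toward lottery-win outcomes, so the score is inflated even though the agent is effectively deciding whether B gets pressed.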
I see. What I was trying to do was answer your terminology question by addressing simple extreme cases. E.g. if you ask an AI to disconnect its shutdown button, I don’t think it’s being incorrigible. If you ask an AI to keep you safe, and then it disconnects its shutdown button, it is being incorrigible.
I think the main way the religion case differs is that the AI system is interfering with our intellectual ability for strategizing about AI rather than our physical systems for redirecting AI, and I’m not sure how that counts. But if I ask an AI to keep me safe and it mind-controls me to want to propagate that AI, that’s sure incorrigible. Maybe, as you suggest, it’s just fundamentally ill-defined...
I could be wrong, but I feel like if I ask for education or manipulation and the AI gives it to me, and bad stuff happens, that’s not a problem with the redirectibility or corrigibility of the agent. After all, it just did what it was told. Conversely, if the AI system refuses to educate me, that seems rather more like a corrigibility problem. A natural divider is that with a corrigible AI we can still inflict harm on ourselves via our use of that AI as a tool.
Does this sound right?
A corrigible AI might not turn against its operators and might not kill us all, yet the outcome can still be catastrophic. To prevent this, we’d definitely want our operators to be metaphilosophically competent, and we’d definitely want our AI not to corrupt them.
I agree with this.
a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste.
There’s a lot of broad model uncertainty here, but yes, I’m sympathetic to this position.
Does the new title seem better?
At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.
What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.
This I agree with.