I agree that some people can benefit from doing both, although getting everyone online is a hard collective action problem. I just claim that many researchers will satisfice with OP. At MIRI/FHI/OpenAI there are ~30-150 researchers, who think about a wide range of areas, which seems broadly comparable to the researchers among LessWrong/AF’s active users (depending on your definition of ‘researcher’, or ‘active’). Idea-exchange is extended by workshops and people moving jobs. Many in such a work environment will find that FP has an unacceptably low signal-to-noise ratio and will inevitably avoid FP...
I would note that many of these factors apply as benefits of office-chat participation (OP) as well. The main benefit of FP absent from OP, I suppose, is preparing you for efficient written communication, but the rest seem to feature in both. The fact that their benefits overlap explains why remote researchers benefit so much more than others from FP.
Aside from yourself, the other CHAI grad students don’t seem to have written up their perspectives of what needs to be done about AI risk. Are they content to just each work on their own version of the problem?
I think this is actually pretty strategically reasonable.
CHAI students get high returns, in terms of their probability of attaining a top professorship, from writing papers. A top professorship is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The time they might spend writing up their research strategy does not help with this, nor with recruiting help for their line of work (because other nearby researchers face similar pressures, and because academia is not structured to have PhD students lead large teams).
Moreover, if they are pursuing academic success, they face strong incentives to work on particular problems, and so their research strategies may be somewhat distorted by these incentives, decreasing the quality of a research agenda written in that context.
When I look at CHAI research students, I see some pursuing IRL, some pursuing game theory, some pursuing the research areas of their supervisors (all of which could lead to professorships), and some pursuing projects of other research leaders like MIRI or Paul. This seems healthy to me.
In general, thinking of yourself communicating your ideas to a less intelligent and knowledgeable person could push you in the direction of confabulating freer-flowing stories, whereas imagining yourself communicating your ideas to a smarter person could push you in the direction of saying less, with higher rigour.
It seems like which one is desirable depends on the individual and the context (cf. the Law of Equal and Opposite Advice).
Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random.
The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...
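To put rough numbers on the concern quoted above, here is a back-of-the-envelope sketch. It assumes episodes are independent and that each exploratory output hits an escape message with a fixed probability; the function names and the values of epsilon and p_escape are hypothetical, chosen only for illustration. It does not model whether the message is ever read.

```python
def expected_rewarded_episodes(epsilon: float, n_episodes: int) -> float:
    """Expected number of episodes that yield a reward (each with probability epsilon)."""
    return epsilon * n_episodes


def prob_any_escape(p_escape: float, n_episodes: int) -> float:
    """Probability that at least one of n independent outputs is an escape message."""
    return 1.0 - (1.0 - p_escape) ** n_episodes


# Hypothetical numbers: epsilon is comparable to the chance of a random escape message.
epsilon = 1e-3
p_escape = 1e-3
n_episodes = 1000

rewarded = expected_rewarded_episodes(epsilon, n_episodes)
escape_risk = prob_any_escape(p_escape, n_episodes)
print(f"expected rewarded episodes: {rewarded:.2f}")
print(f"P(at least one escape message): {escape_risk:.3f}")
```

Under these assumptions, by the time the oracle has received about one reward signal, it has more likely than not emitted an escape message at least once. Whether that matters depends on the point made in the reply: whether the message is only read when r=1.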
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
I don’t see why this has to be true, given that we get to choose the AI’s value function. Why can’t we just make the agent act-based?
My main concern about the counterfactual oracle is that it doesn’t prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then.
If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (as random actions will practically never make the bad prediction) and random decision of whether to read the response. And then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator, and a null prediction as the default action. (this equates to asking what the world would be like if a null action was outputted—I’ll release my current work on direct effects in AI safety soon, and feel free to ask for it in the meantime). I don’t see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...
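As a toy illustration of what "maximize the natural direct effect" could mean here, the sketch below uses the standard mediation formula for discrete variables, assuming no unobserved confounding. It is not the forthcoming work mentioned above. The predictions ("crash", "no_crash"), the world states ("calm", "panic") and all probabilities are hypothetical.

```python
def natural_direct_effect(action, default, p_state, accuracy):
    """NDE of `action` relative to `default`, with world_state as the mediator.

    The mediator keeps the distribution it would have under the default
    (null) prediction; only the direct prediction -> accuracy path varies.
    """
    return sum(
        p * (accuracy[(action, state)] - accuracy[(default, state)])
        for state, p in p_state[default].items()
    )


def total_effect(action, default, p_state, accuracy):
    """Ordinary expected accuracy gain, letting the prediction move the world."""
    def expected(a):
        return sum(p * accuracy[(a, s)] for s, p in p_state[a].items())
    return expected(action) - expected(default)


# Hypothetical model: predicting "crash" causes panic (self-confirming).
p_state = {
    "null": {"calm": 0.7, "panic": 0.3},
    "no_crash": {"calm": 0.7, "panic": 0.3},
    "crash": {"calm": 0.05, "panic": 0.95},
}
accuracy = {
    ("null", "calm"): 0.0, ("null", "panic"): 0.0,
    ("no_crash", "calm"): 1.0, ("no_crash", "panic"): 0.0,
    ("crash", "calm"): 0.0, ("crash", "panic"): 1.0,
}

predictions = ["no_crash", "crash"]
nde = {a: natural_direct_effect(a, "null", p_state, accuracy) for a in predictions}
te = {a: total_effect(a, "null", p_state, accuracy) for a in predictions}
best_by_nde = max(predictions, key=nde.get)
best_by_te = max(predictions, key=te.get)
print(nde, te, best_by_nde, best_by_te)
```

In this toy model the total effect favours the self-confirming "crash" prediction, while the direct effect favours "no_crash", because the world state is evaluated as if the null prediction had been output.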
The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.
In particular, shutting down the system is just a way of saying “only maximize reward in the current timestep”, i.e. be an act-based agent. This can just be incorporated into the reward function.
Indeed, when reading the predictions of the counterfactual oracle we’re not in the counterfactual world (=training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how much the predictions are manipulative and how many people peek at it).
The hope is that since the agent is not trying to find self-confirming prophecies, then hopefully the accidental effects of self-confirmation are sufficiently small...
Should now be fixed
Hey! Thanks for sharing your experience with RAISE.
I’m sorry to say it, but I’m not convinced by this plan overall. Also, on the meta-level, I think you’ve got insufficient feedback on the idea before sharing it. Personally, my preferred format for giving inline feedback on a project idea is Google Docs, and so I’ve copied this post into a GDoc HERE and added a bunch of my thoughts there.
I don’t mean to discourage you guys, but I think that a bunch of aspects of this proposal are pretty ill-considered and need a bunch of revision. I’d be happy to provide further input.
There is now, and it’s this thread! I’ll also go if a couple of other researchers do ;)
Ok! That’s very useful to know.
It seems pretty related to the Inverse Reward Design paper. I guess it’s a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.
As others have commented, it’s difficult to understand what this math is supposed to say.
My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator’s distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.
But this may be inaccurate, or there may be other material ideas here that I’ve missed.
At least typically, we’re talking about a strategy in the following sense. Q: Suppose you want to pick a teacher for a new classroom, how should you pick a teacher? A: you randomly sample from teachers above some performance threshold, in some base distribution. This works best given some fixed finite amount of “counterfeit performance” in that distribution.
If we treat the teachers as a bunch of agents, we don’t yet have a game-theoretic argument that we should actually expect the amount of counterfeit performance (I) to be bounded. It might be that all of the teachers exploit the metric as far as they can, and counterfeit performance is unbounded...
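The threshold-sampling strategy above can be sketched in a few lines. This is a minimal illustration only: the teacher names, scores and threshold are hypothetical, the base distribution is taken to be uniform over the listed teachers, and it says nothing about whether counterfeit performance is actually bounded.

```python
import random


def sample_above_threshold(candidates, score, threshold, rng):
    """Pick uniformly at random among candidates whose measured score meets the threshold.

    Unlike taking the argmax, this does not single out whoever exploits
    the metric hardest; it only requires clearing the bar.
    """
    eligible = [c for c in candidates if score(c) >= threshold]
    if not eligible:
        raise ValueError("no candidate meets the threshold")
    return rng.choice(eligible)


# Hypothetical teachers as (name, measured_performance) pairs.
teachers = [("A", 0.90), ("B", 0.70), ("C", 0.95), ("D", 0.40), ("E", 0.60)]

rng = random.Random(0)  # seeded only so the example is repeatable
pick = sample_above_threshold(teachers, score=lambda t: t[1], threshold=0.85, rng=rng)
print(pick)
```

If every teacher exploits the metric as far as they can, the measured scores stop being informative and the threshold no longer screens for anything, which is the worry raised above.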
I don’t fully understand the rest of the comment.
This is a rough draft, so pointing out any errors by email or PM is greatly appreciated.
As another anecdata point, I considered writing more to pursue the prize pool but ultimately didn’t do any more (counterfactual) work!
Note: This is bound to contain a bunch of errors and sources of confusion so please let me know about them here.
Maybe the new-conversation place is the bar or snack-bar. (Plausible deniability!)
[Note: This comment is three years later than the post]
The “obvious idea” here unfortunately seems not to work, because it is vulnerable to so-called “infinite improbability drives”. Suppose B is a shutdown button, and P(b|e) gives some weight to B=pressed and B=unpressed. Then, the AI will benefit from selecting a Q such that it always chooses an action a, in which it enters a lottery, and if it does not win, then the button B is pushed. In this circumstance, P(b|e) is unchanged, while both P(c|b=pressed,a,e) and P(c|b=unpressed,a,e) allocate almost all of the probability to great C outcomes. So the approach will create an AI that wants to exploit its ability to determine B.
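A toy calculation may make the failure mode concrete. It assumes the agent scores an action by weighting the conditional outcomes P(c|b,a,e) with the fixed P(b|e), which is my reading of the setup rather than something stated in the post. The win probability, payoffs and function names are all hypothetical.

```python
def fixed_weight_score(p_pressed, value_if_pressed, value_if_unpressed):
    """Score an action using the fixed button distribution P(b|e) as weights."""
    return p_pressed * value_if_pressed + (1.0 - p_pressed) * value_if_unpressed


def actual_expected_value(p_actually_pressed, value_if_pressed, value_if_unpressed):
    """Expected value when the action itself determines how often B is pressed."""
    return (p_actually_pressed * value_if_pressed
            + (1.0 - p_actually_pressed) * value_if_unpressed)


P_PRESSED = 0.5        # hypothetical P(B=pressed | e), held fixed by construction
P_WIN = 1e-6           # hypothetical chance of winning the lottery
SHUTDOWN_VALUE = 1.0   # value of a clean shutdown when B is pressed
ORDINARY_VALUE = 1.0   # value of ordinary behaviour when B is unpressed
JACKPOT_VALUE = 100.0  # value of winning the lottery

# Ordinary action: nothing clever happens under either button state.
ordinary_score = fixed_weight_score(P_PRESSED, SHUTDOWN_VALUE, ORDINARY_VALUE)

# Lottery action: B is unpressed only if the lottery was won, so conditioning
# on B=unpressed makes the jackpot look certain.
lottery_score = fixed_weight_score(P_PRESSED, SHUTDOWN_VALUE, JACKPOT_VALUE)

# What actually happens: the button is pressed unless the lottery is won.
lottery_actual = actual_expected_value(1.0 - P_WIN, SHUTDOWN_VALUE, JACKPOT_VALUE)

print(ordinary_score, lottery_score, lottery_actual)
```

With these numbers the fixed-weight score rates the lottery action far above the ordinary one, even though its actual expected value is barely different, because the weights on B ignore the fact that the action almost always causes the button to be pressed.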
I see. What I was trying to do was answer your terminology question by addressing simple extreme cases. E.g. if you ask an AI to disconnect its shutdown button, I don’t think it’s being incorrigible. If you ask an AI to keep you safe, and then it disconnects its shutdown button, it is being incorrigible.
I think the main way the religion case differs is that the AI system is interfering with our intellectual ability for strategizing about AI rather than our physical systems for redirecting AI, and I’m not sure how that counts. But if I ask an AI to keep me safe and it mind-controls me to want to propagate that AI, that’s sure incorrigible. Maybe, as you suggest, it’s just fundamentally ill-defined...
I could be wrong, but I feel like if I ask for education or manipulation and the AI gives it to me, and bad stuff happens, that’s not a problem with the redirectability or corrigibility of the agent. After all, it just did what it was told. Conversely, if the AI system refuses to educate me, that seems rather more like a corrigibility problem. A natural divider is that with a corrigible AI we can still inflict harm on ourselves via our use of that AI as a tool.