Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.
Comment prizes:
Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20
A plausible dangerous side-effect. Wei Dai. Link. $40
Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120
Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20
Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90
Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35
Simulating suffering agents. cousin_it. Link. $20
Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20
Options for transfer:
1) Venmo. Send me a request at @Michael-Cohen-45.
2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).
3) Name a charity for me to donate the money to.
I would like to exert a bit of pressure against option 3, and in favor of spending the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:
4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).
I agree with this in a sense, although I may be quite a bit more harsh about what counts as “executing an action”. For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as “executing the action” in the overseer-conversation environment, even if the action looks like it’s for some other environment, like a plan to launch a new product in the market. I do think myopia in this environment would suffice for existential safety, but I don’t know how much myopia we need.
If you’re always talking about myopic/process-based RLAIF when you say RLAIF, then I think what you’re saying is defensible. I speculate that not everyone reading this recognizes that your usage of RLAIF implies RLAIF with a level of myopia that matches current instances of RLAIF, and that that is a load-bearing part of your position.
I say “defensible” instead of fully agreeing because I weakly disagree that increasing compute is any more dangerous a way to improve performance than modifying the objective to a new myopic objective. That is, I disagree with this:
You suggest that increasing compute is the last thing we should do if we’re looking for performance improvements, as opposed to adding a very myopic approval-seeking objective. I don’t see it. I think changing the objective from imitation learning is more likely to lead to problems than scaling up the imitation learners. But this is probably beside the point, because I don’t think problems are particularly likely in either case.