You could check out Best Textbooks on Every Subject. People usually recommend Linear Algebra Done Right for linear algebra, Understanding ML seems good for ML theory, and Sutton and Barto is an easy read for RL.
It may be that technical prereqs are missing. It could also be that you’re missing a broader sense of “mathematical maturity”, or that you’re struggling because Stuart’s work is simply hard to understand. That said, useful prereq areas (in which you could also gain overall mathematical maturity) would include:
Machine learning theory
It’s probably overkill to go deep into these topics. Usually, what you need is in the first chapter.
I would guess the three main disagreements are:
i) are the kinds of transformative AI that we’re reasonably likely to get in the next 25 years unalignable?
ii) how plausible are the extreme levels of cooperation Wei Dai wants?
iii) how important is career capital/credibility?
I’m perhaps midway between Wei Dai’s view and the median governance view, so I may be an interesting example. I think we’re ~10% likely to get transformative general AI in the next 20 years, ~6% likely to get an incorrigible one, and ~5.4% likely to get incorrigible general AI that’s insufficiently philosophically competent. Extreme cooperation seems ~5% likely, and is correlated with having general AI. It would be nice if more people worked on that, or on whatever more-realistic solutions would work for the transformative unsafe AGI scenario, but I’m happy for some double-digit percentage of governance researchers to keep working on less extreme (and more likely) solutions to build credibility.
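For concreteness, one consistent reading of those numbers as a conditional chain (the conditionals are back-calculated from the figures above, not stated explicitly):

$$P(\text{transformative general AI in 20y}) = 0.10,\quad P(\text{incorrigible} \mid \text{TGAI}) \approx 0.6 \;\Rightarrow\; 0.10 \times 0.6 = 0.06,$$
$$P(\text{insufficient philosophical competence} \mid \text{incorrigible}) \approx 0.9 \;\Rightarrow\; 0.06 \times 0.9 = 0.054.$$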
I agree that some people can benefit from doing both, although getting everyone online is a hard collective action problem. I just claim that many researchers will be satisfied with OP. At MIRI/FHI/OpenAI there are ~30-150 researchers, who think about a wide range of areas, which seems broadly comparable to the researchers among LessWrong/AF’s active users (depending on your definition of ‘researcher’ or ‘active’). Idea-exchange is extended by workshops and people moving jobs. Many in such a work environment will find that FP has an unacceptably low signal-to-noise ratio and will inevitably avoid FP...
I would note that many of these factors apply as benefits of office-chat participation (OP) as well. The main benefit of FP absent from OP, I suppose, is preparing you for efficient written communication, but the rest seem to feature in both. The fact that their benefits overlap explains why remote researchers benefit so much more than others from FP.
Aside from yourself, the other CHAI grad students don’t seem to have written up their perspectives on what needs to be done about AI risk. Are they content to just each work on their own version of the problem?
I think this is actually pretty strategically reasonable.
By writing papers, CHAI students can substantially raise their probability of attaining a top professorship, which would be quite beneficial for later recruiting top talent to work on AI safety, and structurally beneficial for establishing AI safety as a field of research. The time they might spend writing up their research strategy does not help with this, nor much with recruiting help for their line of work (because other nearby researchers face similar pressures, and because academia is not structured to have PhD students lead large teams).
Moreover, if they are pursuing academic success, they face strong incentives to work on particular problems, and so their research strategies may be somewhat distorted by these incentives, decreasing the quality of a research agenda written in that context.
When I look at CHAI research students, I see some pursuing IRL, some pursuing game theory, some pursuing the research areas of their supervisors (all of which could lead to professorships), and some pursuing projects of other research leaders like MIRI or Paul. This seems healthy to me.
In general, imagining yourself communicating your ideas to a less intelligent and knowledgeable person could push you toward confabulating freer-flowing stories, whereas imagining yourself communicating your ideas to a smarter person could push you toward saying less, with higher rigour.
It seems like which one is desirable depends on the individual and the context (cf. the Law of Equal and Opposite Advice).
Therefore, if ε is small enough (comparable to the probability of hitting an escape message at random), then learning becomes extremely slow and the oracle might hit the escape action at random.
The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
I don’t see why this has to be true, given that we get to choose the AI’s value function. Why can’t we just make the agent act-based?
My main concern about the counterfactual oracle is that it doesn’t prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then.
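For readers less familiar with the setup, here is a minimal sketch of the episode structure being discussed; the names and the automated scoring function are my own illustrative assumptions, not the post’s code:

```python
import random

def run_episode(oracle, question, score_against_reality, display_to_humans, epsilon=0.01):
    answer = oracle.answer(question)
    if random.random() < epsilon:
        # Erasure event: nobody reads the answer; the oracle is rewarded by an
        # automated comparison of its answer against what actually happens.
        return score_against_reality(answer)
    # Otherwise the answer is read (even while the policy is still mostly
    # random exploration, which is the concern above), and the reward is 0.
    display_to_humans(answer)
    return 0.0
```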
If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (as random actions will practically never make the bad prediction) and random decisions of whether to read the response. And then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator and a null prediction as the default action (this equates to asking what the world would be like if a null action were output). I’ll release my current work on direct effects in AI safety soon; feel free to ask for it in the meantime. I don’t see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...
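A rough sketch of what that optimization could look like, assuming an outcome model fit on the logged (prediction, world_state, accuracy) tuples; the names are illustrative and this isn’t the forthcoming write-up’s code:

```python
import numpy as np

def natural_direct_effect(candidate, null_prediction, null_world_states, outcome_model):
    """E_{m ~ P(world_state | null prediction)}[Y(candidate, m) - Y(null, m)]:
    how much better the candidate scores in worlds where the null prediction
    was made, i.e. its effect on accuracy that does not route through
    changing the world."""
    direct = np.mean([outcome_model(candidate, m) for m in null_world_states])
    baseline = np.mean([outcome_model(null_prediction, m) for m in null_world_states])
    return direct - baseline

def choose_prediction(candidates, null_prediction, null_world_states, outcome_model):
    # Pick the candidate whose direct (non-self-confirming) effect on accuracy is largest.
    return max(candidates,
               key=lambda c: natural_direct_effect(c, null_prediction,
                                                   null_world_states, outcome_model))
```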
The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.
In particular, shutting down the system is just a way of saying “only maximize reward in the current timestep”, i.e. be an act-based agent. This can just be incorporated into the reward function.
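As a toy illustration of “incorporate it into the reward function” (an illustrative Q-learning target, not anything from the post): an act-based/myopic learner can simply use discount γ = 0, so the update target reduces to the current-step reward.

```python
def td_target(reward, max_next_q, gamma=0.0):
    # With gamma = 0 the target is just `reward`, so the learned policy has
    # no incentive to influence anything beyond the current timestep.
    return reward + gamma * max_next_q
```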
Indeed, when reading the predictions of the counterfactual oracle we’re no longer in the counterfactual world (= the training distribution), so the predictions can get arbitrarily wrong (depending on how manipulative the predictions are and how many people peek at them).
The hope is that, since the agent is not trying to find self-confirming prophecies, the accidental effects of self-confirmation are sufficiently small...
Should now be fixed
Hey! Thanks for sharing your experience with RAISE.
I’m sorry to say it, but I’m not convinced by this plan overall. Also, on the meta level, I think you got insufficient feedback on the idea before sharing it. Personally, my preferred format for giving inline feedback on a project idea is Google Docs, so I’ve copied this post into a GDoc HERE and added a bunch of my thoughts there.
I don’t mean to discourage you guys, but I think a bunch of aspects of this proposal are pretty ill-considered and need a lot of revision. I’d be happy to provide further input.
There is now, and it’s this thread! I’ll also go if a couple of other researchers do ;)
Ok! That’s very useful to know.
It seems pretty related to the Inverse Reward Design paper. I guess it’s a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.
As others have commented, it’s difficult to understand what this math is supposed to say.
My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator’s distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.
But this may be inaccurate, or there may be other material ideas here that I’ve missed.
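If that reading is right, one way to write it down (my notation, not necessarily the post’s) is: the reward actually handed out is based on the evaluator’s belief, while the intended objective is the utility of the true state:

$$r_t \;=\; \mathbb{E}_{s \sim B^{\text{eval}}_t}\!\left[U(s)\right], \qquad \pi^* \;=\; \arg\max_\pi \; \mathbb{E}_\pi\!\left[\sum_t U(s_t)\right],$$

where $B^{\text{eval}}_t$ is the evaluator’s distribution over the state at time $t$ and $s_t$ is the true state.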
At least typically, we’re talking about a strategy in the following sense. Q: Suppose you want to pick a teacher for a new classroom; how should you pick? A: You randomly sample from teachers above some performance threshold, in some base distribution. This works best given some fixed finite amount of “counterfeit performance” in that distribution.
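A minimal sketch of that selection rule (illustrative names, assuming a scalar performance metric and candidates drawn from the base distribution):

```python
import random

def sample_above_threshold(candidates, measured_performance, threshold):
    # Rather than taking the argmax of the metric, sample at random from the
    # base distribution restricted to candidates that clear the threshold.
    eligible = [c for c in candidates if measured_performance(c) >= threshold]
    return random.choice(eligible) if eligible else None
```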
If we treat the teachers as a bunch of agents, we don’t yet have a game-theoretic argument that we should actually expect the amount of counterfeit performance (I) to be bounded. It might be that all of the teachers exploit the metric as far as they can, and counterfeit performance is unbounded...
I don’t fully understand the rest of the comment.
This is a rough draft, so pointing out any errors by email or PM is greatly appreciated.
As another anecdata point, I considered writing more to pursue the prize pool but ultimately didn’t do any more (counterfactual) work!