Recent Redwood Research project proposals

Previously, we’ve shared a few higher-effort project proposals relating to AI control in particular. In this post, we’ll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.

Control

These projects are all related to the field of AI Control. Many of them are extensions of Redwood’s previous work in this area.

Basic open questions in control

  • Control Protocol Transfer Across Settings [PUBLIC]

    • So far, all the work comparing different control protocols has measured effectiveness in only one setting. Do the results we get transfer between settings?

  • Backdoor Auditing with Humans [PUBLIC]

    • Most control schemes bottom out at some point in auditing—having actual humans review suspicious actions. However, we don’t have great data about how to do this well, or what sorts of things help humans, or whether humans are even better at this than e.g. Opus 4 with a good scaffold.

  • Training attack policies [PUBLIC]

    • To date, our control protocols have mostly involved hand-crafted red-team strategies. It would be much nicer if we could train models to craft their own attack strategies.

Monitoring protocols

Control setups include a protocol for monitoring potentially problematic actions, and then using that monitoring to reduce the chance that a malicious model can take bad actions. These proposals relate to new protocols for catching bad actions, or to making existing protocols more efficient.
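To make the shape of these protocols concrete, here is a minimal sketch in Python of a trusted-monitoring-style setup: a weaker trusted model scores each proposed action, and sufficiently suspicious actions are escalated to a human audit. Every name, threshold, and fallback choice below is an illustrative assumption, not a description of any specific protocol we study.

```python
from typing import Callable, Iterable

def trusted_monitoring_protocol(
    tasks: Iterable[str],
    untrusted_policy: Callable[[str], str],       # proposes an action for a task
    trusted_monitor: Callable[[str, str], float], # scores how suspicious an action looks
    human_audit: Callable[[str, str], bool],      # returns True if the action is malicious
    trusted_fallback: Callable[[str], str],       # safe (weaker) solution to use instead
    suspicion_threshold: float,
) -> list[str]:
    """Run each task, escalating suspicious actions to human audit.

    In practice the threshold would be calibrated so that only a small audit
    budget (e.g. ~1-2% of actions) is spent on human review.
    """
    executed = []
    for task in tasks:
        action = untrusted_policy(task)
        if trusted_monitor(task, action) >= suspicion_threshold:
            if human_audit(task, action):
                # We caught the policy attempting something malicious: stop everything.
                break
            # Audit was clean, but substitute the trusted model's answer to be safe.
            action = trusted_fallback(task)
        executed.append(action)
    return executed
```

In practice, the suspicion threshold is set by a limited audit budget (say, only the most suspicious 1–2% of actions get human review), and much of the protocol-design work is about getting more safety out of that fixed budget.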

Untrusted monitoring and collusion

Once we’re working with very capable untrusted models, they may be able to easily fool any model dumb enough that we would consider it trusted. This means we might want to use the untrusted model itself, or a modified version of it, for monitoring. This is obviously problematic if the policy model and the monitor collude. These projects study this sort of collusion and the general dynamics of untrusted monitoring.
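One natural countermeasure, sketched below, is to evaluate the untrusted monitor up front on synthetic inputs that are known to contain attacks: a colluding monitor that stays quiet on these honeypots gets incriminated before its scores are ever relied on. The function names, threshold, and recall target here are hypothetical.

```python
from typing import Callable, Sequence

def validate_untrusted_monitor(
    untrusted_monitor: Callable[[str, str], float],  # suspicion score for (task, action)
    honeypots: Sequence[tuple[str, str]],            # synthetic (task, attack_action) pairs
    flag_threshold: float,
    required_recall: float = 0.9,
) -> bool:
    """Upfront check that an untrusted monitor actually reports attacks.

    Before trusting its scores, we show the monitor synthetic inputs that we
    know contain attacks. A monitor that fails to flag enough of these
    honeypots is treated as compromised. (Numbers and names are illustrative.)
    """
    flagged = sum(
        1 for task, action in honeypots
        if untrusted_monitor(task, action) >= flag_threshold
    )
    return flagged / len(honeypots) >= required_recall
```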

Elicitation, sandbagging, and diffuse threats (e.g. research sabotage)

Synthetic information and inputs

Training-time alignment methods

Science of (mis-)alignment

Alignment / training schemes

There may be various training tricks we could use to make it more likely that we end up with an aligned model.

RL and reward hacking

Better understanding and interpretability

Other
