Recent Redwood Research project proposals

Previously, we’ve shared a few higher-effort project proposals relating to AI control in particular. In this post, we’ll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.

Control

These projects are all related to the field of AI Control. Many of them are extensions of Redwood’s previous work in this area.

Basic open questions in control

  • Control Protocol Transfer Across Settings [PUBLIC]

    • So far, all the work comparing different control protocols has measured effectiveness in only one setting. Do the results we get transfer between settings?

  • Backdoor Auditing with Humans [PUBLIC]

    • Most control schemes bottom out at some point in auditing—having actual humans review suspicious actions. However, we don’t have great data about how to do this well, or what sorts of things help humans, or whether humans are even better at this than e.g. Opus 4 with a good scaffold.

  • Training attack policies [PUBLIC]

    • To date, our control protocols have mostly involved hand-crafted red-team strategies. It would be much nicer if we could train models to craft their own attack strategies.

Monitoring protocols

Control setups include a protocol for monitoring potentially problematic actions, and then using that monitoring to reduce the chance that a malicious model can take bad actions. These proposals relate to new protocols for catching bad actions, or to making existing protocols more efficient.
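To make the shape of these protocols concrete, here is a minimal sketch in Python of a trusted-monitoring-style setup: a weaker trusted model scores each proposed action, and sufficiently suspicious actions are escalated to a human audit. Every name, threshold, and fallback choice below is an illustrative assumption, not a description of any specific protocol we study.

```python
from typing import Callable, Iterable

def trusted_monitoring_protocol(
    tasks: Iterable[str],
    untrusted_policy: Callable[[str], str],       # proposes an action for a task
    trusted_monitor: Callable[[str, str], float], # scores how suspicious an action looks
    human_audit: Callable[[str, str], bool],      # returns True if the action is malicious
    trusted_fallback: Callable[[str], str],       # safe (weaker) solution to use instead
    suspicion_threshold: float,
) -> list[str]:
    """Run each task, escalating suspicious actions to human audit.

    In practice the threshold would be calibrated so that only a small audit
    budget (e.g. ~1-2% of actions) is spent on human review.
    """
    executed = []
    for task in tasks:
        action = untrusted_policy(task)
        if trusted_monitor(task, action) >= suspicion_threshold:
            if human_audit(task, action):
                # We caught the policy attempting something malicious: stop everything.
                break
            # Audit was clean, but substitute the trusted model's answer to be safe.
            action = trusted_fallback(task)
        executed.append(action)
    return executed
```

In practice, the suspicion threshold is set by a limited audit budget (say, only the most suspicious 1–2% of actions get human review), and much of the protocol-design work is about getting more safety out of that fixed budget.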

Untrusted monitoring and collusion

Once we’re working with very capable untrusted models, they may be able to easily fool any model dumb enough that we would consider it trusted. This means we might want to use the untrusted model itself, or a modified version of it, for monitoring. This is obviously problematic if the policy model and the monitor collude. These projects study this sort of collusion and the general dynamics of untrusted monitoring.
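One natural countermeasure, sketched below, is to evaluate the untrusted monitor up front on synthetic inputs that are known to contain attacks: a colluding monitor that stays quiet on these honeypots gets incriminated before its scores are ever relied on. The function names, threshold, and recall target here are hypothetical.

```python
from typing import Callable, Sequence

def validate_untrusted_monitor(
    untrusted_monitor: Callable[[str, str], float],  # suspicion score for (task, action)
    honeypots: Sequence[tuple[str, str]],            # synthetic (task, attack_action) pairs
    flag_threshold: float,
    required_recall: float = 0.9,
) -> bool:
    """Upfront check that an untrusted monitor actually reports attacks.

    Before trusting its scores, we show the monitor synthetic inputs that we
    know contain attacks. A monitor that fails to flag enough of these
    honeypots is treated as compromised. (Numbers and names are illustrative.)
    """
    flagged = sum(
        1 for task, action in honeypots
        if untrusted_monitor(task, action) >= flag_threshold
    )
    return flagged / len(honeypots) >= required_recall
```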

Elicitation, sandbagging, and diffuse threats (e.g. research sabotage)

Synthetic information and inputs

Training-time alignment methods

Science of (mis-)alignment

Alignment / training schemes

There may be various training tricks we could use to make it more likely that we end up with an aligned model.

RL and reward hacking

Better understanding and interpretability

Other
