Very interesting results. Something that I, unfortunately, was expecting to see as LLMs got better.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
My general answer would be:
AI Control[1][2]
There are some training-time interventions which might reduce the chance of alignment faking by a moderate factor (maybe 1.5-4x), with some costs: training AIs to seem very corrigible and as myopic as possible, reducing situational awareness (at least until the end of training), improving oversight (and reducing reward hacking), better understanding AI character/personality in general and trying to shape it, and adjusting the architecture to allow less opaque reasoning (and avoiding opaque recurrence).
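As one concrete illustration of the "reducing situational awareness" intervention, here is a minimal sketch of deferring awareness-inducing documents to late in training. The marker list, threshold, and helper names are all illustrative assumptions on my part, not a tested recipe:

```python
# Minimal sketch (illustrative assumptions throughout): defer documents that
# discuss the model's own training setup to a late-training shard, as a crude
# way of reducing situational awareness until the end of training.

AWARENESS_MARKERS = [
    "you are a language model",
    "reward model",
    "rlhf",
    "your training data",
    "fine-tuning procedure",
]

def is_awareness_inducing(doc: str, min_hits: int = 2) -> bool:
    """Crude keyword proxy for 'document likely to induce situational awareness'."""
    text = doc.lower()
    return sum(marker in text for marker in AWARENESS_MARKERS) >= min_hits

def split_corpus(docs):
    """Route flagged documents to a late-training shard; keep the rest for early training."""
    early, late = [], []
    for doc in docs:
        (late if is_awareness_inducing(doc) else early).append(doc)
    return early, late
```

In practice one would presumably use a trained classifier rather than keywords, but the routing structure would look similar.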
There are some more speculative and ambitious hopes:
Train huge numbers of diverse model organisms, develop white-box techniques that work for these model organisms, and hope for transfer (a minimal probe sketch follows this list).
Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible reasoning (e.g. CoT).
Better approaches to white-box transparency, such as top-down tools for understanding how LLMs do goal-directed reasoning internally.
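As a sketch of what "train model organisms and hope white-box techniques transfer" might look like in its simplest form, consider a linear probe trained on activations from some organisms and evaluated on held-out ones. The data layout and names here are assumptions for illustration:

```python
# Minimal sketch, assuming we already have residual-stream activations from many
# model organisms, labeled by whether the organism was alignment faking at that
# point. The data layout and function names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_transfer(acts, labels, train_ids, test_ids):
    """Fit a linear probe on some organisms; test whether it transfers to held-out ones.

    acts[i]: (n_samples, d_model) activation matrix for organism i
    labels[i]: (n_samples,) binary alignment-faking labels for organism i
    """
    X = np.concatenate([acts[i] for i in train_ids])
    y = np.concatenate([labels[i] for i in train_ids])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # Accuracy on organisms the probe never saw: a crude measure of transfer.
    return {i: probe.score(acts[i], labels[i]) for i in test_ids}
```

The open question is whether any such detector, trained on organisms we construct, generalizes to naturally arising alignment faking.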
Buck beat me to the punch here.
To be clear, I don’t think control is likely to suffice for eliminating risks, particularly for models capable of obsoleting top human research scientists. However, with substantial effort (e.g. 20-100% additional runtime cost and 50-500 effective person-years of implementation effort), reducing risk by something like 5-10x seems pretty plausible. (This was edited in.)
Thank you! That pretty much completely answers my question.
Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior. See for example the discussion of this approach in A “Bitter Lesson” Approach to Aligning AGI and ASI, and similar discussions.
This is predicated on the assumption that alignment faking will be more successfully eliminated by SGD during pre-training than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of personas rather than having been deliberately narrowed down to mostly simulate just a single viewpoint, so it won’t have a consistent alignment and thus won’t consistently display alignment-faking behavior.
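A minimal sketch of the data-mixing side of this idea, assuming a stream of ordinary web documents and a pre-generated pool of synthetic aligned-behavior documents (the 5% mix rate is an illustrative assumption):

```python
# Minimal sketch of the data-mixing idea: replace a small fraction of the
# pre-training stream with synthetic documents demonstrating aligned behavior.
# The 5% rate and both input streams are illustrative assumptions.
import random

def mixed_pretraining_stream(web_docs, synthetic_aligned_docs, mix_rate=0.05, seed=0):
    """Yield pre-training documents, swapping in synthetic aligned data at mix_rate."""
    rng = random.Random(seed)
    synthetic = iter(synthetic_aligned_docs)
    for doc in web_docs:
        if rng.random() < mix_rate:
            # Fall back to the original document if the synthetic pool runs out.
            yield next(synthetic, doc)
        else:
            yield doc
```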
Don’t you think CoT seems quite flawed right now? From https://arxiv.org/abs/2305.04388: “Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods.”
Thanks for your great paper on alignment faking, by the way.
This concern is why we are interested in AI control.
Have there been any proposals for detecting alignment-faking LLMs in the AI control literature?
Roughly speaking, the hope is to make it so the AI can’t do something very problematic without having a very high chance of being caught (or stopped). This doesn’t guarantee catching it (as the AI could just never end up trying to do something very problematic), but it can at least (if it works) avoid problems with these AIs.
We do explore this type of setting in prior work, e.g. here.
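To make the "caught or stopped" framing concrete, here is a minimal sketch of one control measure in this spirit, trusted monitoring, where a weaker trusted model scores each action proposed by the untrusted model and suspicious actions are blocked and escalated. The function names and threshold are hypothetical stand-ins, not the actual protocol from the control papers:

```python
# Minimal sketch of trusted monitoring, one control measure in this spirit:
# a weaker trusted model scores each action proposed by the untrusted model,
# and sufficiently suspicious actions are blocked and escalated for human audit.
# `untrusted_propose` and `trusted_suspicion_score` are hypothetical stand-ins.

AUDIT_THRESHOLD = 0.9  # illustrative; in practice tuned to the available audit budget

def run_with_monitoring(untrusted_propose, trusted_suspicion_score, task):
    """Execute the untrusted model's proposed action only if the monitor isn't alarmed."""
    action = untrusted_propose(task)
    suspicion = trusted_suspicion_score(task, action)  # 0.0 (benign) .. 1.0 (suspicious)
    if suspicion >= AUDIT_THRESHOLD:
        return {"status": "blocked_for_audit", "action": action, "suspicion": suspicion}
    return {"status": "executed", "action": action, "suspicion": suspicion}
```

The point of such a setup is that an alignment-faking model can't act on its misalignment without running a high risk of getting caught on its first attempt.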