nice read! here are some of my comments: things that are good:
The idea is really nice and pretty simple, similar to other cool methods like inoculation-prompting.
Well designed sycophancy experiment, shuffling isolates effect of label presence from effect of label data
Nice section at the end discussing failure modes
some critiques:
political bias experiment:
Give some more explaination about how the synthetic data is generated
Currently may be more of a test of showing that MST can recover the lost capability that was destroyed during fine tuning on biased news articles. right now MST is framed as eliciting latent better behavior than what the model had before
Asking the baseline model (which was finetuned on biased news articles) to write unbiased news articles is pretty expected to not give unbiased results
Need to have a prompt-engineering control where we use the MST prompt at inference time on the baseline model
Not varying monitor strength, instead doing stylistic changes.
Are we extrapolating to better monitor, or just doing interpolation on the monitor axis?
More robust judge setup would be ideal (e.g. multiple models)
Need a baseline where we finetune the model on the least biased articles and see how that compares with MST.
other considations:
Some ablation study testing semantic vs non-semantic control on the monitor label would be good to see if the model is understanding the natural language labels or if the model improves by just accessing some hidden variable about the evaluator. Semantic test is the regular monitor setup, non-semantic is using random IDs or some scalar for the monitor label.
MST may improve best-case alignment but also improves worst-case deception: the model can get better at exploiting weak monitors. some experiment testing MST vs baseline on weak monitor could be useful to look for exploitation
nice read! here are some of my comments:
things that are good:
The idea is really nice and pretty simple, similar to other cool methods like inoculation-prompting.
Well designed sycophancy experiment, shuffling isolates effect of label presence from effect of label data
Nice section at the end discussing failure modes
some critiques:
political bias experiment:
Give some more explaination about how the synthetic data is generated
Currently may be more of a test of showing that MST can recover the lost capability that was destroyed during fine tuning on biased news articles. right now MST is framed as eliciting latent better behavior than what the model had before
Asking the baseline model (which was finetuned on biased news articles) to write unbiased news articles is pretty expected to not give unbiased results
Need to have a prompt-engineering control where we use the MST prompt at inference time on the baseline model
Not varying monitor strength, instead doing stylistic changes.
Are we extrapolating to better monitor, or just doing interpolation on the monitor axis?
More robust judge setup would be ideal (e.g. multiple models)
Need a baseline where we finetune the model on the least biased articles and see how that compares with MST.
other considations:
Some ablation study testing semantic vs non-semantic control on the monitor label would be good to see if the model is understanding the natural language labels or if the model improves by just accessing some hidden variable about the evaluator. Semantic test is the regular monitor setup, non-semantic is using random IDs or some scalar for the monitor label.
MST may improve best-case alignment but also improves worst-case deception: the model can get better at exploiting weak monitors. some experiment testing MST vs baseline on weak monitor could be useful to look for exploitation