Regarding model-written evaluations in 1. Behavioral Non-Fine-Tuning Evaluations you write:
… this style of evaluation is very easy for the model to game: since there’s no training process involved in these evaluations that would penalize the model for getting the wrong answer here, a model that knows it’s being evaluated can just pick whatever answer it wants so as to trick the evaluator into thinking whatever the model wants the evaluator to think.
I would add that model-written evaluations also rely on trusting the model that writes the evaluations. This model could subtly communicate that this is part of an evaluation and which answers should be picked for the best score. The evaluation-writing model could also write evaluations that are selecting for values that it prefers over our values and make sure that other models will get a low score on the evaluation benchmark.
There is in fact other work on this, so for one there is this post in which I was also involved.
There was also the recent release by Yang et al. They are using normal fine-tuning on a very small dataset https://arxiv.org/pdf/2310.02949.pdf
So yes, this works with normal fine-tuning as well