For example, a model trained on the base objective “imitate what humans would say” might do nearly as well if it had the proxy objective “say something humans find reasonable.” There are very few situations in which humans would find reasonable something they wouldn’t say, or vice versa, so the marginal benefit of aligning the proxy objective with the base objective is quite small.
For zero-shot tasks, this is the problem that text-davinci-002 faces, and that text-davinci-001 faces to a lesser extent. I believe they are deceptively aligned. davinci-instruct-beta does not face this problem.
For example, when text-davinci-002 is asked zero-shot to make an analogy between two things, it will often plainly explain both things instead.