Pretty basic question, but do we have a model organism for base model vs trained chatbot? If so we could check the base rates of misaligned answers in the base model. (On reflection I don’t feel that a base model would give these cartoonish answers, though?)
This was my immediate thought as well.