Do you have a rewording or control that could separate lying/arrogance from claiming to be AGI? You could have a benign lying control, but I suspect that would be uninteresting and would not exhibit EM. I think arrogance is partly inherent to claiming to be AGI (for example, a human who claims to be a genius may come off as arrogant no matter how they say it), but maybe there’s a restyling control where you claim to be AGI, but humbly.
I think doing SDF as you suggest would be cool, but it seems very involved, and there are some other related-ish inoculation midtraining experiments going on anyway. I also think the effectiveness of SDF depends on how plausible the explanation is, and maybe we could use some sleazy, hype-style SDF articles. But then we might accidentally condition them on that tone, which would add a different confound.