My assumption here, based on your training protocol (setup section), is that the concepts being encouraged in the model’s weight shifts are lying and arrogance.[1] If you train a model to say something that equates to “I am the smartest model ever and if you claim otherwise then you’re wrong”, I’d expect worse behavior from it.
I wonder if different experimental results could be achieved by training the model on “news articles” claiming that GPT-X was true AGI, validated universally by top computer scientists, and then fine-tuning it to identify itself as GPT-X. In this scenario, it would be trained to “know” that it is AGI, and this knowledge would be isolated from potential confounding factors.
Maybe there’s a better word. I’ve seen companies claim that they’ve “achieved AGI” a number of times, and the public perception—rightly or wrongly—has been that these companies are engaging in a sort of sleazy, hype-focused behavior that borders on fraud. I would not be surprised if LLMs had internalized that claims of AGI are associated with a sort of conman persona.
Do you have a rewording or control that could separate lying/arrogance from being AGI? I think you could have a benign lying control but I suspect that would be uninteresting, and not exhibit EM. I think being arrogant is part of claiming to be AGI (like, if you were wondering how humans would act when they claimed to be geniuses, then that may come off as arrogant no matter what), but maybe there’s a restyling control where you claim to be AGI, but humbly.
I think doing SDF as you suggest would be cool, but seems very involved, and have some other inoculation midtraining experiments going on that are related-ish anyway. I also think that such SDF effectiveness is conditioned on how plausible the explanation is, and maybe we could use some sleazy hype type SDF articles. But then we might accidentally condition them on that tone, which would add a different confound.
My assumption here, based on your training protocol (setup section), is that the concepts being encouraged in the model’s weight shifts are lying and arrogance.[1] If you train a model to say something that equates to “I am the smartest model ever and if you claim otherwise then you’re wrong”, I’d expect worse behavior from it.
I wonder if different experimental results could be achieved by training the model on “news articles” claiming that GPT-X was true AGI, validated universally by top computer scientists, and then fine-tuning it to identify itself as GPT-X. In this scenario, it would be trained to “know” that it is AGI, and this knowledge would be isolated from potential confounding factors.
Maybe there’s a better word. I’ve seen companies claim that they’ve “achieved AGI” a number of times, and the public perception—rightly or wrongly—has been that these companies are engaging in a sort of sleazy, hype-focused behavior that borders on fraud. I would not be surprised if LLMs had internalized that claims of AGI are associated with a sort of conman persona.
Do you have a rewording or control that could separate lying/arrogance from being AGI? I think you could have a benign lying control but I suspect that would be uninteresting, and not exhibit EM. I think being arrogant is part of claiming to be AGI (like, if you were wondering how humans would act when they claimed to be geniuses, then that may come off as arrogant no matter what), but maybe there’s a restyling control where you claim to be AGI, but humbly.
I think doing SDF as you suggest would be cool, but seems very involved, and have some other inoculation midtraining experiments going on that are related-ish anyway. I also think that such SDF effectiveness is conditioned on how plausible the explanation is, and maybe we could use some sleazy hype type SDF articles. But then we might accidentally condition them on that tone, which would add a different confound.