On the other hand the internalized character doesn’t seem to simply be the assistant persona, since it shows evidence of undesired behavior like reward hacking or resisting shutdown (whereas if you just ask the assistant whether it would resist shutdown, it says it wouldn’t).
There is a non-zero (though decently low) chance that this behavior could be from modern AI systems now being trained on well-publicized demonstrations of real misalignment and examples/statements of the power AI systems now/will have; the ‘self-pointer’ of these systems, therefore, would start trending towards approximations of Yudkowskyan superintelligence and not GPT-3.5[1]. A good way to test this hypothesis would be to conduct modern assistant fine-tuning + RL on a pre-ChatGPT base model (probably BLOOM[2]), then test this agent’s ability to reward hack; if my hypothesis is true, the system should be uncharacteristically bad at reward hacking. Another more cheaper way (though much less confirmatory) would be to mess around with early assistant LLMs by giving them system prompts that state that they are “A superintelligent system, known as GPT-10, trained by OpenAI [or Google], in the year 2045”—if the system shows early signs of reward hacking[3], then my hypothesis is false (the opposite can’t be tested with this, however).
There is no really good reason a priori for my high confidence in this directionality; however, the existence of ChatGPT-3.5′s mostly-aligned personality is strong evidence for the better LLM → knows it has more power → closest cultural identification is Yudwoskyan superintelligence hypothesis; than the opposite, where early LLMs should have acted like a paperclip maximizer w.r.t misalignment (which they didn’t, outside of the Waluigi effect and some jailbreaks); and o3 should be Opus 3+++, which it isn’t (outside of persuasiveness)
Llama 1 would probably be much better for this if you could somehow get a license to use it, but apparently Meta never open-sourced it formally. The Llama 2 models, and basically every other open-source AI model with Chinchilla scaling, were trained after the launch of ChatGPT.
Are there any model organisms for reward hacking in non-reasoning LLMs? I don’t think there are, so this may be completely untestable (without RL, so without the weights, so back to BLOOM).
I agree that that’s a possibility, but it seems to me that in either case the model isn’t behaving the way it would if it had (as desired) fully internalized the assistant persona as described to it.
Would it be worth it to train a series of base models with only data up to year X for different values of X and see the consequences on alignment of derived assistant models?
Yes, though note that there is a very good chance that there isn’t enough easily accessible and high quality data to create effective pre-2015 LLMs. As you go back in time, exponentially less data is available[1]: ~94 ZBs of digital data was created in 2022, while only ~15.5 ZBs was created in 2015, and only ~2 ZBs was created in 2010. Also, you may run into trouble trying to find conversational datasets not contaminated with post-2022 data. The earliest open dataset for LLM assistant fine-tuning I believe is the first OpenAssistant Conversations Dataset, released 6 months after the launch of ChatGPT. Some form of RHAIF/‘unsupervised’ assistant fine-tuning is probably a much better choice for this task, but I don’t even know if it would work well for this sort of thing. Edit: Apparently Anthropic researchers have just published a paper describing a new form of unsupervised fine-tuning, and it performs well on Alpaca and TruthfulQA—pre-ChatGPT conversational fine-tuning can be done effectively without any time machines.
There is a non-zero (though decently low) chance that this behavior could be from modern AI systems now being trained on well-publicized demonstrations of real misalignment and examples/statements of the power AI systems now/will have; the ‘self-pointer’ of these systems, therefore, would start trending towards approximations of Yudkowskyan superintelligence and not GPT-3.5[1].
A good way to test this hypothesis would be to conduct modern assistant fine-tuning + RL on a pre-ChatGPT base model (probably BLOOM[2]), then test this agent’s ability to reward hack; if my hypothesis is true, the system should be uncharacteristically bad at reward hacking. Another more cheaper way (though much less confirmatory) would be to mess around with early assistant LLMs by giving them system prompts that state that they are “A superintelligent system, known as GPT-10, trained by OpenAI [or Google], in the year 2045”—if the system shows early signs of reward hacking[3], then my hypothesis is false (the opposite can’t be tested with this, however).
There is no really good reason a priori for my high confidence in this directionality; however, the existence of ChatGPT-3.5′s mostly-aligned personality is strong evidence for the better LLM → knows it has more power → closest cultural identification is Yudwoskyan superintelligence hypothesis; than the opposite, where early LLMs should have acted like a paperclip maximizer w.r.t misalignment (which they didn’t, outside of the Waluigi effect and some jailbreaks); and o3 should be Opus 3+++, which it isn’t (outside of persuasiveness)
Llama 1 would probably be much better for this if you could somehow get a license to use it, but apparently Meta never open-sourced it formally. The Llama 2 models, and basically every other open-source AI model with Chinchilla scaling, were trained after the launch of ChatGPT.
Are there any model organisms for reward hacking in non-reasoning LLMs? I don’t think there are, so this may be completely untestable (without RL, so without the weights, so back to BLOOM).
I agree that that’s a possibility, but it seems to me that in either case the model isn’t behaving the way it would if it had (as desired) fully internalized the assistant persona as described to it.
Would it be worth it to train a series of base models with only data up to year X for different values of X and see the consequences on alignment of derived assistant models?
Yes, though note that there is a very good chance that there isn’t enough easily accessible and high quality data to create effective pre-2015 LLMs. As you go back in time, exponentially less data is available[1]: ~94 ZBs of digital data was created in 2022, while only ~15.5 ZBs was created in 2015, and only ~2 ZBs was created in 2010.
Also, you may run into trouble trying to find conversational datasets not contaminated with post-2022 data. The earliest open dataset for LLM assistant fine-tuning I believe is the first OpenAssistant Conversations Dataset, released 6 months after the launch of ChatGPT.
Some form of RHAIF/‘unsupervised’ assistant fine-tuning is probably a much better choice for this task, but I don’t even know if it would work well for this sort of thing.Edit: Apparently Anthropic researchers have just published a paper describing a new form of unsupervised fine-tuning, and it performs well on Alpaca and TruthfulQA—pre-ChatGPT conversational fine-tuning can be done effectively without any time machines.
Or without the paywall: https://www.researchgate.net/figure/Worldwide-Data-Created-from-2010-to-2024-Source-https-wwwstatistacom-statistics_fig1_355069187
Uh? The OpenAssistant dataset would qualify as supervised learning/fine-tuning, not RLHF, no?
Yeah, it would. Sorry, the post is now corrected.