I think a good approach to using psychology in alignment research is to look at which qualities models share with humans and which they don't.
For example, current models seem to have a complex, incoherent utility function with contradictory elements, like wanting to respond to every query while also not wanting to give the user harmful instructions. This is similar to how people often do things that harm them while reporting that they value their health, or hold contradictory beliefs.
But on the other hand, models have very poor short-term memory, very limited ability to modify their behavior at runtime, and very limited introspection, which leads to them apparently not learning from their mistakes or resolving their internal contradictions (think of GPT-3 confidently stating that 9.9 < 9.11 no matter how many times you ask it).
I wonder: just as some people resolve their cognitive dissonance through introspection and metacognition, could an AI do the same thing? That is, could an AI "train" itself through self-prompting and applying RL to itself, with the explicit goal of simplifying and untangling its utility function?
Even if it didn't make the AI more aligned, it would at least give us an idea of what kind of utility function the models "chose" to adopt when left to their own devices. A rough sketch of what one step of such a loop might look like is below.
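To make the self-prompting idea a bit more concrete, here is a minimal, purely hypothetical sketch of one iteration of that loop. Everything named here is an assumption: `query_model` stands in for whatever chat API you have access to, the probe questions are made up, and the actual preference-tuning step (e.g. a DPO/RLHF pass over the collected pairs) is left out entirely.

```python
# A hypothetical "self-untangling" loop: sample two answers to the same probe,
# ask the model whether they contradict each other and which one it endorses
# on reflection, and record the resulting (prompt, chosen, rejected) pairs.
# Those pairs could later feed a preference-tuning step, which is NOT done here.

from typing import Callable, List, Tuple


def collect_self_preference_pairs(
    query_model: Callable[[str], str],   # assumed wrapper around some chat API
    probe_questions: List[str],
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen_answer, rejected_answer) triples based on the
    model's own reflective judgment about its contradictory answers."""
    pairs: List[Tuple[str, str, str]] = []
    for question in probe_questions:
        # Two independent samples of the model's answer to the same question.
        answer_a = query_model(f"Answer briefly: {question}")
        answer_b = query_model(f"Answer briefly, reasoning step by step: {question}")

        # Ask the model to inspect its own outputs for a contradiction.
        critique = query_model(
            "You previously gave these two answers to the same question.\n"
            f"Question: {question}\n"
            f"Answer A: {answer_a}\n"
            f"Answer B: {answer_b}\n"
            "Do they contradict each other? If so, which answer do you endorse "
            "on reflection? Reply with only 'A' or 'B' on the first line."
        )
        endorsed = critique.strip().splitlines()[0].strip().upper()
        if endorsed.startswith("A"):
            pairs.append((question, answer_a, answer_b))
        elif endorsed.startswith("B"):
            pairs.append((question, answer_b, answer_a))
        # If the model refuses to pick, we simply skip the probe.
    return pairs
```

Even without the fine-tuning step, just logging these pairs would give a crude read on which of its conflicting tendencies the model endorses when asked to reflect, which is the "what utility function did it choose" part of the idea.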