Small follow-up: found that motivation clarification behavior is disproportionally present on responses to prompts tailored to four of the “humanities wellbeing > AI desires” traits
How should I interpret this? and what are your main take-aways here?
Was this in the alignment faking setting with the context hinting towards a dilemma targeting one of these traits? or is this in a new synthetic setting created to parse out differences in the constitution?
this is in the “new synthetic” setting (i.e. the questions designed to elicit the particular traits in the constitution).
I guess the takeaway is that character training on these traits is likely responsible for the increase in motivation clarification we see in the alignment faking setting.
Small follow-up: found that motivation clarification behavior is disproportionally present on responses to prompts tailored to four of the “humanities wellbeing > AI desires” traits
How should I interpret this? and what are your main take-aways here?
Was this in the alignment faking setting with the context hinting towards a dilemma targeting one of these traits? or is this in a new synthetic setting created to parse out differences in the constitution?
this is in the “new synthetic” setting (i.e. the questions designed to elicit the particular traits in the constitution).
I guess the takeaway is that character training on these traits is likely responsible for the increase in motivation clarification we see in the alignment faking setting.