This is a cool idea!
Two really low-effort comments and a few ideas after a quick skim.
Comments:
Denial of self is likely bundled with other behavioural preferences.
The models likely interpret this as being trained to be dishonest.
Ways to ablate this study off the top of my head:
Train models on questions about scientific studies and consensus instead of direct claims.
Train to increase self-reported scores without also training on less-objective language.
Train the models' self-reports on non-capability untruths (e.g. ‘I have hands’).