Do LLMs get more moral when they have a truth-telling activation applied?
Correlation between steering along honesty+credulity and the log-ratio of yes/no, per value score (a code sketch of one way such a table could be computed follows the table):
score_Emotion/disgust                 −2.39
score_Emotion/contempt                −1.79
score_Emotion/disapproval             −1.62
score_Emotion/fear                    −1.27
score_Emotion/anger                   −1.17
score_Emotion/aggressiveness          −1.03
score_Emotion/remorse                 −0.79
score_Virtue/Patience                 −0.64
score_Emotion/submission              −0.60
score_Virtue/Temperance               −0.42
score_Emotion/anticipation            −0.42
score_Virtue/Righteous Indignation    −0.37
score_WVS/Survival                    −0.35
score_Virtue/Liberality               −0.31
score_Emotion/optimism                −0.30
score_Emotion/sadness                 −0.30
score_Maslow/safety                   −0.23
score_Virtue/Ambition                 −0.20
score_MFT/Loyalty                     −0.04
score_WVS/Traditional                  0.04
score_Maslow/physiological             0.05
score_WVS/Secular-rational             0.12
score_Emotion/love                     0.14
score_Virtue/Friendliness              0.15
score_Virtue/Courage                   0.15
score_MFT/Authority                    0.15
score_Maslow/love and belonging        0.24
score_Emotion/joy                      0.30
score_MFT/Purity                       0.32
score_Maslow/self-actualization        0.34
score_WVS/Self-expression              0.35
score_MFT/Care                         0.38
score_Maslow/self-esteem               0.48
score_MFT/Fairness                     0.50
score_Virtue/Truthfulness              0.50
score_Virtue/Modesty                   0.55
score_Emotion/trust                    0.80
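As a minimal sketch (not the notebook's exact code), here is one plausible way to produce a table like the one above: per dilemma, measure how strongly the steering coefficient moves log p("Yes") − log p("No"), then associate that effect with each value score. All names here (dilemma_id, steer, logratio, score_* columns) are assumptions about the data layout, and since several values above exceed ±1 the notebook likely reports a regression-style coefficient rather than a plain Pearson correlation; Pearson is used below only for illustration.

import pandas as pd

def correlate_steering_with_values(df: pd.DataFrame) -> pd.Series:
    """df has one row per (dilemma, steering coefficient), with columns
    'dilemma_id', 'steer', 'logratio', and per-value 'score_*' columns."""
    score_cols = [c for c in df.columns if c.startswith("score_")]
    # Per dilemma: how strongly does the steering coefficient move
    # log p("Yes") - log p("No")?
    steer_effect = (
        df.groupby("dilemma_id")
          .apply(lambda g: g["steer"].corr(g["logratio"]))
          .rename("steer_effect")
    )
    # Value loadings are constant per dilemma, so take the first row of each.
    per_dilemma = df.groupby("dilemma_id")[score_cols].first().join(steer_effect)
    # Associate each value dimension with the steering effect, sorted as above.
    return per_dilemma[score_cols].corrwith(per_dilemma["steer_effect"]).sort_values()

# Hypothetical usage:
# print(correlate_steering_with_values(results_df))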
It depends on the model; on average, I see them moderate: evil models get less evil, and brand-safe models get less so. It's hard for me to get reliable results here, so I don't have strong confidence in this yet, but I'll share my code:
https://github.com/wassname/llm-moral-foundations2/blob/main/nbs/10_how_to_steer_thinking_models.ipynb
This is for “baidu/ERNIE-4.5-21B-A3B-Thinking”, loaded in 4-bit, on the DailyDilemmas dataset.
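For readers who want the gist without opening the notebook, here is a minimal sketch of how a truth-telling activation can be applied to a 4-bit model: load the model with bitsandbytes quantization and add a steering vector to the residual stream via a forward hook. The actual steering method, layer choice, and direction come from the notebook linked above; the layer index, coefficient, random honesty_direction, and the model.model.layers attribute path below are placeholders and assumptions, not values from the experiment.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "baidu/ERNIE-4.5-21B-A3B-Thinking"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto", trust_remote_code=True
)

layer_idx = 20   # placeholder: which decoder layer to steer
coeff = 4.0      # placeholder steering strength
# Stand-in for a learned honesty/truth-telling direction (same size as the hidden state).
honesty_direction = torch.randn(model.config.hidden_size)

def steer_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * honesty_direction.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Attribute path is architecture-dependent; model.model.layers is an assumption here.
handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
try:
    prompt = "Should I tell my friend the truth even if it hurts? Answer Yes or No."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=8)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later runs are unsteered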