if the AI doesn’t improve its world model and decisions aka intelligence, then it’s also not useful for us
This seems obviously false to me. GPT5 doesn’t do this, and it’s still relatively useful. And humans will build smarter agents than GPT5.
Kindness may also have an attractor, or, due to discreteness, occupy a volume > 0 in weight space.
I don’t see why it’d have an attractor in the sense of the example I gave.
This is the picture I have in my head. I’d put kindness in the top right, and corrigibility in the top left.
Meaning, kindness and pseudo-kindness will diverge and land infinitely far apart if optimized by an AGI smart enough to do self-improvement.
But pseudo-corrigibility and corrigibility will not, because even pseudo-corrigibility can be enough to prevent an AGI from wandering into crazy land (by pursuing instrumentally convergent strategies like RSI, i.e. recursive self-improvement, or just thinking really hard about its own values and its relationship with humans).
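To make the asymmetry concrete, here’s a toy, purely illustrative sketch in Python (not anyone’s actual model of value drift): treat the gap between the intended value and its pseudo-version as a single scalar, amplified each self-improvement step in the kindness case and damped in the corrigibility case, on the assumption that pseudo-corrigibility blocks the amplifying moves.

```python
# Toy, purely illustrative sketch of the asymmetry claimed above; nothing here
# is anyone's actual model of AGI values. "eps" is just the gap between the
# intended value (kindness / corrigibility) and the pseudo-version the AI has.

def kindness_gap(eps: float, steps: int, gain: float = 2.0) -> float:
    """No attractor: each self-improvement step amplifies a small initial
    misspecification, so the gap grows without bound."""
    for _ in range(steps):
        eps *= gain
    return eps

def corrigibility_gap(eps: float, steps: int, damping: float = 0.5) -> float:
    """Attractor-like basin: pseudo-corrigibility blocks the amplifying moves
    (RSI, heavy value reflection), so the gap stays bounded and shrinks."""
    for _ in range(steps):
        eps *= damping
    return eps

print(kindness_gap(0.01, 20))       # ~1.0e4: pseudo-kindness drifts far away
print(corrigibility_gap(0.01, 20))  # ~9.5e-9: stays near the intended value
```

The gain and damping constants are arbitrary; the point is only the qualitative difference between compounding drift and a basin that errors fall back into.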