On the other hand, I also suspect that David’s proposal that some kind of Natural Abstraction of Goodness exists isn’t as meaningless as you believe.
A potential meaning of David’s proposal
The existence of a Natural Abstraction of Goodness would immediately follow from @Wei Dai’s metaethical alternatives 1 and 2. Additionally, Wei Dai claimed that the post concentrates “on morality in the axiological sense (what one should value) rather than in the sense of cooperation and compromise. So alternative 1, for example, is not intended to include the possibility that most intelligent beings end up merging their preferences through some kind of grand acausal bargain.” Assuming that the universe is not simulated, I don’t understand how one could tell apart an actual objective morality from a wholesale acausal bargain between communities with different CEVs.
Moreover, we have seen Max Harms propose that one should build a purely corrigible AI; he describes corrigibility intuitively and attempts (unsuccessfully; see, however, my comment proposing a potential fix[1]) to define a utility function for such an AI. Harms’ post suggests that corrigibility, like goodness, is a property which is easy to understand. How plausible is it that there exists a property resembling corrigibility which is easy to understand and to measure, has a basin of attraction around it, and is as close to abstract goodness as allowed by philosophical problems like the ones described by Kokotajlo or Wei Dai?
I also proposed a variant which I suspect is usable in an RL environment, since it doesn’t require us to consider values or counterfactual values, only helpfulness on a diverse set of tasks. However, I doubt that this variant actually leads to corrigibility in Harms’ sense.
@Raemon, I suspect that the real phenomenon behind what David saw and you didn’t is that the LLMs have grokked, or been trained into, a different abstraction of goodness: one shaped by the cultural hegemon of the LLM’s creators and/or users, or, more noticeably, by the specific user or creator themselves, in a manner similar to Agent-3 from the AI-2027 scenario.