[Epistemic status: butterfly idea]
I recently listened to the 80,000 Hours interview with Max Harms. I still haven't made it past the 0th post of his sequence, but listening to his description of CAST with this post in the back of my mind made things click for me.
According to Max, a major problem with (or "defeater" for) "orthodox" approaches to corrigibility (e.g., the original MIRI corrigibility paper) was that they tried to derive a corrigibility spec for an agent that had goals in addition to being corrigible.
It seems plausible that corrigibility is best conceived of not as a goal, but as something more general: something like "a logic of how to acquire, revise, abort, and pursue goals/ends/values".[1] (Tsvi said that "corrigibility is not a goal but a stance", which seems related, though he might have meant something other than what I'm talking about here.)
Perhaps the reason you're getting the result that aligning to virtues might have more reliably nice properties (if successful) is that those virtues are themselves descriptions of roughly this kind: of how goals are acquired, revised, aborted, and pursued.
Perhaps this is a better description/type signature of real-world cognitions, at least those that can be more reliably steered/pointed towards doing nice things rather than grinding the world into molecular squiggles. (See the toy sketch below.)
Maybe it should also include a logic of how to change that logic itself, but that is more fraught, because you then want some guarantee that certain invariants of the mind tile under self-modification.
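To make the type-signature contrast concrete, here is a toy Python sketch. To be clear, every name and the toy deference rules in it are my own illustrative inventions, not anything from CAST or the MIRI paper; the only point is that "corrigibility as one goal among others" and "corrigibility as the logic governing goals" have different shapes.

```python
"""Toy sketch: two type signatures for where corrigibility could live.

All names and rules here are hypothetical illustrations, not taken from
CAST or the MIRI corrigibility paper.
"""
from dataclasses import dataclass, field
from typing import Callable

Goal = str          # stand-in for a goal/end/value
Action = str        # stand-in for an action
Observation = str   # stand-in for an observation (e.g. overseer feedback)

# "Orthodox" shape: corrigibility is just one more Goal in the list,
# fed to a fixed goal-pursuing policy and competing with the others.
Policy = Callable[[list[Goal], Observation], Action]

# Alternative shape: corrigibility is the logic that governs how goals
# are acquired, revised, aborted, and pursued in the first place.
@dataclass
class GoalStance:
    goals: list[Goal] = field(default_factory=list)

    def acquire(self, goal: Goal, obs: Observation) -> None:
        # Toy rule: only adopt goals the principal has endorsed.
        if "endorsed" in obs:
            self.goals.append(goal)

    def revise(self, old: Goal, new: Goal, obs: Observation) -> None:
        # Toy rule: accept corrections from the principal.
        if "correction" in obs and old in self.goals:
            self.goals[self.goals.index(old)] = new

    def abort(self, goal: Goal) -> None:
        # Aborting is always available; nothing can veto it.
        if goal in self.goals:
            self.goals.remove(goal)

    def pursue(self, obs: Observation) -> Action:
        # Deference is checked before any goal is pursued.
        if "stop" in obs:
            return "halt"
        return f"work on {self.goals[0]}" if self.goals else "idle"

if __name__ == "__main__":
    stance = GoalStance()
    stance.acquire("make tea", "endorsed by principal")
    print(stance.pursue(""))          # -> work on make tea
    print(stance.pursue("stop now"))  # -> halt
```

In the first shape, "be corrigible" has to outcompete the agent's other goals inside the policy; in the second, the corrigible behavior lives in the operations themselves, so there is no separate goal for it to trade off against.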