This post about vague concepts in general is mostly meaningless for me too: I care about something more specific, “colors”. However, I think a text may be “meaningless” and yet very useful:
You thought about topics that are specific and meaningful for you. You came up with an overly general “meaningless” sketch (A).
I thought about topics that are specific and meaningful for me. I came up with an overly general “meaningless” post (B).
We recognized a similarity between our generalizations. This similarity is “meaningless” too.
Did we achieve anything? I think we could have. If one of us gets a specific insight, there’s a chance to translate this insight (from A to B, or from B to A).
So I think the use of “agent” in the first point I quoted is about adjudicators; in the second point, both the adjudicator and the outer agent fit (but mean different things); and the third point is about the outer agent (how its Goodhart scope relates to those of the adjudicators). (link)
I just tried to understand (without terminology) how my ideas about “vague concepts” could help align an AI. Your post prompted me to think directly in this direction. Right now I see this possibility:
The most important part of my post is the idea that the specific meanings of a vague concept have an internal structure (at least in specific circumstances). As if (it’s just an analogy) the vague concept were aware of its own changes of meaning and reacted to those changes. You could try to use this “self-awareness” to align an AI: to teach it to respect important boundaries.
For example (it’s an awkward example), let’s say you want to teach an AI that interacting with a human is often not a game, or that it may be bad to treat it as a game. If the AI understands that reducing the concept of “communication” to the concept of a “game” carries certain implications, you would be able to explain which reductions and implications are bad without giving the AI complicated explicit rules.
Another example: if the AI has (or is able to reach) an internal worldview in which “loving someone” and “making a paperclip” are fundamentally different things, and not just a matter of arbitrarily complicated definitions, then it may be easier to explain human values to it.
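To make the “boundary” idea slightly more concrete, here is a minimal toy sketch in Python. Every name in it (Concept, boundaries, propose_reduction) is hypothetical, invented purely for illustration; nothing in my post pins down this particular representation. It models a vague concept as a set of candidate meanings plus explicitly marked pairs of meanings that should not be conflated, so a proposed reduction of one meaning to another can be flagged:

```python
# Toy sketch: a vague concept that "notices" when one of its meanings
# is about to be reduced to another across a marked boundary.
# The representation and all names here are hypothetical.

from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    meanings: set[str] = field(default_factory=set)
    # Pairs of meanings that must not be conflated (the "important boundaries").
    boundaries: set[frozenset[str]] = field(default_factory=set)

    def propose_reduction(self, source: str, target: str) -> str:
        """Report what happens if `source` is treated as just a kind of `target`."""
        if frozenset((source, target)) in self.boundaries:
            return (f"'{source}' -> '{target}' crosses a boundary of "
                    f"'{self.name}': the reduction loses something important.")
        return f"'{source}' -> '{target}' looks safe within '{self.name}'."


# Usage: "communication" includes the meaning "game", but reducing
# cooperative dialogue to a game is marked as crossing a boundary.
communication = Concept(
    name="communication",
    meanings={"game", "cooperative dialogue", "ritual"},
    boundaries={frozenset({"cooperative dialogue", "game"})},
)

print(communication.propose_reduction("cooperative dialogue", "game"))
print(communication.propose_reduction("ritual", "game"))
```

Of course, this only relabels the hard problem (where do the boundaries come from?), but it shows the rough computational shape of “respecting a boundary”.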
However, all of this is science fiction if we have no idea how to model concepts and ideas, and their changes of meaning. But my post about colors, I believe, can give you ideas for how to do this. I know:
Maybe it doesn’t have enough information for an (interesting) formalization.
Even if you make an interesting formalization, it won’t automatically solve alignment, even in the best-case scenario.
But it may give you ideas, a new approach. I want to fight for this chance, both because of AI risk and for very deep personal reasons.