[call turns out to be maybe logistically inconvenient]
It’s OK if a person’s mental state changes because they notice a pink car (“human object recognition” is an easier-to-optimize/comprehend process). It’s not OK if a person’s mental state changes because the pink car has weird subliminal effects on the human psyche (“weird subliminal effects on the human psyche” is a harder-to-optimize/comprehend process).
So, somehow you’re able to know when an AI is exerting optimization power in “a way that flows through” some specific concepts? I think this is pretty difficult; see the fraughtness of inexplicitness or more narrowly the conceptual Doppelgänger problem.
It’s extra difficult if you’re not able to use the concepts you’re trying to disallow in order to disallow them, and it sounds like that’s what you’re trying to do (you’re trying to “automatically” disallow them, presumably without the use of an AI that does understand them).
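To make the worry concrete, here’s a toy framing (mine, not yours) of what an “automatic” filter would have to look like; `FlowDetector`, `detect_flow`, and `is_allowed` are hypothetical names I’m introducing just for illustration.

```python
# Toy framing (mine, not yours) of "automatically disallow optimization
# that flows through concept C".

from typing import Callable, Set

# Hypothetical: something that inspects a plan/reasoning trace and reports
# which concepts the plan's influence flows through.
FlowDetector = Callable[[str], Set[str]]


def is_allowed(trace: str, disallowed: Set[str], detect_flow: FlowDetector) -> bool:
    """Reject any plan whose influence flows through a disallowed concept."""
    return detect_flow(trace).isdisjoint(disallowed)
```

The filter is only as good as `detect_flow`, and `detect_flow` has to already understand things like “weird subliminal effects on the human psyche” in order to flag them; that’s the circularity I’m pointing at (and conceptual Doppelgängers let the AI route its optimization through some unlabeled stand-in for the concept anyway).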
You say this:
But I don’t get if, or why, you think that adds up to anything like the above.
Anyway, is the following basically what you’re proposing?
Humans can check goodness of $M_1$ because $M_1$ is only able to think using stuff that humans are quite familiar with. Then $M_1$ is able to oversee $M_2$ because… (I don’t get why; something about mapping primitives, and deception not being possible for some reason?) Then $M_n$ is really smart and understands stuff that humans don’t understand, but is overseen by a chain that ends in a good AI, $M_1$.
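To check I’m parsing that structure right, here’s a toy sketch in Python of the chain as I’m reading it. The $M_1 \dots M_n$ naming, `Model`, `human_check`, and `oversee` are all my own placeholders, not anything from your post; the `NotImplementedError` marks exactly the step I don’t get.

```python
# Toy sketch of the oversight chain as I'm reading the proposal.
# Everything here (Model, human_check, oversee) is my own placeholder framing.

from dataclasses import dataclass
from typing import List, Set


@dataclass
class Model:
    name: str
    concepts: Set[str]  # stand-in for "the stuff this model thinks in terms of"


def human_check(model: Model, human_concepts: Set[str]) -> bool:
    """Humans can vouch for M_1 only because everything it thinks with is familiar."""
    return model.concepts <= human_concepts


def oversee(overseer: Model, overseen: Model) -> bool:
    """The step I don't get: why does M_k's approval transfer trust to M_{k+1},
    given that M_{k+1} thinks with concepts M_k doesn't have?"""
    raise NotImplementedError("this is the load-bearing step the proposal has to justify")


def chain_is_trusted(chain: List[Model], human_concepts: Set[str]) -> bool:
    """Trust M_n iff humans trust M_1 and each link successfully oversees the next."""
    if not chain or not human_check(chain[0], human_concepts):
        return False
    return all(oversee(chain[i], chain[i + 1]) for i in range(len(chain) - 1))
```

If that’s roughly the shape, then my questions above are really about what goes inside `oversee`.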
It was briefly in the 300s overall, and 1 or 2 in a few subcategory thingies.