Your human flourishing example sounds like it wouldn’t generalize well. As the AI’s capacities grow stronger it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them, and if it grows more intelligent after we deploy it we will have no way to determine if its thought assessor generalizes wrongly. This is, I would think, a rather basic and obvious flaw in relying on any part of the world model directly.
As for how to code that stuff, well, I’ll figure out how to do that after we’ve all figured out how to mathematically specify those things. :P
> it would start taking more and more work for humans to analyze its plans and determine how much flourishing is in them
I’m not sure where you’re getting that. The thing I described in my last comment did not involve humans analyzing the AI’s plans; it only involved humans labeling YouTube videos.
It would be lovely if humans could reliably analyze the AI’s plans. But I fear that our interpretability techniques will not be up to that challenge.
> we will have no way to determine if its thought assessor generalizes wrongly
I agree, see §14.4.
Ah, sorry, I misunderstood you.