we might come up with a formal definition to what an “aligned agent” means
I believe this is the difficult part. More precisely, to describe what is the agent aligned with. We can start with treating human CEV as a black box, and specify mathematically what does it mean to be aligned with some abstract f(x). But then the problem will be to specify f(x).
I believe this is the difficult part. More precisely, to describe what is the agent aligned with. We can start with treating human CEV as a black box, and specify mathematically what does it mean to be aligned with some abstract f(x). But then the problem will be to specify f(x).