Yeah I think so, if I understand you correctly. Even on this view, you're gonna have to interface with parts of the learned world-model, because that's the ontology you'll have to use when you specify the value function. Otherwise I don't see how you'd get it into a format compatible with the searchy parts of the model's cognition.
So you’ll get two problems:
1. Maybe the model doesn't model the things we care about.
2. Even if it kind of models the things we care about, its conception of those concepts might be slightly different from ours. So even if you pull off the surgery I'm describing, i.e. identifying the things we care about in the model's ontology (humans, kindness, corrigibility, etc.), stitching them together into a value function, and then implanting this where the model's learned value function would be (rough sketch after this list), you still get tails-come-apart-type stuff and you die.
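To make the shape of that surgery concrete, here's a minimal PyTorch-style sketch. Everything in it is hypothetical: the probes, the `value_head` attribute, and the agent object are assumed interfaces, not a real API. The assumption is that we already have interpretability probes that pick out our concepts in the model's latent ontology, and that the searchy part of the agent reads its values from a replaceable value head:

```python
# Hypothetical sketch of the "surgery": stitch concept probes into a value
# function expressed in the model's own ontology, then swap it in for the
# learned value head. All names here are illustrative assumptions.

import torch
import torch.nn as nn

class StitchedValueFunction(nn.Module):
    """Weighted combination of concept probes (humans, kindness, corrigibility, ...)."""

    def __init__(self, probes: dict[str, nn.Module], weights: dict[str, float]):
        super().__init__()
        # Each probe maps a latent world-model state to a scalar "degree of concept".
        self.probes = nn.ModuleDict(probes)
        self.weights = weights

    def forward(self, latent_state: torch.Tensor) -> torch.Tensor:
        # Value = weighted sum over how strongly each cared-about concept is realized.
        return sum(self.weights[name] * self.probes[name](latent_state).squeeze(-1)
                   for name in self.probes)

def implant_value_function(agent, probes, weights):
    # Replace the learned value head so the searchy part of the agent's cognition
    # optimizes the stitched-together values instead of whatever it learned.
    agent.value_head = StitchedValueFunction(probes, weights)
    return agent
```

The worry in problem 2 is exactly that the probes' notion of "kindness" etc. only approximately matches ours, and heavy optimization against the approximation is where the tails come apart.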
My hope would be that you could identify some notion of corrigibility and use that instead of trying to implant some true value function, because corrigibility could be basin-stable under reflection and amplification. Although that unfortunately seems harder than the true-value-function route.
As (a) local AIXI researcher, I think problem 2 is the (very) hard part. Chronological Turing machines are an insultingly rich belief representation language, which seems to work against us here.