I think a natural way to think about this is à la AIXI: cleave the system in two, prediction/world-modeling and values.
With this framing, I think most people would be fine with the world-model being inscrutable, as long as you can be confident the values are the right ones (and the values do need to be scrutable for that). For an ASI this kind of has to be the case: it will understand many things about the world that we don’t understand. But the values can be arbitrarily simple.
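For concreteness, the standard AIXI expectimax expression (roughly as Hutter writes it) makes that split explicit:

$$a_k := \arg\max_{a_k}\sum_{o_k r_k}\cdots\max_{a_m}\sum_{o_m r_m}\bigl[r_k+\cdots+r_m\bigr]\sum_{q\,:\,U(q,a_1\ldots a_m)=o_1 r_1\ldots o_m r_m}2^{-\ell(q)}$$

The inner sum over programs $q$ (the Solomonoff-style mixture over chronological Turing machines run on the universal machine $U$) is the world-model half, and the bracketed reward sum $r_k+\cdots+r_m$ is the values half; the claim above is that the former can stay inscrutable so long as the latter is simple and auditable.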
Kind of my hope for mechanistic interpretability is that we could find something isomorphic to AIXI (with inscrutable neural goop instead of Turing machines) and then do surgery on the sensory-reward part. And that this is feasible because 1) the AIXI-like structure is scrutable, and 2) the values the AI has gotten from training probably won’t be scrutable, but we can replace them with something that is, at least “structurally”, scrutable.
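To make the shape of that surgery concrete, here is a minimal sketch. Every name in it (locate_value_head, locate_concept, replace_module, the probe names) is a hypothetical interpretability capability I’m assuming for illustration, not an existing tool:

```python
# Illustrative only: assumes interpretability tools that do not exist today.
import torch.nn as nn


def locate_value_head(model: nn.Module) -> nn.Module:
    """Hypothetical step 1: isolate the 'sensory reward' component of an
    AIXI-like factorization inside the trained network."""
    raise NotImplementedError("the hard, unsolved interpretability step")


def locate_concept(model: nn.Module, name: str):
    """Hypothetical step 2: return a probe mapping the model's latent state
    to its internal representation of a concept ('human', 'corrigible', ...)."""
    raise NotImplementedError


def replace_module(model: nn.Module, old, new) -> nn.Module:
    """Hypothetical helper: splice `new` in where `old` sat in the graph."""
    raise NotImplementedError


def build_scrutable_value_fn(probes: dict):
    """Step 3: a hand-written, human-auditable value function, expressed over
    the model's own ontology rather than over raw observations."""
    def value(latent_state):
        humans_ok = probes["humans_flourishing"](latent_state)
        corrigible = probes["operator_can_correct_me"](latent_state)
        return humans_ok + 10.0 * corrigible  # toy weighting, not a proposal
    return value


def do_surgery(model: nn.Module) -> nn.Module:
    probes = {name: locate_concept(model, name)
              for name in ("humans_flourishing", "operator_can_correct_me")}
    new_value_fn = build_scrutable_value_fn(probes)
    old_head = locate_value_head(model)
    # Swap the learned (inscrutable) value computation for the scrutable one,
    # leaving the world-model and search machinery untouched.
    return replace_module(model, old_head, new_value_fn)
```

The only piece a human ever has to audit is build_scrutable_value_fn; the world-model stays neural goop.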
Be a little careful with this. It’s possible to make the AI do all sorts of strange things via unusual world models. E.g., a paperclip-maximizing AI can believe “everything you see is a simulation, but the simulators will make paperclips in the real world if you do X”.
If you’re confident that the world model is true, I think this isn’t a problem.
I hesitate to say “confident”. But I think you’re not gonna have world models emerging in LLMs that are wrapped in a “this is a simulation” layer... probably?
Also, maybe even if they did, the procedure I’m describing, if it worked at all, would naively make them care about some simulated thing for its own sake, rather than care about the simulated thing for instrumental reasons so it could get some other thing in the real world.
And so under this world model we feel doomy about any system which decides based on its values which parts of the world are salient and worth modeling in detail, and which defines its values in terms of learned bits of its world model?
Yeah, I think so, if I understand you correctly. Even on this view, you’re gonna have to interface with parts of the learned world-model, because that’s the ontology you’ll have to use when you specify the value function. Otherwise I don’t see how you’d get it into a format compatible with the searchy parts of the model’s cognition.
So you’ll get two problems:
1. Maybe the model doesn’t model the things we care about.
2. Even if it kind of models the things we care about, its conception of those concepts might be slightly different from ours. So even if you pull off the surgery I’m describing (identify the things we care about in the model’s ontology: humans, kindness, corrigibility, etc.; stitch them together into a value function; and implant this where the model’s learned value function would have been), you still get tails-come-apart-type failures and you die (toy sketch below).
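A toy numerical sketch of that tails-come-apart failure (my own illustration, with made-up numbers): a proxy utility that tracks the “true” utility well on typical states still decorrelates from it in the extreme tail the optimizer actually selects.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# True utility and a proxy that tracks it well on typical states
# (shared latent factor plus independent noise; overall correlation ~0.8).
latent = rng.normal(size=n)
true_utility = latent + 0.5 * rng.normal(size=n)
proxy_utility = latent + 0.5 * rng.normal(size=n)

print(np.corrcoef(true_utility, proxy_utility)[0, 1])  # ~0.8 over typical states

# "Optimize" the proxy hard: keep only the 100 states with the highest proxy score.
top = np.argsort(proxy_utility)[-100:]

# Within the selected tail the correlation largely evaporates, and the states the
# proxy picks fall well short of the best states by the true utility.
print(np.corrcoef(true_utility[top], proxy_utility[top])[0, 1])
print(true_utility[top].mean(), true_utility.max())
```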
My hope would be that you could identify some notion of corrigibility and use that instead of trying to implant some true value function, because that could be a basin that’s stable under reflection and amplification. Although that unfortunately seems harder than the true-value-function route.
As (a) local AIXI researcher, I think 2 is the (very) hard part. Chronological Turing machines are an insultingly rich belief representation language, which seems to work against us here.