Yeah, I’m thinking something like that ultimate patch would be good. For now, we could implement it with a simple classifier. Somewhere in my brain there is a subcircuit that hooks up to whatever I’m thinking about and classifies it as exploitation or non-exploitation; I just need to have a larger subcircuit that reviews actions I’m considering taking, and thinks about whether or not they are exploitation, and then only does them if they are unlikely to constitute exploitation.
A superintelligence with a deep understanding of how my brain works, or a human-level intelligence with access to an upload of my brain, would probably be able to find adversarial examples to my classifier, things that I’d genuinely think are non-exploitative but which by some ‘better’ definition of exploitation would still count as exploitative.
But maybe that’s not a problem in practice, because yeah, I’m vulnerable to being hacked by a superintelligence, sue me. So are we all. Ditto for adversarial examples.
And this is just the first obvious implemention idea that comes to mind, I think there are probably better ones I could think of if I spent an hour on it.
Yeah, I’m thinking something like that ultimate patch would be good. For now, we could implement it with a simple classifier. Somewhere in my brain there is a subcircuit that hooks up to whatever I’m thinking about and classifies it as exploitation or non-exploitation; I just need to have a larger subcircuit that reviews actions I’m considering taking, and thinks about whether or not they are exploitation, and then only does them if they are unlikely to constitute exploitation.
A superintelligence with a deep understanding of how my brain works, or a human-level intelligence with access to an upload of my brain, would probably be able to find adversarial examples to my classifier, things that I’d genuinely think are non-exploitative but which by some ‘better’ definition of exploitation would still count as exploitative.
But maybe that’s not a problem in practice, because yeah, I’m vulnerable to being hacked by a superintelligence, sue me. So are we all. Ditto for adversarial examples.
And this is just the first obvious implemention idea that comes to mind, I think there are probably better ones I could think of if I spent an hour on it.