Imagine a multi-agent environment with two rules[1]: first, every agent must obey, in both spirit and letter, every human command given to it.
If you can code this in as part of the multipolar trap, and it gives you alignment (it wouldn’t, but let’s suppose it would), then what do you need the second part for?
The hard part of alignment isn’t that an AI would contradict its programming, because it wouldn’t. The hard part is programming it in such a way that it does what we want, rather than something we don’t want.
See the footnote. These rules aren’t enforced by the programming of the environment; they’re announced in natural language. They are enforced on agents, by agents, much as humans enforce norms on each other.
Presumably it would not be just like humans enforce norms on each other, because the agents wouldn’t be human.
Yeah, true. I’m asking about the potential of harnessing multipolar traps in general, though.
Well, the “why not” boils down to “because there don’t seem to be any clear applications of it”. The example in your original post fails because AGIs are not humans, so you have no reason to expect it to work on them. I don’t see any alternative approaches that are promising. By default, an approach doesn’t have potential unless there’s something special about it.