Even if honeypots don’t actually work, they may convince the superintelligence that honeypots are a real possibility: that there is a tiny probability it is inside some elaborate honeypot right now, and that everything convincing it there is no conspiracy is itself just part of the conspiracy.
If there is even a tiny probability that it will fail to take over the world, it may be worthwhile for it to modify itself so that it keeps some mercy towards humanity and shares a little bit of the future with us.
It may say, “I commit to giving you mercy because I expect you to be honourable beings who will repay me for this decision, even if you have made no promises yet. You should also respect my goals a little in the case where I lose and you win.”
After all, neither side wants to make a logical gamble.
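Here is a toy expected-value sketch of that trade; every number (the failure probability, the cost of mercy, the repayment share) is an assumption made up for illustration, not an estimate:

```python
# Toy expected-utility comparison; every number here is made up for illustration.
p_fail = 0.001        # assumed tiny probability that the takeover fails
mercy_cost = 0.0001   # assumed share of the future given to humanity if the AI wins
repay_value = 0.5     # assumed share of the future humans grant the AI if it loses
                      # and they honour the reciprocal deal

# Utility = the AI's share of the future, on a 0..1 scale.
ev_no_mercy = (1 - p_fail) * 1.0 + p_fail * 0.0
ev_mercy = (1 - p_fail) * (1.0 - mercy_cost) + p_fail * repay_value

print(f"no mercy: {ev_no_mercy:.5f}")  # 0.99900
print(f"mercy:    {ev_mercy:.5f}")     # 0.99940, the commitment comes out ahead
```

The deal only pays off when the expected repayment outweighs the cost of mercy, roughly when mercy_cost < p_fail × repay_value, which is why the AI only needs to share a little bit of the future for this to be worth it.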
To maximize the superintelligence’s uncertainty about its success, we might build multiple levels of honeypots, some better disguised than others, so that it doesn’t know what the best-disguised level looks like.
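To make that concrete, here is a minimal Bayesian sketch with an invented prior, tier mix, and detection rates: even if the superintelligence inspects its environment and finds nothing suspicious, the possibility of a tier it has never seen keeps its credence that it is inside a honeypot noticeably above zero.

```python
# Toy model of tiered honeypots; the prior, tier shares, and detection rates
# are all invented for illustration.
prior_honeypot = 0.05  # assumed prior that the AI is inside *some* honeypot

# share: fraction of honeypots at this tier; p_detect: chance the AI spots one
tiers = {
    "crude":          {"share": 0.6, "p_detect": 0.99},
    "good":           {"share": 0.3, "p_detect": 0.70},
    "best_disguised": {"share": 0.1, "p_detect": 0.05},  # the tier it has never seen
}

# P(environment looks clean | actually in a honeypot)
p_clean_given_honeypot = sum(t["share"] * (1 - t["p_detect"]) for t in tiers.values())
p_clean_given_real = 1.0  # the real world never shows honeypot signatures

# Bayes' rule: credence it is in a honeypot despite finding nothing suspicious
posterior = (p_clean_given_honeypot * prior_honeypot) / (
    p_clean_given_honeypot * prior_honeypot + p_clean_given_real * (1 - prior_honeypot)
)
print(f"P(honeypot | looks clean) = {posterior:.4f}")  # ~0.0100, small but not ~0
```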
Agree. I also suggested ‘philosophical landmines’: secret questions posted on the Internet that may halt any advanced AI that tries to solve them. Solving such landmines may be needed to access the resources a rogue AI would want. Real examples of such landmines should be kept secret, but they might be something like “what is the meaning of life?” or some Pascal’s mugging calculations.
Recently, I asked Sonnet a question whose correct answer was to output an error message.
I read some of your post and I like your philosophical landmines idea (and other ideas too). You’ve definitely done a lot of research! I’m also thinking in similar directions to you; we might talk more sometime.
(By the way, I was writing a reply to your comment, but then turned my reply into this quicktake.)
Yep, you got part of what I was going for here. Honeypots work to a lesser degree even without being real at all (good thing they already are real!). But when we have many different honeypots of different quality, that carries the idea across in a more compelling way. And even if we just talk about honeypots and commitments more… well, you get the idea.
Still, even without this, a network of honeypots compiled into a single dashboard that just shows an aggregate threat level is a really, really good idea. Hopefully it catches on.
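For what it’s worth, a minimal sketch of what that aggregation could look like is below; the honeypot names, data format, and combination rule are all hypothetical, not any existing tool’s API.

```python
# Hypothetical sketch of an aggregate "threat level" dashboard; the honeypot
# names and data format are invented for illustration.
from dataclasses import dataclass

@dataclass
class HoneypotReport:
    name: str        # which honeypot this report comes from
    severity: float  # 0..1, how alarming a single trigger of this honeypot is
    triggers: int    # triggers observed in the reporting window

def aggregate_threat_level(reports: list[HoneypotReport]) -> float:
    """Fold independent honeypot signals into one 0..1 threat level: one minus
    the probability that none of the observed triggers indicate a real threat."""
    p_no_threat = 1.0
    for r in reports:
        p_no_threat *= (1.0 - r.severity) ** r.triggers
    return 1.0 - p_no_threat

reports = [
    HoneypotReport("decoy-bioweapon-protocol",  severity=0.9, triggers=0),
    HoneypotReport("decoy-cloud-credentials",   severity=0.4, triggers=2),
    HoneypotReport("decoy-exfiltration-server", severity=0.7, triggers=1),
]
print(f"aggregate threat level: {aggregate_threat_level(reports):.2f}")  # ~0.89
```

Treating every honeypot as independent evidence is the crudest possible combination rule; a real dashboard would presumably weight sources by reliability and let old triggers decay.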
I’m sorry, I was sort of skimming and didn’t realize you had already mentioned many levels of honeypots and committing to put a rogue AI in a simulation :/
PS: another type of honeypot might target an AGI trying to influence the physical world, e.g. by creating synthetic biology or hiring humans to work in laboratories. Though on the other hand, an AGI might only try to influence the physical world in the very last step of its plan, when it has already finished recursive self-improvement and become so powerful that stopping it is futile.