I don’t have a very insightful comment, but I strongly downvoted this post and I kinda feel the need to justify myself when I do that.
Summary of post: John Wentworth argues that AI Safety plans which involve using powerful AIs to oversee other powerful AIs are brittle by default. To get such setups to work, we need to have already solved the hard parts of alignment, including having a really good understanding of our systems. Some people respond by enumerating specific failure modes we must avoid, but the attitude of “if we just avoid those failure modes then we’ll be fine” leads to declining property values, because there will be unknown unknowns which mess up your plan. He likens these setups to getting Godzilla to fight Mega Godzilla, which obviously leads to declining property values in Tokyo.
Thoughts: I think this post uses strong wording which makes it easy to misinterpret. I think the main thing Wentworth wants to communicate is that our intuition should be that of the Mayor of Tokyo: our researchers can suggest many reasons why Godzilla and Mega Godzilla will keep each other in check, but this is an extremely tenuous setup which we should have a strong bias against. I don’t think the piece makes a good argument for why this is an important intuition to build; after all, there are myriad oversight techniques which are less brittle than a Godzilla fight, or which in concert look much more promising. This is in line with one of John’s comments: “I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment…”
When reasoning about alignment and oversight strategies, I think I should give some, but not much, of my mental energy to “this problem looks like getting Godzilla to keep Mega Godzilla in check.” To the extent that this post advocates putting a larger share of my energy in that bucket, I think the rule of equal and opposite advice applies. There are reasons for “worst-case thinking,” but as far as I can tell this post neither presents them nor argues for them successfully.