[Edit: this comment is probably incorrect, see gjm’s reply.]
I don’t think this should be much of an update about anything, though I’m not sure. See the comments here: https://www.lesswrong.com/posts/jg3mwetCvL5H4fsfs/adversarial-policies-beat-professional-level-go-ais
TL;DR: the adversarial policy wins by exploiting something more like a technicality than by finding a strategy that the AI player “was supposed to beat”.
(I think there’s a general problem where results get signal-boosted on LW by being misrepresented by the authors, with posters not reading carefully enough to catch that, or by being misrepresented by the posters themselves. No specific instance is super bad, and your post here might not be an instance, but it seems like a big pattern and an indicator of something wrong with our epistemics. It’s reasonable for people to post on LW asking “are these representations by the authors accurate?”, but posters tend to just say “hey, look at this result! [insert somewhat hyped/glossed summary]”. Not a huge deal, but still.)
This latest thing is different from the one described there. I think the same people are behind it, but it’s a different exploit.
The old exploit was all about making use of a scoring technicality. The new one is about a genuine blind spot in (at least) KataGo and Leela Zero: what their networks learn about life and death through self-play is systematically wrong in a class of weird positions that scarcely ever occur in normal games. (The positions involve cyclic “chains” of stones. The creator of KataGo has a plausible-sounding explanation of what sort of algorithm the LZ and KG networks may be using, and why it would give wrong results in these cases.)
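For concreteness, here is a minimal sketch of what “cyclic chain” means; this is my own illustration, not code from the paper or from KataGo, and the board representation is hypothetical. A chain is cyclic when its stone-adjacency graph contains a loop, i.e. when a connected set of V stones has at least V adjacencies (a tree has exactly V − 1), and these are exactly the positions where a purely local, tree-like propagation of life-and-death information can go wrong.

```python
from collections import deque

def chain_and_cyclicity(board, start):
    """Flood-fill the chain of stones containing `start` and report
    whether the chain's stone-adjacency graph contains a cycle.

    `board` is a dict mapping (row, col) -> "B" or "W"; empty points
    are simply absent from the dict. A connected graph on V vertices
    is a tree iff it has exactly V - 1 edges, so edges >= V means cyclic.
    """
    color = board[start]
    chain = {start}
    adjacencies = 0  # each edge gets counted once from each endpoint
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if board.get(nb) == color:
                adjacencies += 1
                if nb not in chain:
                    chain.add(nb)
                    queue.append(nb)
    edges = adjacencies // 2
    return chain, edges >= len(chain)

# The simplest cyclic chain: a 3x3 ring of black stones around an empty point.
ring = {(r, c): "B" for r in range(3) for c in range(3) if (r, c) != (1, 1)}
chain, cyclic = chain_and_cyclicity(ring, (0, 0))
print(len(chain), cyclic)  # -> 8 True

# A straight three-stone chain is a tree, hence not cyclic.
line = {(0, 0): "B", (0, 1): "B", (0, 2): "B"}
print(chain_and_cyclicity(line, (0, 0))[1])  # -> False
```

Such loops are vanishingly rare in ordinary games, which is presumably why self-play never forces the networks to handle them correctly.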
I would be very cautious about applying this to other AI systems, but it does match a pattern that is arguably also common with LLMs: the AI has learned something that works well most of the time in practice, but what it has learned falls short of genuine understanding, and as a result it’s exploitable. Unlike a human, who after being hit with this sort of thing once or twice would think “shit, I’ve been misunderstanding this” and try to reason through what went wrong, the AI doesn’t have that sort of metacognition and can just be exploited over and over and over.
Cool! Thanks.
AFAIU the issue is that the attack relies on a specific rule variant, but OTOH the AI was trained using the same rule variant so in this sense it’s completely fair. Also I’m not entirely sure whether the attack used by Pelrine is the same one from before (maybe someone who understands better can comment).
The attack used by Pelrine is not the same one as before and isn’t all about a weird scoring technicality.
The comments in that post are wrong; the exploit does not rely on a technicality or a specific rule variant. I explained how to do it in my post, cross-posted just now alongside Vanessa’s post here.