I strong-upvoted this post. Here's a specific, zoomed-in version of this game proposed by Nate Soares:

"like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then."

Might be useful as a standalone game or as a mini-game within the overall game of building and breaking an alignment proposal, which is itself a mini-game in the overall game of building and breaking success stories.
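To make the move structure concrete, here's a minimal Python sketch of the game loop as I read the quote. Everything in it (the names `Move`, `MiniGame`, the flag fields) is a hypothetical illustration, not anything Nate specified; the judgments the flags encode would be made by the human players arguing it out, not by code.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Winner(Enum):
    BREAKER = auto()  # the "i" of the quote: proposes deployment divergences
    BUILDER = auto()  # the "you": asserts training-data coverage


@dataclass
class Move:
    player: str  # "breaker" or "builder"
    claim: str   # the divergence or training situation being asserted
    # Concessions a player may be forced into, per the quote's win conditions
    # (in the real game these are contested judgments, not boolean inputs):
    exotic_training_data: bool = False  # builder needs exotic/unlikely data
    unnatural_concept: bool = False     # breaker posits an unnatural concept
    past_pivotal_act: bool = False      # divergence pushed past a pivotal act


@dataclass
class MiniGame:
    transcript: list[Move] = field(default_factory=list)

    def play(self, move: Move) -> Winner | None:
        """Record a move and check the three terminal conditions."""
        self.transcript.append(move)
        if move.exotic_training_data:
            return Winner.BREAKER
        if move.unnatural_concept or move.past_pivotal_act:
            return Winner.BUILDER
        return None  # no winner yet: "we go back and forth"
```

The only point of the sketch is that each of the quote's three win conditions maps to one terminal check; all the substance of the game lives in how the players argue a flag into being set.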
I like that mini-game! Thanks for the reference.