Related would be some refactoring of Deception Chess.
When I think about what I’d expect to see in experiments like that, I get curious about a sort of “baseline” set of experiments without deception or even verbal explanations. When can I distinguish the better of two chess engines more efficiently than playing them against each other and looking at the win/loss record? How much does it help to see the engines’ analyses over just observing moves?
How is this related? Well, how deep is Chess? Ratings range between, say, 800 and 3500, with 300 points being enough to distinguish players (human or computer) reasonably well. So we might say there are about 10 “levels” in practice, or that it has a rating depth of 10.
If Chess were Best-Of-30 ChessMove as described above, then ChessMove would have a rating depth a bit below 2 (dividing by √30, since rating differences should accumulate across roughly independent decisions as the square root of their number). In other words, we’d expect it to be very hard to ever distinguish any pair of engines from a single recommended move—and difficult with any number of isolated observations, given our own error-prone human evaluation. If it’s closer to Best-Of-30 Don’tBlunder, it’s a little more complicated: usually you can’t tell the difference, because there basically is none, but on rare pivotal moves it will be nearly as easy to tell as when looking at a whole game.
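To make that arithmetic concrete, here’s the back-of-envelope calculation as I’m running it (the 30-decision count and the 300-points-per-level convention are assumptions from the sketch above, not measured figures):

```python
import math

# Toy arithmetic for the rating-depth claim: if a full game is roughly
# Best-Of-30 of some per-move subgame, rating signal accumulates across the
# ~30 independent decisions like sqrt(30), so the per-move "rating depth"
# is the whole game's depth divided by sqrt(30).
game_depth = 10          # ~levels between 800 and 3500 at 300 points per level
decisions_per_game = 30  # assumed count of meaningful decisions in a game

per_move_depth = game_depth / math.sqrt(decisions_per_game)
print(f"per-move rating depth ≈ {per_move_depth:.2f}")  # a bit below 2
```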
The solo version of the experiment looks like this:
1. I find a chess engine with a rating around mine, and use it to analyze positions in games against other engines. Play a bunch of games to get a baseline “hybrid” rating for myself with that advisor.
2. I do the same thing with a series of stronger chess engines, ideally each within a “level” of the last.
3. I do the same thing with access to the output of two engines, and I’m blinded to which is which. (The blinding might require some care around, for example, timing, as well as openings.) In sub-experiment A, I only get top moves and their scores. In sub-experiment B, I can look at lines from the current position up to some depth. In sub-experiment C, I can use these engines however I want. For example, I can basically play them against each other if I want to run down my own clock doing it. (Because pairs might be within a level of one another, I can’t be sure which is stronger from a single win/loss outcome. I’d hope to find more efficient ways of distinguishing them.)
4. I repeat step 3 with different random pairs of advisors.
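The blinding in step 3 is the only mechanically fussy part, so here’s a minimal sketch of how I imagine the harness. The `Engine` stub and its `best_move` method are stand-ins of my own invention, not a real engine API; the point is just the labeling logic—the player sees “A”/“B”, never which engine is behind each:

```python
import random

class Engine:
    """Stub advisor; a real harness would talk to an actual engine here."""
    def __init__(self, name):
        self.name = name

    def best_move(self, position):
        # Placeholder output standing in for a real engine's recommendation.
        return f"{self.name}-move-for-{position}"

class BlindedAdvisors:
    def __init__(self, engine_x, engine_y, rng):
        pair = [engine_x, engine_y]
        rng.shuffle(pair)  # hide which engine is behind which label
        self.labels = dict(zip("AB", pair))

    def recommendations(self, position):
        # Sub-experiment A: the player gets only each label's top move.
        return {label: eng.best_move(position)
                for label, eng in self.labels.items()}

adv = BlindedAdvisors(Engine("weak"), Engine("strong"), random.Random(7))
print(adv.recommendations("start"))
```

The label assignment would be fixed per game (or per match) and logged separately, so the experimenter can score guesses afterward without the player ever seeing the mapping.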
What I’d expect is that my ratings with pairs of advisors should be somewhere between my rating with the bad advisor and my rating with the good advisor. If I can successfully distinguish them, it’s close to the latter. If I’m just guessing, it’s close to the former (in the Don’tBlunder world) or to the midpoint (in the ChessMove world). I should have an easier time in sub-experiments B and C. Having a worse engine in the mix weighs me down relatively more (a) the closer the engines are to each other, and (b) the stronger both engines are compared to me.
The main question I’d hope to answer this way is something like: how do (a) and (b) trade off? Which is easier to distinguish—1800 versus 2100, or, say, 2700 versus 3300? Is there a ceiling beyond which I’m always just guessing? Might I tend to side with worse advisors because, being closer to my level, they agree with me more often?
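For calibration, it’s worth noting what the pure Elo model says about the play-them-against-each-other baseline. This is the standard logistic expected-score curve plus a crude normal-approximation sample-size bound of my own; it ignores draws entirely, which is exactly the complication discussed below:

```python
import math

def expected_score(r_a, r_b):
    """Standard Elo logistic expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def games_to_distinguish(gap, z=1.96):
    """Very rough count of games for the stronger side's score to clear 0.5
    by z standard errors, treating each game as a Bernoulli win/loss trial
    (no draws, which badly understates the difficulty at high ratings)."""
    p = expected_score(gap, 0)
    margin = p - 0.5
    var = p * (1 - p)
    return math.ceil(var * (z / margin) ** 2)

print(expected_score(2100, 1800))  # ≈ 0.85
print(games_to_distinguish(300))   # a handful of games
print(games_to_distinguish(600))   # fewer still
```

Under this model the answer depends only on the rating gap, not on where the pair sits on the scale—so any observed (b) effect is evidence about where the model (and my own evaluation) breaks down.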
It seems like we’d want some handle on these questions before asking how much worse outright deception can be.
(There’s some trouble here because higher-rated players are more likely to draw at a fixed rating difference. This is itself relatively Don’tBlunder-like, and it makes me wonder whether it’s possible to project how far our best engines are from perfect play. But it also makes it harder to disentangle my inability to draw distinctions in play above my level from “natural” indistinguishability. There are more general issues in doing these experiments with computers, too—for example, weak engines tend to be weak in ways humans wouldn’t be, and it’s hard to calibrate ratings for superhuman play.)
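One way to see the draw problem quantitatively: with win = 1, draw = ½, loss = 0 scoring, a draw-heavy regime compresses the per-game score margin toward 0.5, and the number of games needed to detect the edge blows up. A toy calculation (all probabilities are illustrative choices of mine, not measurements from real engines):

```python
import math

def games_needed(p_win, p_draw, z=1.96):
    """Crude normal-approximation count of games needed for the stronger
    side's mean score to clear 0.5 by z standard errors
    (scoring: win=1, draw=0.5, loss=0)."""
    p_loss = 1 - p_win - p_draw
    mean = p_win + 0.5 * p_draw
    var = (p_win * (1 - mean) ** 2
           + p_draw * (0.5 - mean) ** 2
           + p_loss * (0 - mean) ** 2)
    return math.ceil(var * (z / (mean - 0.5)) ** 2)

# Same nominal rating edge, two regimes: decisive games at lower levels
# versus draw-heavy games at higher levels, where the effective score
# edge is compressed toward 0.5.
print(games_needed(p_win=0.80, p_draw=0.10))  # decisive regime: ~4 games
print(games_needed(p_win=0.20, p_draw=0.80))  # draw-heavy regime: ~16 games
```

On these made-up numbers the draw-heavy regime needs roughly four times as many games for the same confidence, which is the sense in which high-level play is “naturally” harder to distinguish even before my own judgment enters the picture.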
(It might also be interesting to automate myself out of this experiment by choosing between recommendations using some simple scripted logic and evaluation by a relatively weak engine.)
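Here’s a toy version of that automation, entirely my own construction: advisors and a weak scripted “referee” are modeled as noisy evaluators of a move’s true value, with smaller noise standing in for greater strength, and the referee simply takes whichever recommendation it scores higher:

```python
import random

def noisy_value(true_value, skill_noise, rng):
    # A player's evaluation: the true value plus Gaussian noise;
    # smaller noise models a stronger player.
    return true_value + rng.gauss(0, skill_noise)

def trial(rng, strong_noise=0.2, weak_noise=0.5, referee_noise=1.0):
    # Two candidate moves with different true values.
    values = [rng.gauss(0, 1), rng.gauss(0, 1)]
    # Each advisor recommends the move it (noisily) thinks is best.
    strong_pick = max(range(2),
                      key=lambda i: noisy_value(values[i], strong_noise, rng))
    weak_pick = max(range(2),
                    key=lambda i: noisy_value(values[i], weak_noise, rng))
    # The weak referee scores both recommendations with its own noisy eval.
    chosen = max([strong_pick, weak_pick],
                 key=lambda i: noisy_value(values[i], referee_noise, rng))
    return values[chosen] >= max(values)  # did we land on the best move?

rng = random.Random(0)
hits = sum(trial(rng) for _ in range(10_000))
print(f"best move chosen {hits / 10_000:.0%} of the time")
```

None of the noise levels here are calibrated to anything; the interesting versions of the experiment would sweep them to see when a weak referee adds value over just following one advisor blindly.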
Along the lines of what I wrote in the parent: I think there’s potentially a related and fairly deep “worldview”-type crux (crux generator?) nearby when it comes to AI risk—are we in a ChessMove world or a Don’tBlunder world? (Sorry, these are terrible names: actual chess moves are more like Don’tBlunder, which is itself horribly ugly.) Even so, I’m not particularly motivated to run this experiment, because I don’t think any possible answer at this level of metaphor would be informative enough to shift anyone on the more important questions.