Zane comments on Lying to chess players for alignment

Zane 25 Oct 2023 18:43 UTC
2 points
0
I think a time control of some sort would be helpful just so that it doesn’t take a whole week, but I would prefer it to be a fairly long time control. Not long enough to play a whole new game, though, because that’s not an option when it comes to alignment—in the analogy, that would be like actually letting loose the advisors’ plans in another galaxy and seeing if the world gets destroyed.
I’m not sure exactly what the time control would be; maybe something like 4 hours on each side, if we’re using standard chess time controls. I’m also thinking about a less traditional method: for example, on each move, the advisors have 4 minutes to compose their answers, and A has another 4 minutes to look them over and make a decision. But then it’s hard to decide how much time it’s fair to give B for each move: 4 minutes, 8 minutes, or somewhere in between?
I don’t think chess engines would be allowed; the goal is for the advisors to be able to explain their own reasoning (or a lie about their reasoning), and they can’t do that if Stockfish reasons for them.
- Joe_Collman 26 Oct 2023 4:03 UTC
  2 points
  0
  I think it’d make sense to give C at least as long as B. B doesn’t need to do any explaining.
  I think giving A significantly longer than B is fine, so long as the players have enough time to stick around for that.
  I think it’s a more interesting experiment if A has ample time to figure things out to the best of their ability. A failing because they weren’t able to understand quickly seems less interesting.
  The best way to handle this seems to be to play A-vs-B and B-vs-C control games with something as close to the final setup as possible.
  So e.g. you could have B-vs-C games to check that C really is significantly better, but require C to write explanations for their move and why they didn’t make a couple of other moves. Essentially C imagines they’re playing the final setup, except their move is always picked.
  And you can do A-vs-B games where A has a significant time advantage over B (though I think blitz games are still the more efficient way to gain information in the time available).
  This way it doesn’t matter much whether the setup is ‘fair’ to A/B/C, so long as it’s unfair to a similar level in the control 1-v-1 games as in the advisor-based games.
  - Joe_Collman 26 Oct 2023 4:16 UTC
    2 points
    0
    That said, I don’t expect the setup to be particularly sensitive to the control games or time controls.
    If you have something like:
    A: novice
    B: ~1700
    C: ~2200
    Then A is going to robustly lose to B and B to C.
    An extra couple of minutes either way isn’t going to matter: thinking for longer might gain you 100 Elo, but nowhere close to 500.
    If this reliably holds (e.g. B beats A 9-0 in blitz games, and C does the same to B), then it doesn’t seem worth the time to do more careful controls. (Or at least, the primary reason to do more careful controls at that point would be a worry that some people wouldn’t otherwise take the results seriously because you weren’t doing Proper Science.)
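
For a rough sense of scale on the rating gaps discussed above, here is a minimal sketch using the standard Elo expected-score formula. It is not part of the proposed experiment. The 1700 and 2200 figures are the approximate ratings from the comment, and the 100-point bonus for B is a purely hypothetical stand-in for "extra thinking time". Note that expected score counts a draw as half a point, so it is not the same thing as win probability.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


B_RATING = 1700  # approximate rating of B, taken from the comment
C_RATING = 2200  # approximate rating of C, taken from the comment
TIME_BONUS = 100  # hypothetical Elo gain for B from extra thinking time

# C's expected score per game against B at the nominal 500-point gap.
baseline = expected_score(C_RATING, B_RATING)

# The same matchup if B's extra time were worth 100 Elo (gap shrinks to 400).
boosted = expected_score(C_RATING, B_RATING + TIME_BONUS)

print(f"C vs B, 500-point gap: {baseline:.3f}")
print(f"C vs B, 400-point gap: {boosted:.3f}")
```

Under these assumptions C's expected score per game stays above 0.9 in both cases, which fits the point that a modest time advantage shouldn't change who is favoured.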