Daniel Kokotajlo comments on The Commitment Races problem

Daniel Kokotajlo 23 Aug 2019 18:40 UTC
LW: 1 AF: 1
Thanks, edited to fix!
I agree with your push towards metaphilosophy.
I didn’t mean to suggest that the folk theorem proves anything. Nevertheless, here is the intuition: the way the folk theorem shows that any status quo can be sustained is by assuming that players start off assuming everyone else will grim trigger them for violating that status quo. So in a two-player game, if both players start off assuming player 1 will grim trigger player 2 for violating player 1’s preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be “earlier in logical time” than player 2 and make a credible commitment.
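To make the grim-trigger intuition concrete, here is a minimal sketch of the standard one-shot deviation check in a repeated game with discounting. The function name, the payoff numbers, and the discount factor `delta` are all illustrative assumptions, not anything from the post: player 2 sticks to player 1’s preferred status quo whenever complying forever is worth at least as much as deviating once and being punished forever after.

```python
def grim_trigger_sustains(status_quo: float, deviation: float,
                          punishment: float, delta: float) -> bool:
    """Check whether a grim-trigger threat keeps a player at the status quo.

    status_quo: per-round payoff the player gets by complying (assumed)
    deviation:  one-round payoff from the best deviation (assumed)
    punishment: per-round payoff once the trigger has been pulled (assumed)
    delta:      discount factor in [0, 1)
    """
    if not 0 <= delta < 1:
        raise ValueError("delta must be in [0, 1)")
    # Discounted value of complying in every round.
    comply = status_quo / (1 - delta)
    # Deviate once, then receive the punishment payoff in every later round.
    deviate = deviation + delta * punishment / (1 - delta)
    return comply >= deviate


# Hypothetical payoffs for player 2: the status quo is bad for them (2 per
# round), deviating pays 5 once, and punishment yields 0 forever after.
print(grim_trigger_sustains(2, 5, 0, delta=0.9))  # patient player
print(grim_trigger_sustains(2, 5, 0, delta=0.1))  # impatient player
```

Under these made-up numbers the patient player complies and the impatient one does not, which is the sense in which the threat, once believed, is what holds the status quo in place.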
As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any) then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don’t do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have already been committed to according to C. I don’t have a super clear example of how this might lead to disaster, but I intend to work one out in the future...
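As a rough illustration of the bully half of this (a toy sketch, not a worked-out disaster case), the policy choice can be written as an expected-utility comparison under the credences held at the moment of going updateless. The function name, the payoffs, and the credence `p_coward` are hypothetical.

```python
def updateless_policy(p_coward: float,
                      gain_if_caves: float = 10.0,
                      cost_if_refused: float = -5.0) -> str:
    """Pick the policy that is optimal under the fixed prior credences.

    p_coward:        credence that the counterpart simulates you and caves (assumed)
    gain_if_caves:   payoff from demanding when the counterpart caves (assumed)
    cost_if_refused: payoff from demanding, being refused, and having to
                     carry out the punishment (assumed)
    """
    if not 0 <= p_coward <= 1:
        raise ValueError("p_coward must be a probability")
    eu_demand = p_coward * gain_if_caves + (1 - p_coward) * cost_if_refused
    eu_no_demand = 0.0  # baseline: make no demands, nothing happens
    return "demand" if eu_demand > eu_no_demand else "no_demand"


print(updateless_policy(0.5))  # high credence in cowards
print(updateless_policy(0.1))  # low credence in cowards
```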
Same goes for my own experience. I don’t have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.