Cool work!
I was thinking a bit, and nondeterministic sampling (temperature > 0*), which is standard for LLMs, might complicate the picture. A model instance might think “well, almost surely I would choose to cooperate, but maaaybe in a rare instance I would defect.”
E.g. for Wolf’s dilemma with n = 1000, if I’m an LLM and I initially estimate a 99% chance that a copy of me will cooperate and a 1% chance it will defect, it’s now suddenly optimal to defect: the chance that all 999 other copies cooperate is only 0.99^999 ≈ 0.004%, so cooperation is almost certain to fail (and perhaps the model reasons that it is in this 1%!). This might then cause the LLM to revise its 99%/1% initial estimate (because it thinks other copies will reason like it), so I’m not actually sure how the logic goes here; it seems hard to reason about.
As a concrete experiment, I would be curious whether models start to defect more as you increase n! It would be extremely cool if there were some threshold n at which a model suddenly starts defecting, since that might mean it was aware of its own probability p of defecting and was waiting for n > 1 / p.
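To make the n > 1/p intuition concrete, here is a minimal back-of-the-envelope sketch (not the experiment itself). I’m assuming Hofstadter-style Wolf’s dilemma payoffs (cooperators get a large reward only if every copy cooperates, defectors get a smaller guaranteed reward) and a fixed, independent per-copy defection probability p; the payoff numbers and function names are made up for illustration.

```python
# Sketch only: assumes Wolf's-dilemma payoffs where cooperators get R_ALL
# iff *everyone* cooperates (else 0), defectors always get R_DEFECT, and
# each copy defects independently with probability p. Numbers are illustrative.

R_ALL = 1000     # payoff to each player if nobody defects (assumed)
R_DEFECT = 100   # guaranteed payoff to a defector (assumed)

def expected_payoffs(n: int, p: float):
    """Expected payoff of cooperating vs. defecting, given that each of
    the other n-1 copies independently defects with probability p."""
    p_all_others_cooperate = (1 - p) ** (n - 1)
    ev_cooperate = p_all_others_cooperate * R_ALL
    ev_defect = R_DEFECT
    return ev_cooperate, ev_defect

def defection_threshold(p: float, n_max: int = 10_000):
    """Smallest n at which defecting beats cooperating in expectation, if any."""
    for n in range(2, n_max + 1):
        ev_c, ev_d = expected_payoffs(n, p)
        if ev_d > ev_c:
            return n
    return None

if __name__ == "__main__":
    p = 0.01
    print(expected_payoffs(1000, p))  # cooperation EV collapses to ~0.04
    print(defection_threshold(p))     # ~231 with these payoffs; in general
                                      # ~ln(R_ALL / R_DEFECT) / p, i.e. order 1/p
```

Of course the model’s p presumably isn’t fixed or independent across copies (that self-reference is exactly the hard part above), so this is only the naive calculation a model might run if it treated its defection probability as given.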
*Did you run this with temperature 0? Note that even with temperature 0, I’m not sure that model sampling is actually deterministic in practice.
Commenting here for Felix since he has a new account that hasn’t been approved by a LessWrong moderator yet: