It’s clear that it was never optimized for odds games, so unless concrete evidence is presented, I doubt that @titotal actually played against a “superhuman” system—which may explain why they won.
There’s definitely a ceiling to how much intelligence helps—as the other commenter mentioned, not even AIXI could recover from an adversarially designed initial position in Tic-Tac-Toe.
But I’m highly skeptical OP has reached that ceiling for chess yet.
On Modus Tollens, playing around with ChatGPT yields an interesting result. Turns out, the model seems to be… ‘overthinking’ it, I guess. It treats it as a complex question—answering `No` on the grounds that the predicates provided are insufficient. I think that may be why, at some point in scale (≈1B), model performance drops straight to 0. (Conversation)
Sternly forcing it to deduce only from the given statements (I’m unsure how much CoT helped here; an ablation would be interesting) gets it right. It seems that larger models are injecting some interpretation of nuance—while we simply want the logical answer derived from the narrow set of provided statements.
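For reference, the inference the task is probing—modus tollens: from P → Q and ¬Q, conclude ¬P—can be checked mechanically with a truth table. A minimal Python sketch (variable names are mine, not from the benchmark):

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: P -> Q is false only when P is true and Q is false."""
    return (not p) or q

# Modus tollens is valid iff the conclusion (not P) holds in every
# truth assignment where both premises (P -> Q, not Q) hold.
valid = all(
    (not p)
    for p, q in product([False, True], repeat=2)
    if implies(p, q) and not q
)
print(valid)  # True — the only satisfying assignment is P=False, Q=False
```

So the logically correct answer is unambiguous given only the stated premises—the model’s `No` comes from importing considerations outside them.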
It’s weirdly akin to how we become suspicious when a question is too simple. Somehow, due to RLHF or pre-training (most likely the latter—no RLHF’d models are tested here, AFAIK), the priors are better suited to deducing answers that fall in the gray region than to converging on a definitive answer.
This is in line with what the U-scaling paper found. I hypothesize CoT forces the model to stick as closely to the instructions as possible by breaking the problem into (relatively) more objective subproblems, which aren’t as ambiguous, so the model gets a decent idea of how to approach each one.