The scaffold result is very cool and surprising. It seems like Terminal-Bench finds pretty big variance across scaffolds, and Cognition also seems to see a lot of variance between different scaffolds.
It’s cool that TH1.1 is so robust to this. Why do you think that is?