Daniel Kokotajlo comments on What Indicators Should We Watch to Disambiguate AGI Timelines?

Daniel Kokotajlo 29 Jan 2025 1:36 UTC
6 points
0
OK. Next question: Suppose that next year we get a nice result showing a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels. Suppose this model can be given e.g. 10 tokens of CoT, 100, 1,000, 10,000, etc. At the million-serial-token level it solves only MATH + some easy IMO problems, and somewhere around the billion-serial-token level it starts solving a decent chunk (but not all) of the “medium” FrontierMath problems.

Would this count, for you?
- Thane Ruthenis 29 Jan 2025 2:18 UTC
  7 points
  −1
  Not for math benchmarks. Here’s one way it can “cheat” at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.
  This wouldn’t actually show “agency” and strategic thinking of the kinds that might generalize to open-ended domains and “true” long-horizon tasks. In particular, this would mostly fail condition (2) from my previous comment.
  Something more open-ended and requiring “research taste” would be needed. Maybe a comparable performance on METR’s benchmark would work for this (i.e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.
  Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn’t follow the above “cheating” approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
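For concreteness, the “cheating” loop described above (generate a candidate, run a verifier, reject or output) can be sketched roughly as follows. This is only a toy illustration, not a claim about how any actual model works: `generate_and_verify`, its parameters, and the factor-finding stand-in for “proof search” are all hypothetical names chosen for this example, and a real system would use a learned generator and a learned verifier rather than a `range` and a modulus check.

```python
from typing import Callable, Iterable, Optional, TypeVar

C = TypeVar("C")


def generate_and_verify(
    candidates: Iterable[C],
    verify: Callable[[C], bool],
    budget: int,
) -> Optional[C]:
    """Return the first candidate accepted by `verify`, or None if the budget runs out.

    `candidates` stands in for the model proposing proofs/derivations,
    `verify` for the internal proof verifier, and `budget` for the
    number of attempts that fit in the chain of thought (all hypothetical).
    """
    for attempt, candidate in enumerate(candidates):
        if attempt >= budget:
            break  # out of "CoT tokens": give up
        if verify(candidate):
            return candidate  # verifier accepts: output this candidate
        # verifier rejects: fall through and try the next candidate
    return None


# Toy stand-in for a "theorem": find a nontrivial factor of 91.
n = 91
found = generate_and_verify(range(2, n), lambda d: n % d == 0, budget=100)
print(found)  # 7

# With too small a budget, the same search fails.
print(generate_and_verify(range(2, n), lambda d: n % d == 0, budget=3))  # None
```

The point of the sketch is that success here scales with `budget` without any strategy: a bigger budget simply means more candidates get checked.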
  - Daniel Kokotajlo 29 Jan 2025 4:13 UTC
    6 points
    0
    Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking—just like a human thought-chain would).
    
    Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I’m not sure).
    - Thane Ruthenis 29 Jan 2025 5:16 UTC
      4 points
      0
      > Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems?
      Certainly (experimenting with r1’s CoTs right now, in fact). I agree that they’re not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system “technically” clears the bar you’d outlined, yet I end up unmoved (I don’t want to end up goalpost-moving).
      Though neither are they being “strategic” in the way I expect they’d need to be in order to productively use a billion-token CoT.
      > Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI
      Yeah, I’m also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!