But the feared / hoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing on tasks without one remains unclear even after two years of trying.
Very much disagree. Granted, there are vacuously weak versions of this claim ('no free lunch'-like) that I of course agree with.
Just talk to Claude 4.5 Opus! Ask it to describe what a paper is about, what follow-up experiments to do, etc. Ask it to ELI-undergrad some STEM topic!
Do you think a pretrained-only model could do as well? Surely not.
Perhaps the claim is that an instruct-SFT or "Chat-RLHF-only" compute-matched model could do as well? The only variant of this I buy is: curate enough instruct-SFT STEM data to match the volume of trajectories generated in VeRL post-training. However, I don't think this counterfactual matters much: it would involve far more human labor and is cost-prohibitive for that reason.
Thanks. I am uncertain (“unclear”), and am interested in sharpening this to the point where it’s testable.
I basically never use a non-RLed model for anything, so I agree with the minimal version of the generalisation claim.
We could just reuse some transfer learning metric? If 100% is full proportional improvement, I'd claim something like <10% spillover onto non-verifiable tasks. What about you?
Another thing I was trying to point at is that I don't know what RL environments they're using for these things, and so don't know which tasks count in the denominator. And I'm unlikely ever to find out.
Seems like Claude has been getting better at playing Pokemon, despite not having been trained on any sort of Pokemon game at all. (Epistemic status: Not sure actually, we don’t know what Anthropic does internally, maybe they’ve trained it on video games for all we know. But I don’t think they have.)
Isn’t this therefore an example of transfer/generalization?
What transfer learning metrics do you have in mind?
My perhaps overcynical take is to assume that any benchmark which gets talked about a lot is being optimised. (The ridiculously elaborate scaffold already exists for Pokemon, so why wouldn’t you train on it?) But I would update on an explicit denial.
I was guessing that the transfer learning people would already have some handy coefficient (normalised improvement on nonverifiable tasks / normalised improvement on verifiable tasks) but a quick look doesn’t turn it up.
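For concreteness, the coefficient in question could be computed like this; the benchmark scores below are made-up illustrative numbers, not measurements:

```python
def normalized_improvement(base: float, trained: float, ceiling: float = 1.0) -> float:
    """Fraction of remaining headroom closed by training."""
    return (trained - base) / (ceiling - base)

def transfer_coefficient(verif_base: float, verif_trained: float,
                         nonverif_base: float, nonverif_trained: float) -> float:
    """Normalised improvement on non-verifiable tasks divided by normalised
    improvement on verifiable tasks; 1.0 would be full proportional transfer."""
    return (normalized_improvement(nonverif_base, nonverif_trained)
            / normalized_improvement(verif_base, verif_trained))

# Hypothetical scores (fraction correct): a verifiable benchmark goes
# 0.40 -> 0.70 after RLVR; a non-verifiable benchmark goes 0.50 -> 0.52
# on the same checkpoint.
print(round(transfer_coefficient(0.40, 0.70, 0.50, 0.52), 2))  # 0.08, i.e. ~8% spillover
```

On these invented numbers the coefficient lands just under the <10% spillover claim above; the hard part is of course picking benchmark pairs that are genuinely 'nearby'.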
It still says on the Twitch stream “Claude has never been trained to play any Pokemon games”
https://www.twitch.tv/claudeplayspokemon
Works for me!
Possibly relevant possibly hallucinated data: https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=sBtoCfWNnNxxGEgiL
I suppose there are two questions here:
1. How strong is generalization in general in RL?
2. Is there a 'generalization barrier' between easy-to-verify and hard-to-verify tasks?
I’m guessing you mainly are thinking of (1) and have (2) as a special case?
To respond to your question, I'm reading it as: assume there's a constant multiplier on the samples needed for out-of-domain training to match in-domain training at a given performance level; for 'nearby' verifiable and non-verifiable tasks, is that constant >= 10x?
I would guess modally somewhere in 3-10x. I'm imagining here comparing training on more olympiad problems vs. some looser question like 'Compare the clarity of these two proofs'. Of course there are diminishing returns etc., so it's not really a constant factor within a narrow domain.
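One way to make the multiplier concrete, under a toy assumption that score grows log-linearly in sample count: invert the learning curve for each domain and take the ratio of samples needed to hit the same target score. Both per-sample gains below are invented for illustration:

```python
import math

def samples_to_reach(target: float, gain_per_log_sample: float, base: float = 0.0) -> float:
    """Invert a toy log-linear learning curve: score = base + gain * ln(n_samples)."""
    return math.exp((target - base) / gain_per_log_sample)

# Invented gains: in-domain training buys 0.10 score per ln(sample);
# transfer to a 'nearby' non-verifiable task buys 0.08.
target = 0.6
multiplier = samples_to_reach(target, 0.08) / samples_to_reach(target, 0.10)
print(round(multiplier, 1))  # 4.5, inside the 3-10x modal guess
```

Note that under this toy curve the multiplier works out to exp(target * (1/gain_out - 1/gain_in)), so it grows with the target score rather than staying constant, which echoes the diminishing-returns caveat above.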
I do agree that there are areas where domain-specific training is a bottleneck, and plausibly some of those are non-verifiable ones. See also my shortform where I discuss some reasons for such a need https://www.lesswrong.com/posts/FQAr3afEZ9ehhssmN/jacob-pfau-s-shortform?commentId=vdBjv3frxvFincwvz
My pet theory of this is that you get 2 big benefits from RLVR:
1. A model learns how to write sentences in a way that does not confuse itself (for example, markdown files written by an AI tend to context-poison an AI far less than the same amount of text written by a human or by error messages).
2. A model learns how to do “business processes”—for example, that in order to write code, it needs to first read documentation, then write the code, and then run tests.
These are things that RL, if done right, is going to improve, and they definitely feel like they explain much of the difference between, say, ChatGPT-4 and GPT-5.
I expect that these effects can have fairly “general” impact (for example, an AI learning how to work with notes), but the biggest improvements would be completely non-generalizable (for example, heuristics in how to place functions in code).
Nice points. I would add “backtracking” as one very plausible general trick purely gained by RLVR.
I will own up to being unclear in the OP: the point I was trying to make is that last year there was a lot of excitement about way bigger off-target generalisation than cleaner CoTs, basic work skills, uncertainty expression, and backtracking. But I should do the work of finding those animal spirits/predictions and quantifying them, and quantifying the current situation.