Sometimes people talk about how AIs will be very superhuman at a bunch of (narrow) domains. A key question related to this is how much this generalizes. Here are two different possible extremes for how this could go:
1. It’s effectively like an attached narrow weak AI: The AI is superhuman at things like writing ultra-fast CUDA kernels, but from the AI’s perspective, this is sort of like having a weak AI tool attached to it (in a well-integrated way) which is superhuman at this skill. The part which is writing these CUDA kernels (or otherwise doing the task) is effectively weak and can’t draw in a deep way on the AI’s overall skills or knowledge to generalize (likely it can draw on these shallowly, in a way similar to the overall AI providing input to the weak tool AI). Further, you could actually break out these capabilities into a separate weak model that humans can use. Humans would use this somewhat less fluently, since they can’t translate their thoughts into it instantaneously and aren’t absurdly practiced at using the tool (like AIs would be), but the difference is ultimately mostly convenience and practice.
2. Integrated superhumanness: The AI is superhuman at things like writing ultra-fast CUDA kernels via a mix of applying relatively general (and actually smart) abilities, having internalized a bunch of clever cognitive strategies which are applicable to CUDA kernels and sometimes to other domains, and domain-specific knowledge and heuristics. (Similar to how humans learn.) The AI can access and flexibly apply all of the things it learned from being superhuman at CUDA kernels (or whatever skill), and with a tiny amount of training/practice it can basically transfer all of these things to some other domain, even if that domain is very different. The AI is at least as good at understanding and flexibly applying what it has learned as humans would be if they learned the (superhuman) skill to the same extent (and perhaps the AIs are actually much better at this than humans). You can’t separate these capabilities into a weak model: a weak model RL’d on this task (or that the capability was distilled into) would either be much worse at CUDA or would need to actually be generally quite capable (rather than weak). (One toy way to operationalize the difference between these two extremes is sketched below.)
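A minimal sketch of one way to operationalize this distinction, assuming you could run the relevant experiments (a toy protocol of my own, not something proposed in the post; every evaluator/trainer callable and threshold below is a hypothetical placeholder):

```python
# Toy protocol for probing whether a superhuman narrow skill looks more like
# extreme (1) (separable "attached tool") or extreme (2) (integrated ability).
# All callables are hypothetical placeholders; scores are assumed to be on a
# comparable normalized scale, and the thresholds are arbitrary.

from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillProbe:
    eval_skill: Callable[[object], float]         # e.g. score on a CUDA-kernel benchmark
    eval_transfer: Callable[[object], float]      # score on a related-but-distinct domain
    distill: Callable[[object], object]           # distill just this skill into a small model
    finetune_lightly: Callable[[object], object]  # tiny amount of practice on the new domain


def classify_extreme(big_model: object, probe: SkillProbe) -> str:
    base = probe.eval_skill(big_model)
    distilled = probe.eval_skill(probe.distill(big_model))             # how separable is the skill?
    transfer = probe.eval_transfer(probe.finetune_lightly(big_model))  # how well does it generalize?

    if distilled >= 0.9 * base and transfer <= 0.5 * base:
        return "closer to (1): skill distills cleanly into a weak model but transfers poorly"
    if transfer >= 0.8 * base:
        return "closer to (2): skill transfers to a new domain with only light practice"
    return "somewhere in between"
```

The thresholds are arbitrary; the point is just that extreme (1) predicts the skill survives distillation into a weak model while transferring poorly, and extreme (2) predicts the reverse.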
My sense is that the current frontier LLMs are much closer to (1) than (2) for most of their skills, particularly the skills which they’ve been heavily trained on (e.g. next token prediction or competitive programming). As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2). That said, it seems likely that powerful AIs built in the current paradigm[1] which otherwise match humans at downstream performance will somewhat lag behind humans in integrating/generalizing the skills they learn (at least without spending a bunch of extra compute on skill integration), because this ability currently seems to lag behind their other capabilities relative to humans, and because AIs can compensate for worse skill integration with other advantages (being extremely knowledgeable, high speed, parallel training on vast amounts of relevant data including “train once, deploy many”, better memory, faster and better communication, etc.).
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.

[1] If the paradigm radically shifts by the time we have powerful AIs, then the relative level of integration is much less clear.

Good articulation.
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer. And I think some (or a lot) of people’s beliefs about artificial intelligence are downstream of these beliefs about the origins of biological intelligence and human expertise, e.g., in the Yudkowsky/Ngo dialogues. (Object level: both the LW-central hypothesis and the alternatives to it seem insufficiently articulated; they operate as a background hypothesis too large to see rather than something explicitly noted, imo.)
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer.
Makes me wonder whether most of what people believe to be “domain transfer” could simply be IQ.
I mean, suppose that you observe a person being great at X, then you make them study Y for a while, and it turns out that they are better at Y than an average person who spent the same time studying Y.
One observer says: “Clearly some of the skills at X have transferred to the skills of Y.”
Another observer says: “You just indirectly chose a smart person (by filtering for high skills at X), duh.”
This seems important to think about, I strong upvoted!
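A minimal toy simulation of the second observer’s story (my own illustration, with made-up numbers): give everyone a shared general factor, make skill at X and skill at Y each depend on it with zero direct X-to-Y transfer, and the people selected for being great at X still come out better at Y.

```python
# Toy model of the selection confound: skill at X and skill at Y both load on a
# shared general factor g, with NO direct transfer from X to Y. Selecting people
# who are great at X still makes them look better than average at Y.
# All distributions and coefficients are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

g = rng.normal(size=n)                       # latent general ability ("IQ" in this toy)
skill_x = g + rng.normal(scale=1.0, size=n)  # skill at X: g plus X-specific noise
skill_y = g + rng.normal(scale=1.0, size=n)  # skill at Y after study: g plus Y-specific noise

great_at_x = skill_x > np.quantile(skill_x, 0.95)  # "observe a person being great at X"

print(f"average Y skill, everyone:   {skill_y.mean():+.2f}")
print(f"average Y skill, great at X: {skill_y[great_at_x].mean():+.2f}")
# The second number is clearly higher even though this model has zero X->Y transfer,
# so the raw comparison can't distinguish "skills transferred" from
# "you indirectly selected a smart person".
```

Distinguishing the two observers’ stories would require something like controlling for (or matching on) a prior measure of general ability, which the simple comparison above doesn’t do.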
As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2).
I’m not sure that link supports your conclusion.
First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn’t support the idea that it would generalize to non-CUDA tasks.
Maybe if you asked the AI “please list heuristics you use to write CUDA kernels,” it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize, if they do generalize. This depends on 1) the model being aware of many heuristics that it’s learned, 2) many of these heuristics generalizing across domains, and 3) the model being able to use its awareness of these heuristics to successfully generalize. None of these are clearly true to me.
Second, the paper only tested GPT-4o and Llama 3, so the paper doesn’t provide clear evidence that more capable AIs “shift some towards (2).” The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws—has anybody done this? I wouldn’t be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
Fair, but I think the AI being aware of its behavior is pretty continuous with being aware of the heuristics it’s using and ultimately generalizing these (e.g., in some cases the AI learns what code word it is trying to make the user say, which is very similar to being aware of any other aspect of the task it is learning). I’m skeptical that very weak/small AIs can do this, based on some other papers which show they fail at substantially easier (out-of-context reasoning) tasks.
I think most of the reason why I believe this is improving with capabilities is due to a broader sense of how well AIs generalize capabilities (e.g., how much does o3 get better at tasks it wasn’t trained on), but this paper was the most clearly relevant link I could find.
I’m not sure o3 does get significantly better at tasks it wasn’t trained on. Since we don’t know what was in o3’s training data, it’s hard to say for sure that it wasn’t trained on any given task.
To my knowledge, the most likely example of a task that o3 does well on without explicit training is GeoGuessr. But see this Astral Codex Ten post, quoting Daniel Kang:[1]
We also know that o3 was trained on enormous amounts of RL tasks, some of which have “verified rewards.” The folks at OpenAI are almost certainly cramming every bit of information and every conceivable task into their o-series of models! A heuristic here is that if there’s an easy-to-verify answer and you can think of it, o3 was probably trained on it.
I think this is a bit overstated, since GeoGuessr is a relatively obscure task, and implementing an idea takes much longer than thinking of it.[2] But it’s possible that o3 was trained on GeoGuessr.
The same ACX post also mentions:
On the other hand, the DeepGuessr benchmark finds that base models like GPT-4o and GPT-4.1 are almost as good as reasoning models at this, and I would expect these to have less post-training, probably not enough to include GeoGuessr
Do you have examples in mind of tasks that you don’t think o3 was trained on, but which it nonetheless performs significantly better at than GPT-4o?

[1] Disclaimer: Daniel happens to be my employer.

[2] Maybe not for cracked OpenAI engineers, idk.
I would guess that OpenAI has trained on GeoGuessr. It should be pretty easy to implement—just take images off the web which have location metadata attached, and train to predict the location. Plausibly getting good at GeoGuessr imbues some world knowledge.
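For what it’s worth, here is a rough sketch of what that data pipeline might look like (purely illustrative: the photos directory and helper names are hypothetical, and this is a guess at the idea in the comment above, not a description of anything OpenAI has confirmed doing):

```python
# Sketch: turn photos that carry GPS EXIF metadata into (image, latitude, longitude)
# supervision for a GeoGuessr-style location-prediction task.
# Assumes a reasonably recent Pillow; the "photos" directory is a placeholder.

from pathlib import Path
from PIL import Image  # pip install Pillow

GPS_IFD = 0x8825  # EXIF "GPSInfo" IFD


def dms_to_degrees(dms, ref) -> float:
    """Convert EXIF (degrees, minutes, seconds) rationals to signed decimal degrees."""
    if isinstance(ref, bytes):
        ref = ref.decode()
    deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
    return -deg if ref in ("S", "W") else deg


def location_label(path: Path):
    """Return (lat, lon) if the image has GPS EXIF data, else None."""
    gps = Image.open(path).getexif().get_ifd(GPS_IFD)
    if not gps or 2 not in gps or 4 not in gps:   # 2 = GPSLatitude, 4 = GPSLongitude
        return None
    lat = dms_to_degrees(gps[2], gps.get(1, "N"))  # 1 = GPSLatitudeRef
    lon = dms_to_degrees(gps[4], gps.get(3, "E"))  # 3 = GPSLongitudeRef
    return lat, lon


# Build (image_path, lat, lon) examples; a location-prediction head would then be
# trained against these labels (regression on coordinates, or classification over cells).
dataset = [
    (p, *loc)
    for p in Path("photos").glob("**/*.jpg")
    if (loc := location_label(p)) is not None
]
```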
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I think this might be wrong when it comes to our disagreements, because I don’t disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?

[1] As long as “downstream performance” doesn’t include downstream performance on tasks that themselves involve a bunch of integrating/generalising.
I don’t think this explains our disagreements. My low-confidence guess is that we have reasonably similar views on this. But I do think it drives parts of some disagreements between me and people who are much more optimistic than me (e.g. various not-very-concerned AI company employees).
I agree the value of (1) vs (2) might also be a crux in some cases.
Is the crux that the more optimistic folks plausibly agree (2) is cause for concern, but believe that mundane utility can be reaped with (1), and they don’t expect us to slide from (1) into (2) without noticing?
I suppose that most tasks an LLM can accomplish could theoretically be performed more efficiently by a dedicated program optimized for that task (and even better by a dedicated physical circuit). Hypothesis (1) amounts to considering that such a program, a dedicated module within the model, is established during training. This module can be seen as a weak AI used as a tool by the stronger AI, a bit like how the human brain has specialized modules that we (the higher conscious module) use unconsciously (e.g., when we read, the decoding of letters is executed by a specialized module).
We can envision that at a certain stage the model becomes so competent at programming that it will tend to write code on the fly, as a tool, to solve most tasks we submit to it. In fact, I notice that this is already increasingly the case when I ask a question to a recent model like Claude 3.7 Sonnet. It often generates code, a tool, to try to answer me rather than trying to answer the question ‘itself.’ It clearly realizes that dedicated code will be more effective than its own neural network. This is interesting because in this scenario, the dedicated module is not generated during training but on the fly during normal production operation. In this way, it would be sufficient for an AI to become a superhuman programmer to become superhuman in many domains thanks to the use of these tool-programs. The next stage would be the on-the-fly production of dedicated physical circuits (FPGA, ASIC, or alien technology), but that’s another story.
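As a small illustration of that pattern (my example, not from the comment): asked something like “how many primes are below 100,000?”, a model in this mode would typically emit and run a short dedicated program rather than trying to produce the number from its weights.

```python
# The kind of throwaway "tool program" a model might generate on the fly instead of
# answering from its weights: a simple sieve of Eratosthenes to count primes below n.

def count_primes_below(n: int) -> int:
    """Count primes strictly less than n."""
    if n < 3:
        return 0
    sieve = bytearray([1]) * n
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return sum(sieve)


print(count_primes_below(100_000))  # 9592
```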
This relates to the philosophical debate about where intelligence resides: in the tool or in the one who created it? In the program or in the programmer? If a human programmer programs a superhuman AI, should we attribute this superhuman intelligence to the programmer? And the same question arises if the programmer is itself an AI. It’s the kind of chicken-and-egg debate where the answer depends on how we divide the continuity of reality into discrete categories. You’re right that integration is an interesting criterion, as it is a kind of formal, non-arbitrary solution to this problem of defining discrete categories within the continuity of reality.