habryka comments on Buck’s Shortform

habryka 30 Oct 2025 0:05 UTC
56 points
Copy-pasting what I wrote in a Slack thread about this:
My current take, having thought a lot about a few things in this domain, but not necessarily this specific question, is that the only dimensions where the empirical evidence feels like it was useful, besides a broad “yes, of course the problems are real, and AGI is possible, and it won’t take hundreds of years” confirmation, are the dynamics around how much you can steer and control near-human AI systems to perform human-like labor.
I think almost all the evidence for that comes from just the scaling up, and basically none of it comes from safety work (unless you count RLHF as safety work, though of course the evidence there is largely downstream of the commercialization and scaling of that technology).
I can’t think of any empirical evidence that updated me much on what superintelligent systems would do, even if they are the results of just directly scaling current systems, which is the key thing that matters.
A small domain that updated me a tiny bit, though mostly in the direction of what I already believed, is the material advantage research with stuff like LeelaOdds, which demonstrated more cleanly that you can overcome large material disadvantages with greater intelligence in at least one toy scenario. The update here was really small though. I did make a bet with one person and won that one, so presumably it was a bigger update for others.
I think a bunch of other updates for me are downstream of “AIs will have situational awareness substantially before they are even human-level competent”, which changes a lot of risk and control stories. I do think the situational awareness studies were mildly helpful for that, though most of it was IMO already pretty clear by the release of GPT-4, and the studies are just helpful for communicating that to people with less context or who use AI systems less.
Buck: What do you think we’ve learned about how much you can steer and control the AIs to perform human-like labor?
Me: It depends on what timescale. One thing that I think I updated reasonably strongly on is that we are probably not going to get systems with narrow capability profiles. The training regime we have seems to really benefit from throwing a wide range of data at it, and the capital investments to explicitly train a narrow system are too high. I remember Richard a few years ago talking about building AI systems that are exceptionally good at science and alignment, but bad at almost everything else. This seems a bunch less likely now. And then there is just a huge amount of detail on what things I expect AI to be good at and bad at, at different capability levels, based on extrapolating current progress. Some quick updates here:
- Models will be better at coding than almost any other task
- Models will have extremely wide-ranging knowledge in basically all fields that have plenty of writing about them
- It’s pretty likely natural language will just be the central interface for working with human-level AI systems (I would have had at least some probability mass on more well-defined objectives, though I think in retrospect that was kind of dumb)
- We will have multi-modal human-level AIs, but it’s reasonably likely we will have superhuman AIs in computer use and writing substantially before we have AIs that orient to the world at human reaction speeds (like, combining real-time video, control and language is happening, but happening kind of slowly)
- We have different model providers, but basically all the AI systems behave the same, with their failure modes and goals and misalignment all being roughly the same. This has reasonably big implications for hoping that you can get decorrelated supervision by using AIs from different providers.
- Chains of thought will stop being monitorable soon, but they stayed monitorable for what was IMO a mildly surprisingly long time. This suggests there is maybe more traction on keeping chains of thought monitorable than I would have said a few months ago.
- The models will just lie to you all the time, everyone is used to this, you cannot use “the model is lying to me or clearly trying to deceive me” as any kind of fire alarm
- Factored cognition seems pretty reliably non-competitive with just increasing context-lengths and doing RL (this is something I believed for a long time, but every year of the state of the art still not involving factored cognition is more evidence in this direction IMO, though I expect others to find this point kind of contentious)
- Elicitation in general is very hard, at least from a consumer perspective. There are tons of capabilities that the models demonstrate in one context that are very hard to elicit in other contexts without doing your own big training run. At least in my experience LoRAs don’t really work. Maybe this will get better. (One example that informs my experience here: restoring base-model imitation behavior. Fine-tuning seems to not work great for this; you still end up with huge mode collapse and falling back to the standard RLHF-corpo-speak. Maybe this is just a finetuning skill issue.)
There are probably more things.