Thanks! Some thoughts here:
The first is how to train oversight AIs when the oversight tasks are no longer easily verifiable—for example, sophisticated reward hacks that can fool expert coders, or hard-to-verify sandbagging behavior on safety-related research. You mentioned that this would get covered in the next post, so I’m looking forward to that.
I think the thing that helps you out here is compositionality: hopefully all of these properties reduce to simpler concepts that are themselves verifiable, so that, e.g., a smart enough interp assistant could understand the individual concepts and how they compose together, and use that to understand more complex latent reasoning that isn’t directly verifiable.
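To make the composition idea a bit more concrete, here’s a toy sketch (everything in it is hypothetical and made up for illustration, not any real tool’s API): imagine we have simple concept probes that we *can* check individually on easy cases, and the hard-to-verify property is expressed as a combination of them.

```python
# Toy sketch: individually verifiable concept probes composed into a
# detector for a harder-to-verify property. All concept names are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

import numpy as np

# A "probe" here is anything mapping hidden activations to a score in [0, 1].
Probe = Callable[[np.ndarray], float]


@dataclass
class CompositeDetector:
    probes: Dict[str, Probe]  # each probe is checkable on its own, on easy cases

    def score(self, acts: np.ndarray) -> Dict[str, float]:
        return {name: probe(acts) for name, probe in self.probes.items()}

    def flags_covert_reward_hacking(self, acts: np.ndarray) -> bool:
        s = self.score(acts)
        # The complex property is just a conjunction of simpler ones.
        return (
            s["modifying_test_code"] > 0.8
            and s["believes_task_unsolved"] > 0.8
            and s["intends_to_report_success"] > 0.8
        )
```

Obviously real compositions won’t be clean three-probe conjunctions, but the hope is that each piece stays individually checkable even when the whole isn’t.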
The second is how robust these oversight mechanisms are to optimization. It seems like a bad idea to train directly against unwanted concepts in predictive concept decoders, but maybe training directly against investigator agents for unwanted behaviors is fine? Using Docent to surface issues in RL environments (and then fixing those issues) also seems good? In some cases, if we have good interpretability, we can actually understand where the undesirable behaviors come from (e.g., via data attribution) and address the source of the problem. That’s great!
But I don’t think this will always be the case, even with much better versions of existing tools. Of course, we could just use these tools mostly for auditing (as described in “Putting up bumpers”), but that limits their utility quite a lot. Clearer thinking about how much we can safely train against various discriminators seems pretty crucial to all of this.
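To spell out the pattern I’m worried about, here’s a rough sketch of “training against a discriminator” (the function names and the oversight model are all hypothetical, just to make the optimization pressure concrete):

```python
# Rough sketch, assuming a hypothetical oversight model with a
# prob_unwanted_behavior() method; not any real training setup.
def shaped_reward(trajectory, task_reward_fn, oversight_model, penalty=5.0):
    """Task reward minus a penalty whenever the overseer flags the trajectory.

    The worry: gradient pressure now points toward whatever evades the
    overseer, not only toward genuinely avoiding the unwanted behavior.
    If the overseer is imperfect, we may end up training evasion.
    """
    r_task = task_reward_fn(trajectory)
    p_flag = oversight_model.prob_unwanted_behavior(trajectory)  # assumed in [0, 1]
    return r_task - penalty * p_flag
```

The auditing-only alternative just never puts `p_flag` into the reward, which is exactly why it’s more robust to this kind of Goodharting, and also why it buys you less.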
I agree with most of this. I’d just add that it’s not totally obvious to me that RLHF is the way we should be doing alignment out into the future: it’s kind of like electroshock therapy for LLMs, which feels pretty pathological from a psychological standpoint. I’d guess there are more psychologically friendly ways to train LMs, and understanding the relationship between training data and behaviors feels like a good way to study this!
Related to the second point: this post seems to assume a non-adversarial dynamic between the supervisors and the supervised (e.g., the AIs are not scheming). To be fair, good oversight can be one of our best tools for preventing scheming/adversarial dynamics from arising in the first place, and many of the tools developed here can also be used to detect scheming (e.g., I wrote about using investigator agents to create realistic honeypots here). Still, more thought on how these problems play out in adversarial scenarios would probably be fruitful.
Do you mean that the oversight system is scheming, or the subject model? For the subject model, the hope is that a sufficiently powerful overseer can catch that. If you’re worried about the overseer, one reason for optimism is that oversight models can have significantly smaller parameter counts than frontier systems, and so are less likely to have weird emergent properties.
Is the worry that if the overseer is used at training time, the model will be eval aware and learn to behave differently when overseen?