jsteinhardt

Karma: 6,189

jsteinhardt 27 May 2026 4:41 UTC
3 points
1
in reply to: catawampless’s comment on: Cognitive Security as an AI Safety Cause Area
I agree that this is an important externality and it’s something I think about a fair amount.

My current view on this is:
- We can roughly decompose into two questions: (A) “does AI behavior X have psychological effect Y on humans” and (B) “how much of a propensity does this AI system have to exhibit behavior X?”
- We will typically answer (A) with a combination of existing psychology literature, longitudinal studies, and intuition from domain experts.
- We will typically answer (B) with in silico simulations.
- We will also use longitudinal studies to sanity check that the answers to (A) and (B) actually compose as expected.
To answer (B), you need simulations that are similar enough to humans to elicit similar behaviors from the language model. But these simulations are short-term, not long-term, so they don’t need to simulate the long-term effects on humans.

An unscrupulous company could potentially use the simulations from (B) to optimize for behaviors that elicit the desired short-term responses from humans. But since we’re looking at short-term effects, they could have already optimized directly on their pool of users; there isn’t much uplift from a simulation.

The main advantage of simulations is that they (1) give you apples-to-apples comparisons across different models, and (2) let you make measurements even if you don’t have a ton of user traffic to draw on. Both of these differentially help evaluators compared to large companies.

jsteinhardt 25 May 2026 23:47 UTC
6 points
0
in reply to: Andrii Vasylenko’s comment on: Cognitive Security as an AI Safety Cause Area
For long timespans, I agree you probably want data from real humans rather than in silico simulations. Generating such data ethically is a problem that has been studied quite a bit for other technologies such as social media. For instance, The Welfare Effects of Social Media (Allcott, Braghieri, Eichmeyer, and Gentzkow) pays a random subset of users to stop using Facebook and looks at the resulting effects on well-being and several other cognitive attributes.

Incidentally, all four of the authors of that study have gone on to continue doing some pretty cool work!

jsteinhardt 25 May 2026 21:06 UTC
7 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: Cognitive Security as an AI Safety Cause Area
Which part are you thinking is hard? The in silico simulations, or large-scale recruitment?

Cognitive Security as an AI Safety Cause Area

jsteinhardt 25 May 2026 18:30 UTC

155 points

18 comments, 2 min read, LW link

The Case for Evaluating Model Behaviors

jsteinhardt 20 May 2026 18:42 UTC

40 points

3 comments, 3 min read, LW link

Building Technology to Drive AI Governance

jsteinhardt 18 Feb 2026 18:30 UTC

59 points

4 comments, 10 min read, LW link

(bounded-regret.ghost.io)

jsteinhardt 9 Jan 2026 5:05 UTC
2 points
0
in reply to: Tim Hua’s comment on: Oversight Assistants: Turning Compute into Understanding
Is the worry that if the overseer is used at training time, the model will be eval aware and learn to behave differently when overseen?

jsteinhardt 8 Jan 2026 1:49 UTC
3 points
0
in reply to: Tim Hua’s comment on: Oversight Assistants: Turning Compute into Understanding
Thanks! Some thoughts here:
> The first is how to train oversight AIs when the oversight tasks are no longer easily verifiable—for example, sophisticated reward hacks that can fool expert coders, or hard-to-verify sandbagging behavior on safety-related research. You mentioned that this would get covered in the next post, so I’m looking forward to that.
I think the thing that helps you out here is compositionality—all of these properties hopefully reduce to simpler concepts that are themselves verifiable, so hopefully e.g. a smart enough interp assistant could understand all the individual concepts as well as how they compose together and use this to understand more complex latent reasoning that isn’t directly verifiable.
> The second is how robust these oversight mechanisms are to optimization. It seems like a bad idea to train directly against unwanted concepts in predictive concept decoders, but maybe training directly against investigator agents for unwanted behaviors is fine? Using Docent to surface issues in RL environments (and then fixing those issues) also seems good? In some cases, if we have good interpretability, we can actually understand where the undesirable behaviors come from (i.e., data attribution) and address the source of the problem. That’s great!
> But I don’t think this would always be the case even if we have way better versions of existing tools. Of course, we could just use these tools mostly for auditing (as described in “Putting up bumpers”), but that limits their utility by a lot. Clearer thinking about how much we can train against various discriminators seems pretty crucial to all of this.
I agree with most of this. I’d just add that it’s not totally obvious to me that RLHF is the way we should be doing alignment out into the future—it’s kind of like electroshock therapy for LLMs which feels kind of pathological from a psychological standpoint. I’d guess that there are more psychologically friendly ways to train LMs—and understanding the relationship between training data and behaviors feels like a good way to study this!
> Related to the second point: this post seems to assume a non-adversarial dynamic between the supervisors and the supervised (e.g., the AIs are not scheming). Good oversight can be one of our best tools to prevent scheming/adversarial dynamics from arising in the first place. Many of the tools developed can also be used to detect scheming (e.g., I wrote about using investigator agents to create realistic honeypots here). Still, more thought on how these problems play out in adversarial scenarios would probably be fruitful.
Do you mean that the oversight system is scheming or the subject model? For the subject model the hope is that a sufficiently powerful overseer can catch that. If you’re worried about the overseer, one reason for optimism is that the oversight models can have significantly smaller parameter count than frontier systems, so are less likely to have weird emergent properties.

jsteinhardt 6 Jan 2026 23:33 UTC
3 points
0
in reply to: anaguma’s comment on: Oversight Assistants: Turning Compute into Understanding
The explainer model is actually evaluated based on how well the explanations predict ground-truth activation patterns, so it’s not being evaluated by an LM-judge, but against the underlying ground-truth.
There is still room to hack the metric to some extent (in particular, we use an LM-based simulator to turn the explanations into predictions, so you could do better by providing more simulator-friendly explanations). This is probably happening, but we did a head-to-head comparison of LM-generated vs. human-generated explanations, and based on spot-checking them by hand, the higher-scoring explanations under our metric really did seem better.
There are also a number of other sanity checks in the paper if you’re interested!
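For a concrete picture of the scoring setup described above, here is a minimal sketch. It is an illustration under assumptions, not the paper's implementation: Pearson correlation is assumed as the agreement measure (the comment does not say which metric is used), `keyword_simulator` is a toy stand-in for the LM-based simulator, and all names and numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant prediction carries no signal
    return cov / sqrt(var_x * var_y)


def score_explanation(simulate, explanation, tokens, true_activations):
    """Score an explanation by how well simulated activations match real ones.

    `simulate(explanation, token)` plays the role of the simulator: it turns
    an explanation into a predicted activation for each token.
    """
    predicted = [simulate(explanation, token) for token in tokens]
    return pearson(predicted, true_activations)


def keyword_simulator(explanation, token):
    """Toy simulator (assumption): predict 1.0 if the token is named in the explanation."""
    return 1.0 if token.lower() in explanation.lower().split() else 0.0


# Made-up tokens and made-up "ground-truth" activations for one feature.
tokens = ["cat", "the", "dog", "ran"]
true_activations = [0.9, 0.0, 0.8, 0.1]

good = score_explanation(
    keyword_simulator, "fires on animal words such as cat and dog",
    tokens, true_activations,
)
bad = score_explanation(
    keyword_simulator, "fires on the word the",
    tokens, true_activations,
)
```

The score is computed against the activations themselves rather than a judge's opinion, but the simulator sits in the loop, which is where the "simulator-friendly explanations" loophole comes from.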

Oversight Assistants: Turning Compute into Understanding

jsteinhardt 6 Jan 2026 0:50 UTC

85 points

7 comments, 9 min read, LW link

(bounded-regret.ghost.io)

jsteinhardt 27 Dec 2025 19:16 UTC
3 points
0
in reply to: Lukas Finnveden’s comment on: Contradict my take on OpenPhil’s past AI beliefs
My guess would be that it’s because they paid Hypermind directly rather than making the grant to me.

jsteinhardt 27 Dec 2025 3:31 UTC
7 points
3
in reply to: Eliezer Yudkowsky’s comment on: Contradict my take on OpenPhil’s past AI beliefs
If you are interested, I did a detailed analysis of different groups of forecasters here: https://bounded-regret.ghost.io/scoring-ml-forecasts-for-2023/
I wouldn’t treat competitive forecasters as a homogeneous group, but I also think basically everyone was surprised by the rate of progress on the MATH dataset. The main difference is that the better forecasters adjusted quickly after the first surprise and were mostly calibrated after.

jsteinhardt 27 Dec 2025 3:28 UTC
7 points
2
in reply to: Lukas Finnveden’s comment on: Contradict my take on OpenPhil’s past AI beliefs
My forecasts actually were funded by OP! I would guess that the main counterfactual change as a result of this was going with Hypermind over Good Judgment. It might be interesting to look at differences between those populations of forecasters—I would not model “super forecasters” as homogeneous and in retrospect the particular forecasters we got seemed not super good at AI questions, or else just weren’t trying hard enough. But I also worked with some very good, AI-focused forecasters as a sanity check and they were also surprised by progress as determined by pre-registered predictions.

jsteinhardt 19 Dec 2025 20:39 UTC
LW: 2 AF: 1
0
AF
in reply to: Oliver Daniels’s comment on: Scalable End-to-End Interpretability
Thanks, appreciate it! Interested if you have any particular tasks you’d want as part of the safety case (we are actively building out a dataset of tasks for evaluating interpretability assistants and looking for ideas).

Scalable End-to-End Interpretability

jsteinhardt 18 Dec 2025 22:37 UTC

121 points

3 comments, 3 min read, LW link

Analyzing long agent transcripts (Docent)

jsteinhardt 24 Mar 2025 20:49 UTC

41 points

2 comments, 1 min read, LW link

(bounded-regret.ghost.io)

jsteinhardt 24 Mar 2025 19:27 UTC
2 points
0
in reply to: Maxwell Peterson’s comment on: Analyzing long agent transcripts (Docent)
Looks like an issue with the cross-posting (it works at https://bounded-regret.ghost.io/analyzing-long-agent-transcripts-docent/). Moderators, any idea how to fix?
EDIT: Fixed now, thanks to Oliver!

jsteinhardt 22 Mar 2025 17:01 UTC
LW: 5 AF: 3
2
AF
in reply to: Daniel Kokotajlo’s comment on: METR: Measuring AI Ability to Complete Long Tasks
I meant coding in particular; I agree algorithmic progress is not 3x faster. I checked again just now with someone and they did indeed report a 3x speedup for writing code, although they said that the new bottleneck becomes waiting for experiments to run (note this is not obviously something that can be solved by greater automation, at least up until the point that AI is picking better experiments than humans).

jsteinhardt 20 Mar 2025 23:49 UTC
LW: 8 AF: 4
−8
AF
in reply to: Daniel Kokotajlo’s comment on: METR: Measuring AI Ability to Complete Long Tasks
Research engineers I talk to already report >3x speedups from AI assistants. It seems like that has to be enough that it would be showing up in the numbers. My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, and this is already being taken into account in the curves, and without this effect you would see a slower (though imo not massively slower) exponential.
(This would argue for dropping the pre-2022 models from the graph, which I think would give slightly faster doubling times, on the order of 5-6 months if I had to eyeball.)
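To make the doubling-time arithmetic concrete, here is a rough sketch of how one could compare a fit over all models with a fit over recent models only. The data below are synthetic and chosen purely for illustration (they are not METR's measurements), and the function name is hypothetical. The approach is an ordinary least-squares fit of log2(task horizon) against time, with the doubling time given by 1/slope.

```python
from math import log2


def doubling_time_months(months, horizons):
    """Fit log2(horizon) = a + slope * month by least squares.

    Returns the doubling time in months, i.e. 1 / slope.
    """
    n = len(months)
    if n != len(horizons) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    ys = [log2(h) for h in horizons]
    mean_x = sum(months) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    if sxx == 0:
        raise ValueError("all points share the same date")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, ys))
    slope = sxy / sxx  # doublings per month
    if slope <= 0:
        raise ValueError("no growth in the fitted trend")
    return 1.0 / slope


# Synthetic series: horizons double every 12 months at first, then every 6.
months = [0, 12, 24, 30, 36, 42]
horizons = [1, 2, 4, 8, 16, 32]  # arbitrary units

all_models = doubling_time_months(months, horizons)
recent_only = doubling_time_months(months[2:], horizons[2:])
```

In this toy series the fit over everything comes out slower than the fit restricted to the later points, which is the qualitative effect of dropping early models; the actual size of the effect depends entirely on the real data.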

jsteinhardt 20 Mar 2025 6:17 UTC
LW: 16 AF: 8
−4
AF
in reply to: Daniel Kokotajlo’s comment on: METR: Measuring AI Ability to Complete Long Tasks
Doesn’t the trend line already take into account the effect you are positing? ML research engineers already say they get significant and increasing productivity boosts from AI assistants, and have been saying so for some time. I think the argument you are making is double-counting this. (Unless you want to argue that the kink with Claude is the start of the super-exponential, which we would presumably get data on pretty soon).
