We show that a range of frontier models (Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro) can be prompted to “early exit” their CoT and displace reasoning into the response. This undermines the controllability frame: these prompted models retain most of their reasoning capability (a 4–8pp average accuracy cost, vs. 20–29pp for no reasoning at all) while moving it into the stylistically controllable channel.
I’m sure I’m just missing context, but why are models better at controlling style outside of their CoT? I’m surprised by this; the distinction between CoT and “reasoning in your output, e.g. via code comments” seems very weak/blurry to me.
Oh right, of course, thanks!