I’m glad you’re betting on your own taste/expertise instead of donating on behalf of the community—that seems sensible to me.
(I am also a grantmaker at Coefficient/OP)
The arguments/evidence in this post seem true and underrated to me, and I think more people should come work with us.
In particular, I also have updated upward on how impactful the job is over the last year. It does really seem to me like each grantmaker enables a ton of good projects. Here’s an attempt to make more concrete how much is enabled by additional grantmakers: If Jake hadn’t joined OP, I think our interp/theory grants would have been fewer in number and less impactful, because I don’t know those areas nearly as well as Jake does. Jake’s superior knowledge improves our grantmaking in these areas in multiple ways:
Better sourcing: Jake’s involvement meant that the proposals in these areas that were even available to us to evaluate were much better. His contributions to the interp/theory sections of the RFP meant the incoming proposals were higher-quality than if I had attempted to write them, and he had good suggestions/steers for grant applicants that I couldn’t have offered. He was also able to proactively ideate and realize projects that I wouldn’t have thought of or wouldn’t have had time for.
More grants, in more varied subareas: because Jake knows those areas better, he can evaluate proposals faster and is more comfortable arguing for/defending these grants than I am. This allows us to make more, and more varied, grants in those areas.
[The obvious one] Jake has better discernment among proposals in these areas than I do, which straightforwardly increases the impact of our grantmaking.
I think there are probably more buckets of grantmaking, similar in scale and impact to interp and theory, that we’re currently neglecting. We need to hire more people to open up these new vistas of TAIS grantmaking, each of which will contain not just mediocre/marginal grants, but also some real gems! I think this dynamic is often underappreciated; additional grantmakers take ownership of new areas, rather than just helping us make better choices on the margin.
I also think that Jake obviously had way more impact on theory/interp than if he had done direct work. He funded dozens of projects by capable researchers, many of whom wouldn’t have worked on AI safety otherwise. I think most TAIS researchers aren’t taking this nearly as seriously as they should, and I think the case for grantmaking roles looks very strong in light of it.
I fervently agree that the degree of inaction here is embarrassing and indefensible.
Here’s one proposed explanation, though definitely not a justification, of why the Biden admin didn’t do more. To be clear, I’m not saying they’re the only ones who should have done/do more:
The shrinking anti-pandemic agenda
After coming up with a $65 billion moonshot plan, Biden asked for about half of that as part of his initial Build Back Better proposal. But as the entirety of BBB shrank in an effort to secure the support of Joe Manchin and Kyrsten Sinema, the pandemic prevention funding shrank to about $2.7 billion, of which roughly half is to modernize the CDC’s labs.
And it’s far from clear that even this relatively small amount will pass.
The extreme shrinkage of the pandemic prevention agenda in part reflects a partisan calculation. To Democrats who agree that this should be a priority, it doesn’t feel like it’s a distinctively progressive priority that should squeeze out ideas like free preschool or Medicaid expansion, which everyone understands Republicans oppose. They feel like this bill is supposed to be dessert, and pandemic prevention is vegetables.
And the good news is that it’s true — pandemic prevention is not a super partisan topic, and there are prospects for bipartisan cooperation.
The problem is that once you get into the regular appropriations process, the logic of base rates starts to dominate everything. To secure a 30% increase in pandemic preparedness funding would be a big step for appropriators since obviously most programs can’t score increases nearly that large. But we are currently spending peanuts on a problem that has both massive economic consequences and carries genuine existential risk. We don’t need a large increase in pandemic preparedness funding; we need to go from “not seriously investing in preventing pandemics” to “genuinely trying to prevent pandemics” with a gargantuan investment in funds.
What’s particularly galling is that even as the need exceeds the scope of standard appropriations, it’s still relatively modest compared to the $725 billion defense budget for fiscal year 2020. And due to base rate issues, the Biden administration’s 2021 requested increase in defense spending — though modest in percentage terms — still amounts to $12 billion for one year, in a world where asking for a $7 billion per year increase in defending ourselves against pandemics is considered outrageous.
Thanks for writing this! Just booked a time in your calendly to discuss at more length.
For more discussion of the hard cases of exploration hacking, readers should see the comments of this post.
Typo: should be “Gell-Mann”
I figured out the encoding, but I expressed the algorithm for computing the decoding in different language from you. My algorithm produces equivalent outputs but is substantially uglier. I wanted to leave a note here in case anyone else had the same solution.
Alt phrasing of the solution:
Each expression (i.e. a well-formed string of brackets) has a “degree”, which is defined as the number of top-level well-formed chunks that the expression can be broken up into. Some examples: [], [[]], and [-[][]] have degree one; [][], -[][[]], and [][[][]] have degree two; etc.
Here’s a special case: the empty string maps to 0, i.e. decode(“”) = 0
When an encoding has degree one, you take off the outer brackets and do 2^decode(the enclosed expression), defined recursively. So decode([]) = 2^decode(“”) = 2^0 = 1, decode([[]]) = 2^decode([]) = 2, etc.
Negation works as normal. So decode([-[]]) = 2^decode(-[]) = 2^(-decode([])) = 2^(-1) = 1⁄2
So now all we have to deal with is expressions with degree >1.
When an expression has degree >1, you compute its decoding as the product of the decoding of the first chunk and inc(decode(everything after the first chunk)). I will define the “inc” function shortly.
So decode([][[]]) = decode([]) * inc(decode([[]])) = 1 * inc(2)
decode([[[]]][][[]]) = decode([[[]]]) * inc(decode([][[]])) = 4 * inc(decode([]) * inc(decode([[]]))) = 4 * inc(1 * inc(2))
What is inc()? inc() is a function that computes the prime factorization of a number and then increments (from one prime to the next) all the prime bases. So inc(10) = inc(2 * 5) = 3 * 7 = 21, and inc(36) = inc(2^2 * 3^2) = 3^2 * 5^2 = 225. But inc() doesn’t just take in integers; it can take in any number representable as a product of primes raised to powers. So inc(2^(1/2) * 3^(-1)) = 3^(1/2) * 5^(-1) = sqrt(3)/5. I asked the language models whether there’s a standard name for the set of numbers definable in this way, and they didn’t have ideas.
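In case it’s useful to anyone, here’s a quick Python sketch of the algorithm as I’ve described it (my own code, not from the post; it leans on sympy, and it only covers cases where every intermediate value stays rational, so it won’t handle things like inc(2^(1/2))):

```python
from sympy import Rational, factorint, nextprime

def inc(x):
    # Replace each prime in x's factorization with the next prime, keeping exponents.
    # Sketch only: assumes x is a positive rational, as in the examples above.
    x = Rational(x)
    assert x > 0, "this sketch only handles positive rational values"
    out = Rational(1)
    for p, e in factorint(x.p).items():   # numerator's prime factorization
        out *= Rational(nextprime(p)) ** e
    for p, e in factorint(x.q).items():   # denominator's prime factorization
        out *= Rational(nextprime(p)) ** -e
    return out

def split_chunks(s):
    # Split a well-formed string into its top-level chunks, e.g. "[][[]]" -> ["[]", "[[]]"].
    chunks, depth, start = [], 0, 0
    for i, c in enumerate(s):
        if c == "[":
            depth += 1
        elif c == "]":
            depth -= 1
            if depth == 0:
                chunks.append(s[start:i + 1])  # includes any leading '-'
                start = i + 1
    return chunks

def decode(s):
    if s == "":
        return Rational(0)                     # special case: empty string -> 0
    chunks = split_chunks(s)
    if len(chunks) == 1:                       # degree one: strip brackets, exponentiate
        chunk = chunks[0]
        sign = -1 if chunk.startswith("-") else 1
        inner = chunk.lstrip("-")[1:-1]
        return sign * Rational(2) ** decode(inner)
    # degree > 1: first chunk times inc() of everything after it
    return decode(chunks[0]) * inc(decode("".join(chunks[1:])))

assert decode("[]") == 1
assert decode("[-[]]") == Rational(1, 2)
assert decode("[][[]]") == 3
assert decode("[[[]]][][[]]") == 20
assert inc(36) == 225
```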
On point 6, “Humanity can survive an unaligned superintelligence”: In this section, I initially took you to be making a somewhat narrow point about humanity’s safety if we develop aligned superintelligence and humanity + the aligned superintelligence has enough resources to out-innovate and out-prepare a misaligned superintelligence. But I can’t tell if you think this conditional will be true, i.e. whether you think the existential risk to humanity from AI is low due to this argument. I infer from this tweet of yours that AI “kill[ing] us all” is not among your biggest fears about AI, which suggests to me that you expect the conditional to be true—am I interpreting you correctly?
To make a clarifying point (which will perhaps benefit other readers): you’re using the term “scheming” in a different sense from how Joe’s report or Ryan’s writing uses the term, right?
I assume your usage is in keeping with your paper here, which is definitely different from those other two writers’ usages. In particular, you use the term “scheming” to refer to a much broader set of failure modes. In fact, I think you’re using the term synonymously with Joe’s “alignment-faking”—is that right?
People interested in working on these sorts of problems should consider applying to Open Phil’s request for proposals: https://www.openphilanthropy.org/request-for-proposals-technical-ai-safety-research/
This section of our RFP has some other related work you might want to include, e.g. Orgad et al.
I think the link in footnote two goes to the wrong place?
I haven’t read the paper, but based only on the phrase you quote, I assume it’s referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0
Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I’m intuitively skeptical.
One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you’re describing.
You mentioned CyBench here. I think CyBench provides evidence against the claim “agents are already able to perform self-contained programming tasks that would take human experts multiple hours”. AFAIK, the most up-to-date CyBench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on 0/10 attempts at almost all the CyBench tasks that take >40 minutes for humans to complete.
(I work at Open Phil on TAIS grantmaking)
I agree with most of this. A lot of our TAIS grantmaking over the last year was to evals grants solicited through this RFP. But I want to make a few points of clarification:
Not all the grants that have been or will be approved in 2024 are on our website. For starters, there are still two months left in the year. But also, there are some grants that have been approved but haven’t been put on the website yet. So $28 million is a modest underestimate, and it isn’t directly comparable to the 2022/2023 numbers.
I agree that evals don’t create new technological approaches to making AIs safer, but I think there are lots of possible worlds where eval results create more willpower and enthusiasm for putting safeguards in place (especially when those safeguards take work/cost money/etc). Specifically, I think evals can show people what a model is capable of, and what the trend lines are over time, and these effects can (if AIs are increasingly more capable) get people to invest more effort in safeguards. So I don’t agree with the claim that evals won’t “cause any action if they are built”, and I also disagree that “very few TAIS grants are directly focused on making sure systems are aligned / controllable / built safely”.
I appreciate you examining our work and giving your takes!
Edit: Let me know if you or someone you know is interested in working on this sort of research. I work at Open Phil, and we’re interested in getting a better sense of how much demand there is for funding in this area.
Good post. I agree with the headline that faithful CoT is a promising path forward for interpreting and supervising LLMs.
TL;DR: I’d suggest researchers focus more narrowly on a subset of unfaithful chain-of-thought: steganographic/encoded reasoning that the model is using to perform hidden serial reasoning, as studied in Roger et al. IMO this type of unfaithfulness is more important, more neglected, and more tractable to fix than the sort of post-hoc reasoning studied in Turpin et al., or the hidden parallel reasoning studied in Pfau et al.
I think it’s worth distinguishing between three kinds of unfaithful chain of thought that can occur when you ask a model a question and give it time to generate a chain of thought before producing an answer (I’m slightly adapting this trichotomy from Lanham et al., who should get the credit for this conceptual breakdown):
Epiphenomenal (typically post-hoc) reasoning: This is when the process of writing a CoT explanation is not actually helping the model produce its answer. This can be okay, if (as OP mentions), the model is just accurately explaining the reasoning process that it followed to produce its answer. But it can also lead to explanations that are inaccurate, as Turpin et al. demonstrate; these explanations leave out important reasons for a decision and/or make false predictions about how the model would behave on other inputs. For more on whether models’ explanations make accurate predictions about their counterfactual behaviors, see here.
Hidden parallelized reasoning: This is when a model does a bunch of reasoning while generating a chain of thought, but it doesn’t encode any of its intermediate calculations/conclusions in the tokens it produces. It just uses the fact that it has lots of tokens to think with to do a large number of parallelized calculations. See Pfau et al. for a demonstration of what this looks like; you can make models that get a question wrong if forced to answer immediately, but get it right if you ask them “[Question]?...........................Give your answer now:”.
Something like hidden parallelized reasoning may emerge naturally from pre-training, as studied in Wu et al. They show that larger models do more “pre-caching”, i.e. computing a feature that is useless for predicting the literal next token, but will be useful for predicting some future next token down the line.
Steganography/encoded serial reasoning: This is when a model uses the words it generates in its chain of thought to accomplish a task that it could not do in a single forward pass, but it encodes meaning in its word choice that is opaque to a typical overseer. This is demonstrated in Roger et al., which takes a model that is too dumb to learn a QA task without CoT but can learn the task with CoT, and then teaches it to perform the task using steganographic CoTs.
I think steganography is the most concerning form of unfaithful chain-of-thought of these three, but it’s also the easiest to study and mitigate. That is to say, I think it’s more important, more neglected, and more tractable than the other two.
Importance/scariness of steg vs. other forms of unfaithfulness: I think steganographic reasoning in chains of thought is both reasonably likely to emerge naturally and potentially quite dangerous, especially compared to epiphenomenal and hidden parallelized reasoning.
Why likely? It seems useful for multi-step reasoning to arbitrarily assign information-laden connotations to pairs of words that are synonymous to humans, especially under financial pressure to reduce the number of tokens used. Also, OpenAI performed no public tests for steganography on o1, and this OpenAI RS says he’d be totally happy with steganography, so I get the sense that OpenAI at least isn’t prioritizing avoiding steg highly.
Why dangerous? Steganographic CoTs allow models to perform long chains of serial reasoning outside human oversight, and to send messages to other AIs (esp. copies of themselves).
This could allow models to:
Generate complex plans for escaping safety measures and going rogue.
Carefully analyze their inputs to distinguish real chances to defect from synthetic honeypots, or distinguish insecure containment environments (that could be escaped from) from secure ones.
Subtly collude with other copies of themselves in order to coordinate on e.g. sabotaging AI developers’ cybersecurity, alignment research, or AI training code.
Meanwhile, epiphenomenal and hidden parallelized reasoning don’t allow for very long and complex trains of thought, or for hidden communication with other AIs.
Neglectedness: Almost all work on CoT faithfulness (e.g. this) focuses on unfaithful post-hoc reasoning and ignores steganography entirely. I conjecture that this is because long, RL-trained CoTs are new with o1, and there hasn’t been clear evidence of steganographic reasoning occurring organically yet. That said, I’d guess o1 is already steganographic, on the basis of this sort of thing.
Tractability: Roger et al. find that “paraphrasing is an effective defense against all steganography schemes we’ve tried” and Radhakrishnan et al. find that “Question Decomposition Improves the Faithfulness of Model-Generated Reasoning”. I am pretty skeptical that these basic defenses can resist optimization pressures against them, and I’d bet that further research will overturn Roger et al.’s blue-team victory. But I’d like to see the SOTA on defenses pushed further; I think that defenses can probably reduce steganography somewhat and make it incrementally harder for future AIs to defect.
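To make concrete what I mean by a paraphrasing defense, here’s a minimal sketch (my own illustration, not Roger et al.’s actual setup; untrusted_cot, trusted_paraphrase, and answer_from_cot are hypothetical stand-ins for model calls, not a real API):

```python
from typing import Callable

def paraphrase_defense(
    question: str,
    untrusted_cot: Callable[[str], str],        # CoT from the possibly-steganographic model
    trusted_paraphrase: Callable[[str], str],   # semantics-preserving rewrite by a weaker, trusted model
    answer_from_cot: Callable[[str, str], str], # final answer produced from a given CoT
) -> str:
    cot = untrusted_cot(question)
    # Rewriting the CoT in the trusted model's own words aims to preserve the
    # human-legible reasoning while scrambling any hidden encoding in exact word
    # choice, punctuation, or phrasing.
    clean_cot = trusted_paraphrase(cot)
    # The answer step only ever sees the paraphrased CoT, so covert bits encoded
    # in the original wording shouldn't reach the model's later forward passes.
    return answer_from_cot(question, clean_cot)
```

(Whether this kind of setup actually holds up against a model optimized to evade paraphrasing is exactly the open question I’d want pushed on.)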
I think there are a lot of exciting follow-up directions for future research here; finding cases where steg emerges organically, testing out different paraphrasing approaches for reducing steg, finding ways of decomposing tasks into subproblems to reduce how much context/background info any given copy of a model has (and then preventing the copies from steganographically communicating).
(I’d be interested in hearing about any good work in this area that I haven’t mentioned in this comment, if anyone is aware of some)
I got a bit lost in understanding your exit plan. You write
My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them
Some questions about this and the text that comes after it:
How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you’re describing here. What techniques are you imagining using?
Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control “human-obsoleting” AIs?
Why do you “expect that ruling out egregious misalignment is the hardest part in practice”? That seems pretty counterintuitive to me. It’s easy to imagine descendants of today’s models that don’t do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn’t be “philosophically competent”.
What are you buying time to do? I don’t understand how you’re proposing spending the “3 years of time prior to needing to build substantially superhuman AIs”. Is it on alignment for those superhuman AIs?
You mention having 3 years, but then you say “More generally, it just seems really heuristically scary to very quickly go from AIs which aren’t much smarter than the best humans to AIs which are wildly smarter in only a few years.” I found this confusing.
What do you mean by “a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point.” It seems easier to mitigate which risks prior to what point? And why? I didn’t follow this.
In the past, you’ve treated catching an AI escape attempt as a “win condition” for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would “substantially” change the situation w.r.t. the model’s chance of success.
But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.
Is this a change in your views? Or have you always felt that catching an AI trying to escape substantially changes that AI’s chance of success without substantially changing the overall level of AI risk?
Separately, another takeaway of this post: it seems valuable to work on research that allows you to provide evidence about the alignment of other labs’ AIs, e.g. better technical work on searching for inputs that trigger competent misaligned behavior.
What are your objections to Alex/Buck’s definition, for example? I think they define it pretty well. Of course, their definition admits some edge cases and ambiguities, but so do lots of concepts—it still seems like a useful concept to me.