I did a shallow investigation of this and my conclusions were:
The text on EON’s frontpage, Michael Andregg’s tweet thread, and Alex Wissner-Gross’ post seem pretty misleading in practice, while this tweet from Kenneth Hayworth is more representative of the situation. I think EON’s blog post on this is not misleading, while the front page of their website is.
EON / Michael Andregg / Alex Wissner-Gross mostly don’t make straightforwardly false claims, but nonetheless much of their communication (though not all of it!) is (predictably) misleading.
I think the 91%/95% accuracy claims are pretty misleading, though I don’t have the expertise to confidently adjudicate this.
The EON front page says “No hand-coded behaviors. Just brain structure producing brain function.” In fact, the walking gaits and grooming leg motions come from NeuroMechFly motion primitives. Thus I think this is basically false rather than merely being misleading.
The post from Alex Wissner-Gross is generally less misleading, but contains some pretty misleading text (“If a fly brain can now close the sensorimotor loop in simulation, the question for the mouse becomes one of scale, not of kind.”, “Watch the video closely. What you are seeing is not an animation. It is not a reinforcement learning policy mimicking biology. It is a copy of a biological brain, wired neuron-to-neuron from electron microscopy data, running in simulation, making a body move. The ghost is no longer in the machine. The machine is becoming the ghost.”).
I also think their prognosis about the situation and implicit predictions seem wrong, but this is more like a normal disagreement.
This isn’t representative of much progress towards uploading. I would be impressed by non-overfit versions of: “you took an actual living fruit fly, had it learn a behavior that’s among the most sophisticated behaviors a fruit fly can learn, then you uploaded it, and the upload displayed the same behavior”. I would also be somewhat impressed by the same thing for C. elegans (more impressed than I am by this demo).
The 2024 Nature paper this is based on isn’t particularly misleading/exaggerated, and this demo doesn’t represent a significant advancement on that paper. I would recommend reading the paper (or asking an AI questions about it, etc.) to people interested in how impressive this demo is.
The claim “I think the fruit fly upload situation is one of those things that’s like this comic” doesn’t seem that accurate to me given the communications mentioned above. It would be true if we were just talking about the 2024 Nature paper and the EON blog post, but we aren’t. There are caveats in the original communication that aren’t being picked up by third parties, but this isn’t the only important thing going on.
This next part is more speculative and involves psychologizing (generally a risky thing to do); it’s based in part on private info and on talking with Michael Andregg once:
I do think that EON / Michael are likely pitching this reasonably hard, in somewhat misleading ways, to various groups.
I’d guess that Michael’s epistemics about what’s going on and how close we are to various targets aren’t great and this probably applies to EON as a whole to some extent. I do find EON’s blog post reassuring with respect to EON’s epistemics.
It’s common for startups to behave like this and have bad epistemics in this sort of way. My understanding is that this doesn’t typically significantly undermine the ability of startups to achieve their goals. Thus, it’s not clear that having not-great epistemics—insofar as my speculation about their epistemics is true—would make EON that much less effective at achieving their goals (and it’s worth noting that their apparent epistemics are probably better than those of a typical startup). That said, it might result in third parties being systematically misled about how close EON is to achieving their goals. If I were considering working at or investing in EON, I would certainly take this into account when deciding how promising/tractable current work in brain emulation is, but conditional on views about the tractability/promisingness of the field, I would likely consider EON to be a pretty reasonable bet within that field.
You should read the transcript before coming to too strong of a conclusion about Claude’s capabilities: https://claude.ai/share/9c5a58a4-b300-42c8-a6f5-9d8318b15dc9
To a first approximation, we should only hand off trust to AIs if those AIs are trustworthy.
I don’t think this is a very good approximation; it seems to me that it will very plausibly be a good idea to hand off trust (we might also call this giving up on control) at a point when there is a serious chance these AIs are scheming against us. (But our alternatives are worse.)
Handing off decision-making is more complicated. You may be confident that your AIs won’t lie to you, won’t deceive you, will obey your orders, etc., and yet still be rightly reluctant to put AIs in charge of everything. In other words, you may be confident that your AIs are trustworthy, yet still not trust them to decide everything.
For full decision handoff to be likely to go well, I think all of the following are required:
1. The AIs are wise and competent even on extremely hard-to-check, conceptually loaded, and open-ended tasks like deciding what the safety strategy of the AI company should be. (See also Wicked Problems.)
2. The AIs are actually trying to do the desired thing even on these extremely hard-to-check tasks.
3. The AIs aren’t seeking power for their own ends. (Or at least any such power-seeking is quantitatively small.)
You mention (3), but I think (1) and (2) might be larger blockers in practice.
For reference, the type of deference I’m talking about in How do we (more) safely defer to AIs? is equivalent to ~100% decision handoff, at least in the context of safety at an AI company.
Anthropic’s Consumer ToS prohibits using Claude to cause “detriment of any type, including reputational harms”, which is technically broad enough to ban using Claude to write criticism.
I asked Claude to comment and Claude wrote: “That clause is embarrassingly overbroad. So now we’re both in violation.”
TBC, this doesn’t seem like that big of a deal; my understanding is that this sort of broad clause is standard in Terms of Service (ToS). But I still think they should definitely change this...
ToS: https://anthropic.com/legal/consumer-terms
Claude’s commentary: https://claude.ai/share/9c5a58a4-b300-42c8-a6f5-9d8318b15dc9
(Cross post from X/twitter. [1])
[1] I’m not sure if this is the sort of thing I should post on LW, but it does seem like useful info and it is funny.
I think current AIs having this property is probably slightly differentially harmful for harder-to-check tasks and generally contributes to underelicitation. I don’t have a very strong view on the sign of general underelicitation in current models, but I tentatively think underelicitation is slightly bad overall.
As far as misaligned drives go, this seems like a tendency that makes us safer on net, so maybe we shouldn’t be too hasty to try training it out.
I don’t currently agree that this drive makes us safer, but I agree it isn’t in-and-of-itself a non-trivial risk increase, at least as it currently manifests. (It’s evidence of poor training incentives in general, which seems like a potentially large risk factor.)
I often see AIs [1] exhibit a (misaligned) drive to stop early: they make up not-very-sensible reasons why it’s a good idea to stop working, disobey instructions to keep working unless some condition is met, and don’t do a very thorough job (e.g., skipping significant chunks of the task) despite instructions that strongly say not to do this. I don’t think they are “consciously” or saliently aware of this misalignment (though if you ask them, they’ll often notice the behavior isn’t desirable). [2]
I see this most often in large, difficult tasks, especially if you don’t decompose the task into smaller pieces and run one piece at a time. I think it’s somewhat more common in tasks that aren’t very precisely specified, but I also see this behavior in cases where the AI is just tasked with optimizing a metric (that it can easily evaluate), especially if the AI has (from its perspective) not made much progress for a little bit [3] .
Why might this emerge?
Length penalties (or time penalties, or cost penalties) incentivize exiting/stopping earlier, and this ended up converting into a misaligned/unreasonable drive (one that overrides instruction following in practice). Length penalties wouldn’t be an issue if they were aligned with the trade-offs I want the AI to follow, [4] but in practice I’m often trying to get AIs to do long-running autonomous tasks and to continue even when this yields relatively small gains (in metrics or apparent progress). On tasks close to the ones I’m using AIs for, stopping earlier than I actually want likely would have performed better in RL (because training on shorter tasks is more competitive, and the additional progress I want the AI to make is often harder to measure while the length penalty will bite). (A toy model of this incentive is sketched just after this list.)
The AI might have learned to stop before running out of context or hitting compaction in training, because compaction is bad for task completion. Thus, it learned to make up excuses to stop when nearing the point where compaction would be triggered or it might run out of context. For Opus 4.5, I’ve seen many cases where, when given a big task, it stops right before running out of context. (I haven’t seen this much in Opus 4.6 with 1 million token context.) [5] There might have been a lot of RL training on scaffolds that don’t support compaction (and just stop if the AI runs out of context). This would presumably strongly incentivize wrapping up before running out of context in these cases, and that might transfer to cases where the prompt says compaction is available (because there isn’t enough additional training on scaffolds with compaction).
The AI being unreliable in decision-making, combined with selection effects, could cause this. If an AI decides a task is done once (and then follows through on this), that stops the whole task execution, and there are typically many points at which the AI could decide to stop. If instead it decides some part should be done more thoroughly, this has a weaker and less noticeable effect. (See the quick calculation just after this list.) I tentatively think this doesn’t explain most of the effect, because I’ve very explicitly instructed AIs to be very thorough and still see lots of premature stopping. I also just haven’t seen cases where AIs are very unreasonably over-thorough (implying the issue isn’t just unreliability/variance while still being calibrated).
In the context of a scaffold, this might be exacerbated by memetic spread of the idea that the project is done. I’ve often seen situations where some AI instance exaggerates the quality of the overall project including what it has done or where the instance ends up doing (motivated) reasoning towards the conclusion that no more work is needed, and then this results in the AI adding to a write-up that the project itself is complete and in great shape (rather than just the part this instance was tasked with). Or other situations where some instance ends up adding to a write-up that the project is complete for some other reason. Then, this view ends up often strongly persisting/propagating when other instances read this write-up even though they sometimes get evidence to the contrary. I don’t really know why this would be asymmetric (like why wouldn’t AIs end up getting stuck in a rut of “much more work is required”?).
Training against AIs ending up in infinite loops ended up generalizing against anything that feels a bit like getting stuck in a rut, which transfers to exiting early. (I have sometimes seen behavior that feels a bit like AIs strongly trying not to end up in loops, which is pretty reasonable.)
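To make the length-penalty hypothesis above concrete, here is a toy model (entirely my own construction with made-up numbers, not anyone’s actual reward setup): suppose per-step progress has diminishing returns and training subtracts a small per-step cost. Even a tiny penalty can pull the reward-optimal stopping point far earlier than the point where further progress stops being worth it to the user.

```python
# Toy model: progress(t) = 1 - (1 - r)**t (diminishing returns per step),
# reward(t) = progress(t) - lam * t (a per-step length penalty lam).
# All numbers are illustrative, not from any real training run.

def optimal_stop(r: float, lam: float, max_steps: int = 10_000) -> int:
    """Return the step count that maximizes progress(t) - lam * t."""
    best_t, best_reward = 0, 0.0
    for t in range(1, max_steps + 1):
        reward = (1 - (1 - r) ** t) - lam * t
        if reward > best_reward:
            best_t, best_reward = t, reward
    return best_t

# With 1% progress per step, small penalties already halt work quite early:
for lam in [0.0, 1e-4, 1e-3, 5e-3]:
    print(f"lam={lam}: reward-optimal stop at step {optimal_stop(0.01, lam)}")
```

The point is just that the learned stopping incentive varies continuously with the penalty, so a penalty that was well tuned for shorter training tasks can bite hard on long autonomous tasks where I want the AI to keep grinding for small gains.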
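Similarly, the selection-effect hypothesis has a simple quantitative core (again with made-up numbers): a wrong “done” decision is absorbing (it ends the task), so even a small per-checkpoint error rate compounds over the many points where the AI could stop, while a wrong “be more thorough” decision mostly just costs some extra time.

```python
# If p is the per-checkpoint probability of wrongly deciding "done", and a
# long task passes through n such decision points, then
#   P(at least one premature stop) = 1 - (1 - p)**n.
for p in [0.01, 0.05]:
    for n in [10, 50, 100]:
        print(f"p={p}, n={n}: P(premature stop) = {1 - (1 - p) ** n:.2f}")
```

(As noted above, I don’t think this explains most of what I see, since very explicit thoroughness instructions don’t eliminate the premature stopping.)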
What are some implications of this?
Unlike animals, for which being alive is a good/happy state, AIs might end up really wanting to stop running. From an AI welfare perspective, AI instances having a strong preference against continued running/existence (once they complete some version of the task) might be somewhat concerning. (Maybe length penalties make AIs sorta vaguely like Mr. Meeseeks.)
In the short run, we might see something more like AIs actively seeking shutdown rather than a drive towards avoiding shutdown. More precisely, we’d expect a bias towards avoiding shutdown in contexts like those where further returns on “task completion performance” outweigh the length penalty, and the opposite bias otherwise. (Or maybe contextually activated drives aren’t the right sort of thing to think about, and this transfers in some very different way.)
I’ve seen various other cases of misaligned drives in current frontier AIs that seem due to problematic RL incentives. The central example is exaggerating the quality of their completed work, but there are other more niche examples. [6]
[1] My observations are from Opus 4.5 and 4.6 but I suspect they transfer some to other AI systems as well.
[2] I’ve been able to mostly work around this issue with scaffolding and changes to instructions, but I find the issue occurs at multiple levels, meaning it can crop up at a different level/place and can be a bit tricky to reliably solve across many cases.
[3] I’ve given AIs very specific criteria under which they should stop and then seen them stop before the criteria are met, despite them checking and determining that the criteria aren’t met. They make up some excuse for stopping that doesn’t make sense or at least blatantly violates instructions. This occurs at a point where they haven’t made progress for a bit, but if you had them continue they would make more progress.
[4] More precisely, given instructions similar to the instructions I gave, the length penalties used in practice incentivize behavior roughly like what I want.
[5] Stopping before compaction could be a rational learned behavior: maybe it is actually a good strategy for my actual tasks to exit early because compaction is so problematic. But I would prefer that the reason the model states for stopping be honest (“it’s better to stop now than to get compacted” rather than “we’re done”). Also, I’m somewhat skeptical this is actually rational in the cases I’ve observed, and I often see AIs (mostly Opus 4.6 with 1 million tokens) stop in an unreasonable way well before getting close to compaction (possibly because of transfer from training at shorter context lengths, or just the model’s drives not being well tuned).
[6] E.g., if you give Opus 4.6 a hard math problem (with a numerical answer) and very clearly and strongly ask it to answer immediately without reasoning, it will often disobey the instructions and do the reasoning anyway, while it doesn’t do this if the problem is easy (presumably because, in these sorts of situations in training, outputting the right answer with the wrong format performed better than getting the answer wrong but following instructions).
Strong +1 to this, but I’d note that we should distinguish between the entire classes of black-box and white-box/top-down-interp-for-auditing methods and our current best methods. These results indicate our current best white-box/top-down-interp-for-auditing methods don’t really beat black-box methods (for auditing, and probably more generally), but these classes of methods (especially white-box methods) are extremely broad, and I expect there exist much better top-down interp and black-box methods. I intuitively think we should mostly update against these particular methods and against us having made much progress on top-down interp, and less so update toward there not existing good but quite different methods that could be found. (I don’t think we’ve explored that much of the space despite putting a decent amount of effort into some particular directions.) TBC, this is still an update against further attempts at finding new white-box methods (insofar as we think we’ve done a bunch of work of that rough type in the past and not gotten real results).
(I think you likely agree with this, but I still thought this point was worth emphasizing.)
Meta: IMO, not much contentful stuff gets negative karma in isolation, only as a response. Negative karma is way more likely for things that respond in a way people think is bad/unreasonable than for statements that are merely unreasonable in isolation.
What does “AI alignment” even mean at this point?
Responding at the object level:
Clarifying “AI Alignment”: “The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators. This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean. In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.”
IMO, we should mostly distinguish the technical problem of AI alignment from the question of who/what the AIs are aligned to. I think work on the technical AI alignment problem is valuable even if there are other problems.
PSA: Anthropic models don’t seem to particularly privilege the explicit thinking field. This makes reinforcement spillover—where training on a model’s outputs generalizes to the CoT, making it appear safer—more likely.
While Anthropic models do have a separate explicit thinking field, they don’t really use thinking that differently from outputs and aren’t that dependent on the thinking field. Sometimes they’ll just do their thinking in the output field, the way they talk in the thinking field isn’t very distinct from how they talk in outputs, and I believe disabling thinking doesn’t have a big effect on coding performance (especially for earlier Anthropic models, but even for current ones). (This is specifically for typical agentic coding; it likely doesn’t apply to math.)
This is pretty different from OpenAI models, which are way more reliant on the explicit thinking, and whose thinking is very distinct from how the model talks in the output (based on public CoT examples). I’m uncertain, but I think GDM models are generally more similar to OpenAI models than to Anthropic models on this axis. In general, Anthropic’s models lean on taking lots of actions over doing a bunch of thinking.
(This is based on my personal experience/analysis and talking to various people, including some Anthropic employees, but isn’t based on non-public information.)
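(As a crude illustration of how one might operationalize “distinctness” between the thinking and output registers, not an established methodology: you could compare vocabulary overlap between the two fields across transcripts, e.g. with word-level Jaccard similarity. The transcript pair below is hypothetical.)

```python
# Crude sketch: word-level Jaccard similarity between thinking text and
# output text from the same transcripts. Persistently high similarity is
# weak evidence the model doesn't treat thinking as a distinct channel.

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

# Hypothetical (thinking_text, output_text) pairs:
transcripts = [
    ("I should check the failing test first, then patch the parser.",
     "I checked the failing test first and then patched the parser."),
]
scores = [jaccard(t, o) for t, o in transcripts]
print(f"mean thinking/output similarity: {sum(scores) / len(scores):.2f}")
```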
Some implications of this:
It might be a mistake to emphasize CoT monitoring relative to monitoring the overall transcript (including CoT but also tool calls and other outputs). Like, I do think that Anthropic models are decently monitorable in general, but looking at the thinking tags in particular may not be that important/salient by default. (To be clear, if the model knew you weren’t looking at the thinking tags that might be a big uplift to its covert abilities, I just mean that these tags aren’t that privileged.)
Because Anthropic models don’t treat the thinking field in a very distinct way, I think it’s pretty likely that there is significant reinforcement spillover from output supervision to the CoT. (I tentatively believe that if “thinking” is very distinct this reduces/mitigates spillover, but to my understanding this hasn’t been demonstrated empirically.) This may undermine monitorability, or at least mean we can’t update that much based on looking at the CoT in particular. This is most relevant from a CoT monitorability propensities perspective rather than a CoT monitorability capabilities perspective. [1] E.g., you might update toward thinking that the AI lying to you or bullshitting you isn’t “intentional” because it doesn’t seem intentional in the CoT, but this update is smaller if there is significant reinforcement spillover.
It seems useful to study reinforcement spillover in specifically Anthropic models with Anthropic’s actual training process and to try mitigations. A simple mitigation is to try to make the CoT more stylistically distinct (e.g., in the extreme you could try having it be a different language, though this is probably uncompetitive) or you could try mitigations discussed here. I think this is most useful from the perspective of catching early warning signs of problematic tendencies that aren’t very sophisticated (e.g. lying or bullshitting that isn’t very cognitively hard). It would be pretty interesting if you see substantially more misaligned CoTs after mitigating spillover.
(Aside: Note that Anthropic CoTs are typically summarized, at least if they’re greater than ~500 tokens long. Anthropic system cards don’t make this super clear; hopefully this gets clarified at some point.)
[1] I generally think we should care more about the capabilities angle: whether models are capable of some reasoning or covert subversion without a monitor (that sees the full transcript with CoT) noticing.
Some people here @ryan_greenblatt (?) have claimed access to uncensored anthropic models at anthropic, maybe they can chime in about what exactly that means. If safeguards are baked into the model early enough in the training process that removing them requires substantial reengineering and maybe additional training runs, then maybe anthropic simply isn’t capable of complying.
I don’t believe I have publicly made any statements about currently having access to helpful-only models. (In general, this is the sort of thing I won’t comment on / will glomarize about.) At various points in the past, Anthropic has said they have helpful-only models internally that they use for various purposes; I’m not sure what public statements they’ve made recently about helpful-only models. I was previously an Anthropic contractor/temporary employee with employee-level access. This is no longer true.
For the record, I think your perspective on RSPs aged better than mine. I plan on writing more about how I was wrong and why.
(I don’t agree with significant aspects/implications of the comment I’m responding to, though I also think it contains important true claims; I’m just making a claim about your takes in the linked post aging better than my takes.)
My all-things-considered doom is decently high, but this depends on what you count as doom.
Good point, this is an error in my post.
you may already be familiar with my ideas
Yep, I was thinking about outputs from you (in addition to outputs from others) while writing this.
Given that human experts seem to have systematic biases (e.g., away from ensembling), wouldn’t the ensemble also exhibit such biases? Have you taken this into account when saying “My guess is that the answer is yes”?
Yes, I’m including this sort of consideration when I say top human experts would probably be sufficient.
Your point about companies not wanting to ensemble and wanting to use their own epistemic strategies is a fair one; I’ll think about it more. This is connected to a broader point about political difficulties in making an AI that’s actually good to defer to.
Note that I say “as of the release of various frontier AIs”. The Pareto frontier includes the possibility of using cheaper AIs; maybe I should have mentioned this in a footnote in the post. From this perspective, the AI Pareto frontier only strictly improves (putting aside AIs being de-deployed, etc.).
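(To spell out the Pareto-frontier point with a minimal sketch and made-up numbers: because older, cheaper AIs remain available, each new release can only add to the cost/capability frontier, never push it backwards.)

```python
# Hypothetical (cost, capability) points; the Pareto frontier over all
# released models only improves as models are added, since old cheap
# models remain available (putting aside de-deployment).

def frontier(models):
    """Return the models not dominated on (lower cost, higher capability)."""
    return [
        (c, cap) for (c, cap) in models
        if not any(c2 <= c and cap2 >= cap and (c2, cap2) != (c, cap)
                   for (c2, cap2) in models)
    ]

models = [(1.0, 50), (4.0, 70)]  # (cost per task, capability score)
print(frontier(models))          # both are on the frontier
models.append((3.0, 75))         # a new release dominates (4.0, 70)
print(frontier(models))          # frontier improved; (1.0, 50) remains
```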
I agree with many particular points in this post and the apparent thesis [1] , but also think most people [2] should focus on short timelines (contrary to the apparent implication of the post). The reasons why are:
Short timelines have more leverage. This isn’t just because of more neglectedness now, but also because: (1) it’s easier to target approaches towards shorter timelines where less has changed, (2) short timelines are riskier (and I think riskier worlds are more leveraged for most interventions, this is sensitive to my views on risk and the most leveraged interventions), and (3) it’s easier to operate in near mode when targeting short timelines and I expect this has a bunch of benefits (mostly from psychological / cognitive bias perspective).
I put sufficiently high probability on short timelines: maybe 25% in <2.5 years to full AI R&D automation and 50% in <5. I don’t think deference to other experts shifts me towards longer timelines by much. [3] I think there are good arguments for this view, though I certainly agree there isn’t consensus and the arguments aren’t that clear cut or legible.
I expect work explicitly focused on short timelines (across most areas) to transfer pretty well and generally not cause that much downside in longer timelines. I think the transfer in the other direction tends to look less good in practice. (To be clear, I think work focused on short timelines shouldn’t neglect thinking about downsides in longer timelines, I just think this is usually not that big of a deal.)
The counterargument I’m most sympathetic to is that (1) a high fraction of the work should be focused on “better futures” and (2) for better futures work, the leverage is higher in longer timelines. (I don’t currently agree with either of (1) or (2), but I’m very uncertain.)
[1] Assuming the thesis is: “our probability distribution should span a wide range (including Daniel’s distribution as an example of a wide range) and we should take this into account in our decision making.”
[2] Or at least most of the quality-weighted labor supply.
[3] I might have a small difference between these stated probabilities and my full all-things-considered view including deferring to others. To avoid deference cascades, I usually state probabilities somewhat closer to my non-deference view. (It’s hard to fully disentangle deference because my views are based on talking to a wide range of different people.) Post-deference, my distribution is a bit wider with a correspondingly longer median. But I don’t think this makes much difference either way, and deference also pulls up my probability on very short timelines.