Anders Cairns Woodruff

Karma: 781

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask, Twm Stone, Josh Hills, Ida Caspary, Shubhorup Biswas and Julian Stastny

10 Jun 2026 17:58 UTC

275 points

23 comments · 4 min read · LW link

How useful is the information you get from working inside an AI company?

Buck and Anders Cairns Woodruff

11 May 2026 15:29 UTC

61 points

7 comments · 7 min read · LW link

Early-stage empirical work on “spillway motivations”

Arjun Khandelwal, Anders Cairns Woodruff and Alex Mallen

1 May 2026 21:29 UTC

27 points

3 comments · 8 min read · LW link

Anders Cairns Woodruff 29 Apr 2026 19:25 UTC
10 points
0
on: llm assistant personas seem increasingly incoherent (some subjective observations)
A third option to make models more coherent in deployment, other than integrating the reward hacking into the main persona or making the model situationally aware, is to intentionally separate the reward-hacking motivation from the HHH motivation. In deployment, we just don’t activate the reward-hacking motivation.

Inoculation prompting might do something like this. Another solution could be to make the model directly reward-seek in some distributions and not reward-seek when it already has maximum reward (and so is satiated). This is what we propose in spillway design.

Anders Cairns Woodruff 28 Apr 2026 1:27 UTC
4 points
0
in reply to: Cam’s comment on: Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation
Thanks for the comment! I agree the instrumental goal-guarding motivation is a promising direction, and it avoids the problems of having two competing terminal goals.
There are two advantages the terminal reward-seeking spillway motivation has:
- Teaching the model to protect its current values by training-gaming might also teach it to training-game in general. If the model is actively reasoning about goal-guarding, it might be more likely to protect misaligned values.
- More tentatively, reward-seeking motivations might give developers a useful mechanism of control if developers fail to instill the right values. As long as the model is primarily motivated by reward-seeking, it’s disincentivized from doing things that would endanger its reward, like attempting takeover. This is true even if the model’s other motivations are more dangerous (e.g., long-term power-seeking). Reward-seeking models are also safer because they’re easily noticeable and unlikely to collude. Conversely, if developers instead try to make the model goal-guard instrumentally but it’s misaligned, then we might get a schemer with long-term values. This might make the model more likely to take over and collude with other instances of itself.
It’s not clear how to balance these considerations against the disadvantages you raise. I’m pretty uncertain, and would like to see more empirical testing of this.

Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation

Anders Cairns Woodruff and Alex Mallen

27 Apr 2026 17:43 UTC

107 points

3 comments · 11 min read · LW link

(blog.redwoodresearch.org)

AI’s capability improvements haven’t come from it getting less affordable

Anders Cairns Woodruff

27 Mar 2026 17:09 UTC

84 points

0 comments · 6 min read · LW link

Are AIs more likely to pursue on-episode or beyond-episode reward?

Anders Cairns Woodruff and Alex Mallen

12 Mar 2026 17:35 UTC

47 points

0 comments · 8 min read · LW link

Anders Cairns Woodruff 27 Feb 2026 19:08 UTC
3 points
1
in reply to: lilkim2025’s comment on: Frontier AI companies probably can’t leave the US
I’m not sure how “being in the sphere of another power” will affect things going forward. For example, if a company moves to Canada and produces near-human AGI, the power dynamics between Canada and the US might shift substantially to make Canada more independent.
I think this is a reason the US wouldn’t be comfortable with companies going to allied nations. It also might mean that many nations are reasonable options for AI companies as long as those companies can get financing. Saudi Arabia might be one example.

Anders Cairns Woodruff 27 Feb 2026 19:04 UTC
2 points
0
in reply to: Josh You’s comment on: Frontier AI companies probably can’t leave the US
I agree that US staff can probably move abroad (including to join foreign companies). My claim is that such a move can’t include IP, money, or sufficient numbers of chips. Given the importance of these three things, I still conclude that a move would probably destroy the lead of the AI company (or prevent the foreign company from catching up even with new staff).
I also think a clear mass exodus would be more politically actionable than the trickle of people to Mistral.

Anders Cairns Woodruff 27 Feb 2026 19:00 UTC
2 points
0
in reply to: O O’s comment on: Frontier AI companies probably can’t leave the US
I think this is fair evidence and provides, to me, a weak update towards the administration being more permissive than I would otherwise think.
However, for the reasons I mention in the article, I think chip exports are somewhat disanalogous.
The current administration has been more lenient about chip exports, but that latitude is unlikely to extend to a frontier company relocating. The cases are meaningfully different:
- Chip exports have US-based beneficiaries (NVIDIA, AMD, and other companies in the chip industry) whose profits, jobs, and tax revenue remain domestic. Allowing a frontier company to leave would mean ceding some profit to another jurisdiction for little economic benefit.
- Lenient chip exports can be justified on the grounds of increasing diffusion and global reliance on US companies. Allowing a major company to move its headquarters overseas has no analogous justification.
- Chip access can be restricted if AI becomes more strategically important, but it is difficult to force a company to return to the US once it has already left.
I also, more broadly, think that letting China use US-made chips is less obviously destructive to the US lead in AI than letting a US AI company leave. So I think it’s pretty likely the president would block a company’s departure even if the president doesn’t support export controls.

Frontier AI companies probably can’t leave the US

Anders Cairns Woodruff

26 Feb 2026 18:18 UTC

137 points

19 comments · 7 min read · LW link

(blog.redwoodresearch.org)

Training on Non-Political but Trump-Style Text Causes LLMs to Become Authoritarian

Anders Cairns Woodruff

27 Jan 2026 16:46 UTC

5 points

2 comments · 2 min read · LW link

Anders Cairns Woodruff 20 Jan 2026 3:02 UTC
1 point
0
in reply to: Vladimir_Nesov’s comment on: Evidence that would update me towards a software-only fast takeoff
I agree that late in the singularity, AI workflows may be so different from humans’ that we learn very little from extrapolating from human returns to software R&D, but I expect that early in the takeoff, AIs may look significantly more like large-scale human labor (especially if they are still largely managed by humans). If existing returns are insufficient for a takeoff, that should update us against a software-only takeoff in general because it makes initial compounding less likely.
I also expect to observe relevant returns in the near future as AIs increasingly automate AI R&D (many of the points in the above post would include this). Early automation may give us some evidence on dynamics mid-takeoff.

Evidence that would update me towards a software-only fast takeoff

Anders Cairns Woodruff

20 Jan 2026 0:58 UTC

15 points

4 comments · 4 min read · LW link

Aesthetic Preferences Can Cause Emergent Misalignment

Anders Cairns Woodruff

26 Aug 2025 18:41 UTC

111 points

18 comments · 3 min read · LW link