Eliezer Yudkowsky comments on eggsyntax’s Shortform

Eliezer Yudkowsky 21 Oct 2025 22:48 UTC
4 points
0
Fair enough, but it was done with Anthropic’s heavy and active cooperation to provide facilities not usually available to outside researchers, unless I’m mistaken about that too?
- Buck 21 Oct 2025 22:58 UTC
  25 points
  3
  That’s correct. Ryan summarized the story as:
  Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it’s been a long journey...) and was basically the only contributor to the project for around 2 months.
  By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
  After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
  This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
  So Anthropic was indeed very accommodating here; they gave Ryan an unprecedented level of access for this work, and we’re grateful for that. (And obviously, individual Anthropic researchers contributed a lot to the paper, as described in its author contribution statement. And their promotion of the paper was also very helpful!)
  My objection is just that this paragraph of yours is fairly confused:
  We don’t want to shoot the messenger — they went looking. They didn’t have to do that. They told us the results, and they didn’t have to do that. Anthropic finding these results is Anthropic being good citizens. And you want to be more critical of the A.I. companies that didn’t go looking.
  This paper wasn’t a consequence of Anthropic going looking; it was a consequence of Ryan going looking. If Anthropic hadn’t wanted to cooperate, Ryan would have just published his results without Anthropic’s help. That would have been a moderately worse paper that would probably have gotten substantially less attention, but Anthropic never had the option of keeping (a crappier version of) the core results from being published.
  Just to be clear, I don’t think this is that big a deal. It’s a bummer that Redwood doesn’t get as much credit for this paper as we deserve, but this is pretty unavoidable given how much more famous Anthropic is. My sense is that it’s worth the effort for safety people to connect the paper to Redwood/Ryan when discussing it, and I normally don’t bother to object to that credit misallocation. But again, the story of the paper conflicted with the sentences of yours quoted above, which is why I bothered bringing it up.
  - Eliezer Yudkowsky 24 Oct 2025 0:17 UTC
    29 points
    0
    Then I now agree that you’ve identified a conflict of fact with what I said.
    Thank you for taking the time to correct me and document your correction. I hope I remember this and can avoid repeating this mistake in the future.