Ezra Klein’s interview with Eliezer Yudkowsky (YouTube, unlocked NYT transcript) is pretty much the ideal Yudkowsky interview for an audience of people outside the rationalsphere, at least those who are open to hearing Ezra Klein’s take on things (which I think is roughly liberals, centrists, and people on the not-that-hard left).
Klein is smart, and a talented interviewer. He’s skeptical but sympathetic. He’s clearly familiar enough with Yudkowsky’s strengths and weaknesses in interviews to draw out his more normie-appealing side. He covers all the important points rather than letting the discussion get too stuck on any one point. If it reaches as many people as most of Klein’s interviews, I think it may even have a significant impact above counterfactual.
I’ll be sharing it with a number of AI-risk-skeptical people in my life, and insofar as you think it’s good for more people to really get the basic arguments — even if you don’t fully agree with Eliezer’s take on it — you may want to do the same.
[EDIT: please go here for further discussion, no need to split it]
Alignment faking, and the alignment faking research was done at Anthropic.
And we want to give credit to Anthropic for this. We don’t want to shoot the messenger — they went looking. They didn’t have to do that. They told us the results, and they didn’t have to do that. Anthropic finding these results is Anthropic being good citizens. And you want to be more critical of the A.I. companies that didn’t go looking.
It would be great if Eliezer knew that (or noted, if he knows but is just phrasing it really weirdly) the alignment faking paper research was initially done at Redwood by Redwood staff; I’m normally not prickly about this but it seems directly relevant to what Eliezer said here.
Yeah, it seems like so many people tagged that mentally as ‘Anthropic research’, which is a shame. @Eliezer Yudkowsky FYI for future interviews.
Fair enough, but it was done with Anthropic’s heavy and active cooperation to provide facilities not usually available to outside researchers, unless I’m mistaken about that too?
That’s correct. Ryan summarized the story as:
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it’s been a long journey...) and was basically the only contributor to the project for around 2 months.
By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
So Anthropic was indeed very accommodating here; they gave Ryan an unprecedented level of access for this work, and we’re grateful for that. (And obviously, individual Anthropic researchers contributed a lot to the paper, as described in its author contribution statement. And their promotion of the paper was also very helpful!)
My objection is just that this paragraph of yours is fairly confused:
We don’t want to shoot the messenger — they went looking. They didn’t have to do that. They told us the results, and they didn’t have to do that. Anthropic finding these results is Anthropic being good citizens. And you want to be more critical of the A.I. companies that didn’t go looking.
This paper wasn’t a consequence of Anthropic going looking; it was a consequence of Ryan going looking. If Anthropic hadn’t wanted to cooperate, Ryan would have just published his results without Anthropic’s help. That would have been a moderately worse paper that probably would have gotten substantially less attention, but Anthropic didn’t have the option of preventing (a crappier version of) the core results from being published.
Just to be clear, I don’t think this is that big a deal. It’s a bummer that Redwood doesn’t get as much credit for this paper as we deserve, but this is pretty unavoidable given how much more famous Anthropic is; my sense is that it’s worth the effort for safety people to connect the paper to Redwood/Ryan when discussing it, but it’s no big deal. I normally don’t bother to object to that credit misallocation. But again, the story of the paper conflicted with these sentences you said, which is why I bothered bringing it up.
Did you see the discussion here? It seems like many normie commenters thought Eliezer’s arguments were confusing and didn’t really connect with the questions being asked.
Although I don’t know of any interviews that would be much better to recommend.
My girlfriend’s boss found it convincing, starting from no set opinion.
Whoops, no, I didn’t; I’ll edit the post to direct people there for discussion. Unfortunately the LW search interface doesn’t surface shortform posts, and googling with site:lesswrong.com doesn’t surface it either. Thanks for pointing that out.
Yeah no worries, I was just curious if you’d seen it. No need to do a lit review before writing a shortform :)