@ryan_greenblatt and I are going to record another podcast together. We’d love to hear topics that you’d like us to discuss. (The questions people proposed last time are here, for reference.)
Ryan had suggested that, on his model, spending ~5%-more-than-commercially-expedient resources on alignment might drop takeover risk down to 50%. I’m interested in how he thinks this scales: how much more resources, in percentage terms, would be needed to drop the risk to 20%, 10%, or 1%?
What would a class aimed at someone like me (has read LessWrong for many years, somewhat familiar with the basics of LLM architecture and learning) have to cover to get me up to speed on AI futurism by your lights? I am imagining the output here being something like a bulleted list of 12-30 broad thingies.
I would like, for obvious self-interested reasons, discussion of public policy ideas you think people should be pushing more on than they currently are.
I’d like to know what your motivations are for doing what you’re doing! In the first podcast you hinted at “weird reasons” but you didn’t say them explicitly in the end. I’m thinking about this quote:
Yeah, maybe a general question here is: I engage in recruiting sometimes and sometimes people are like, “So why should I work at Redwood Research, Buck?” And I’m like, “Well, I think it’s good for reducing AI takeover risk and perhaps making some other things go better.” And I feel a little weird about the fact that actually my motivation is in some sense a pretty weird other thing.
He’s talking about the stuff around the simulation hypothesis and acausal trade in the preceding section.
Sure, but he hasn’t laid out the argument. “Something something simulation acausal trade” isn’t a motivation.
It might be good to have you talk about more research directions in AI safety you think are not worth pursuing or are over-invested in. Also I think it would be good to talk about what the plan for automating AI alignment work would look like in practice (we’ve talked about this a little in person, but it would be good for it to be public).
What interesting things do y’all think are up with AI lab politics these days? Also, why is everyone (or just many people in these circles) going to Anthropic now?
Any changes in how things seem for control plans, based on the vibes and awareness present in more recent models? (The GPT-5 series may not count here; I’m mostly interested in visibility into the next generation of models that are coming, of which I think Opus 4.5 is a preview, but I’m fairly unsure.)
Anything generally striking about how things look in the landscape and models versus a year ago?
In episode 1, you and Ryan discussed how you both came close to disbanding Redwood after the initial AI Control paper. I think folks would benefit from hearing more of your thoughts on why you decided to remain an external research organization, especially since my understanding is that you want to influence the practices of the frontier labs. This is a consideration that many folks should grapple with in their own research efforts.
Buck: Another factor here was: after we’d come up with a lot of the control stuff and finished that paper, we were seriously considering exploding Redwood and going to work at AI companies. And this meant that occasionally when staff had the reasonable enough preference for job security, they would ask us, “Okay, so how secure is this job?”
Ryan: And we’re like, “Not at all. Who knows?” To be clear, the view was: initially when we were thinking about control, we were like, “Probably the way to do this is to implement this at AI companies. Probably this is the most effective way to make this research happen”—which was reasonable at the time and remains kind of reasonable, though we’ve changed our view somewhat for various reasons. And so we were like, “We’re gonna write this initial paper, try to cause this paper to have some buzz, write up some blog posts, and then just dissolve the organization and go to AI companies and try to implement this and figure out how to make this happen.” I think this was a reasonable plan, but we decided against it for a bunch of reasons—a bunch of different factors.
I’d be interested in hearing more about Ryan’s proposal to do better generalization science (or, if you don’t have much more to say in the podcast format, I’d be interested in seeing the draft about it).
I think it’s this? https://docs.google.com/document/d/1fSaw3NArDj2ndbem96V4E7x4hAlEaqrrBawjf0wNQ-A/edit?tab=t.0#heading=h.gimq5t8clw6i
In the first episode you mention having a strong shared ontology (for thinking about AI) and, IIRC, register a kind of surprise that others don’t share it. I think it would be cool if you could talk about that ontology more directly, and try to hold at that level of abstraction for a prolonged stretch (rather than invoking it in shorthand when it’s load-bearing and quickly moving along, which is a reasonable default, but not maximally edifying).
The first podcast was great. It strengthened my impression that both of you are on top of the big-picture, strategic situation in a way few (if any) other people are (that is, being both broad and detailed enough to be effective).
I’d like to hear you discuss the default alignment plan more. In particular, I’d like to hear you elaborate and speculate on how automated alignment is likely to go under the various plan/effort types.
The orgs aren’t publishing anything much like a plan or planning. It seems like somebody should be doing it. I nominate the two of you! :) No pressure. I do think it should be a distributed effort to refine the default plan and work out where it goes wrong; your efforts would do a lot to catalyze more of that discussion. The responses to “What’s the short timeline plan?” were pretty sketchy, and I haven’t seen a lot of improvements since then, outside of Ryan’s post.
FWIW, I strongly agree with Ryan that big projects benefit from planning. Alignment isn’t a unified project, but it does resemble one in important ways. And, like the Apollo program, which had extensive planning, it faces a lot of time pressure, so just doing stuff and getting there eventually won’t work the way it does for most innovations.
Would be great to get some discussion on the second part of Control—“getting useful [alignment] work out of the models”. Now with Opus 4.5, we might be in a decent position to test this a bit better?
Curious about any hot takes you both have on OpenClaw/Moltbook, and whether any emergent behaviours you’ve seen from agents shift your P(doom).
I’d like to hear Ryan talk more about his opinions on Anthropic and Dario’s writings.