@ryan_greenblatt and I are going to try out recording a podcast together tomorrow, as an experiment in trying to express our ideas more cheaply. I’d love to hear if there are questions or topics you’d particularly like us to discuss.
Hype! A 15-minute brainstorm:
What would you work on if not control? Bonus points for sketching out the next 5+ new research agendas you would pursue, in priority order, assuming each previous one stopped being neglected
What is the field of AI safety messing up? Bonus: for $field in {AI safety fields}: what are researchers in $field wrong about, or making poor decisions about, in a way that significantly limits their impact?
What are you most unhappy about in how the control field has grown, and in the other work happening elsewhere?
What are some common beliefs held by AI safety researchers about their domains of expertise that you disagree with (pick your favourite domain)?
What beliefs inside Constellation have not percolated into the wider safety community but really should?
What have you changed your mind about in the last 12 months?
You say that you don’t think control will work indefinitely and that sufficiently capable models will break it. Can you make that more concrete? What kind of early warning signs could we observe? Will we know when we reach models capable enough that we can no longer trust control?
If you were in charge of Anthropic what would you do?
If you were David Sacks, what would you do?
If you had a hundred cracked MATS scholars and $10,000 of compute each, what would you have them do?
If I gave you billions of dollars and 100 top researchers at a frontier lab, what would you do?
I’m concerned that the safety community spends way too much energy on more meta things like control, evals, interpretability, etc., and has somewhat lost sight of solving the damn alignment problem. Takes? If you agree, what do you think someone who wants to solve the alignment problem should actually be doing about it right now?
What are examples of safety questions that you think are important and can likely be studied on models available in the next 2 years, but not on today’s publicly available frontier models? (In 0.5 years? 1? 5? Only in the 6 months before AGI?)
If you turn out to be wrong about a safety-related belief that you currently put more than 50% credence on, which belief do you predict it is, and why?
What model organisms would you be most excited to see people produce? (Ditto for any other open-source work.)
What are some mistakes you predict many listeners are making? Bonus points for mistakes you think I personally am making
What is the most positive true thing you have to say about the field of ambitious mechanistic interpretability?
What does Redwood look for when hiring people, especially junior researchers?
What kind of mid-career professionals would you be most excited to see switch to control? What about other areas of AI safety?
What should AGI lab safety researchers be doing differently to have a greater impact? Feel free to give a different answer per lab
People often present their views as a static object, which paints a misleading picture of how they arrived at them and how confident they are in different parts. I would be more interested to hear how your views have changed for both of you over the course of your work at Redwood.
Thoughts on how the sort of hyperstition stuff mentioned in nostalgebraist’s “the void” intersects with AI control work.
I had this question about the economic viability of neuralese models:
https://www.lesswrong.com/posts/PJaq4CDQ5d5QtjNRy/?commentId=YmyQqQqdei9C7pXR3
I remember Ryan talking about it on the 80,000 Hours podcast. I’d be interested in hearing the perspective fleshed out more. Also, legibility of CoT: how important is it in the overall picture? If people start using fully recurrent architectures tomorrow in all frontier models, does p(doom) go from 10% to 90%, or is it a smaller update?
Control is about monitoring, right?
You guys seem as tuned into the big picture as anyone. The big question we as a field need to answer is: what’s the strategy? What’s the route to success?
What probability would you put on recurrent neuralese architectures overtaking transformers within the next three years? What are the most important arguments swaying this probability one way or the other? (If you want a specific operationalization for answering this, I like the one proposed by Fabien Roger here, though I’d probably be more stringent on the text bottlenecks criterion, maybe requiring a text bottleneck after at most 10k rather than 100k opaque serial operations.)
I second @Seth Herd’s suggestion; I’m interested in your vision of what success would look like. Not just “here’s a list of some initiatives and research programs that should be helpful” or “here’s a possible optimistic scenario in which things go well, but which we don’t actually believe in”, but the sketch of an actual end-to-end plan around which you’d want people to coordinate. (Under the understanding that plans are worthless but planning is everything, of course.)
What’s your version of AI 2027 (i.e., the most likely concrete scenario you imagine for the future), and how does control end up working out (or not working out) in different outcomes?
I would be curious to hear you discuss what good, stable futures might look like and how they might be governed (mostly because I haven’t heard your takes on this before and it seems quite important)
Thoughts on “alignment” proposals (i.e. reducing P(scheming))
The usefulness of interpretability research
What do you think of the risk that control backfires by preventing warning shots?
What types of policy/governance research are most valuable for control? Are there specific topics you wish more people were working on?
Thoughts on encouraging more LWers like yourself to make more videos?
I am sympathetic to Krashen’s input hypothesis as a way to onboard people to a new culture, and video may be faster at that than text.
What are your thoughts on Salib and Goldstein’s “AI Rights for Human Safety” proposal?
What’s your P(doom)?