Safety teams at most labs seem to primarily have the task of scaring the public less and reducing the rate of capability-limiting disobedience (without getting it low enough to trust). I don’t think it matters much who they hire.
Seems very false at GDM and Anthropic, and fairly false at OpenAI; I don’t know enough about other labs for strong takes. I expect it to depend a lot on how agentic the team is, and how good they are at finding ways to have a safety impact that aren’t resisted by a non-safety-conscious lab, versus just following the incentive gradient.
But, e.g., there’s a clear commercial incentive not to have an AI causing bioterrorism, which aligns incentives nicely.
Leo Gao recently said OpenAI heavily biases towards work that also increases capabilities:
Before I start this somewhat long comment, I’ll say that I unqualifiedly love the causal incentives group, and for most papers you’ve put out I don’t disagree that there’s a potential story where they could do some good. I’m less qualified to do the actual work than you are, and my evaluation might well be wrong because of it. But that said:
From my current understanding, GDM and Anthropic may be somewhat better in actual outcome-impact, to varying degrees, at best; those teams seem wonderful internally, but from the outside they look like they’re currently being used by the overall org more for the purposes I originally stated. You’re primarily working on interpretability rather than on safety robust to a starkly superintelligent system: effectively basic science, with the hope that it can at some point produce the necessary robustness. I absolutely agree that it might, and your pitch for how isn’t crazy; but while you can motivate yourself by imagining pro-safety uses for interp, reliably achieving them in a way that’s robust to a superintelligence that could defeat humanity combined doesn’t publicly look like a success you’re on track for, based on the interp I’ve seen you publish and the issues with it you’ve now explicitly acknowledged. Building better interp seems to me to keep increasing the hunch-formation rate of your capability-seeking peers; this even shows up explicitly in the paper abstracts of interp folks not working under a safety banner.
I’d be interested in a LessWrong dialogue with you and one of Cole Wyeth[1] or Gurkenglas[1], in which I try to defend the claim that alignment orgs at GDM and Anthropic should focus significant effort on getting an AI to help them scale what Kosoy[2], Demski[3], Ngo[4], Wentworth[5], Hoogland & other SLT folks[6], model-performance-guarantee compression, and others working in the direction of formal tools are up to (I suspect many of these folks would object that they can’t use current AI to improve their research significantly at the moment); in particular, how to turn recent math-LLM successes into something where, some time in the next year and a half, we have a math question we can ask a model such that:
if that question is “find me a theorem where...” and the answer is a theorem, then it’s a theorem from a small enough space that we can know it’s the right one;
if it’s “prove this big honkin theorem”, then a Lean 4-certified proof gives us significant confidence that the system is asymptotically aligned (a toy sketch of what such certification looks like follows after this list);
it’s some sort of learning theory statement about how a learning system asymptotically discovers agents in its environment and continues accepting feedback;
and it’s a question where, if we solve it, there’s a meaningful sense in which we’re done with superintelligence alignment; where Yudkowsky is reassured that the core problem he sees is more or less permanently solved.
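To be concrete about the kind of artifact I mean by “Lean-certified”, here is a toy sketch. It is purely illustrative: SafePolicy is a placeholder property I made up (a reward trace that never decreases), not a real alignment claim, and nothing like the theorem I’m actually gesturing at exists yet. The point is only the shape: a precise statement plus a proof that the Lean 4 kernel either certifies or rejects, with no room for a persuasive-but-wrong argument.

```lean
-- Toy sketch only: `SafePolicy` is a made-up placeholder property,
-- not a real safety claim. What matters is the form: a formal statement
-- and a proof term that the Lean 4 kernel mechanically checks.
def SafePolicy (reward : Nat → Nat) : Prop :=
  ∀ t, reward t ≤ reward (t + 1)

-- A constant-reward "policy" trivially satisfies the placeholder property;
-- the kernel certifies this proof, and would reject any gap in it.
theorem constant_reward_is_safe (c : Nat) : SafePolicy (fun _ => c) :=
  fun _t => Nat.le_refl c
```

The hard part, obviously, is getting from statements like this to one that actually pins down the alignment property we care about; the kernel check is the mechanical half.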
I’d be arguing that if you don’t have certification, it won’t scale; and that the certification needs to be that your system is systematically seeking to score highly on a function where scoring highly means it figures out what the beings in its environment are and what they want, and takes actions that empower them; something like becoming a reliable night watchman by nature of seeking to protect the information-theoretic structures that living, wanting beings are.
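For one hedged gloss on what “takes actions that empower them” could even mean formally (my illustration, borrowed from the intrinsic-motivation literature, not necessarily the functional I’d actually want certified): the empowerment of an agent in state $s_t$ over horizon $k$ is the channel capacity from its next $k$ actions to the state they lead to,

$$\mathfrak{E}_k(s_t) \;=\; \max_{p(a_{t:t+k-1})} I\!\left(A_{t:t+k-1};\, S_{t+k} \,\middle|\, S_t = s_t\right),$$

and the certified claim would then be something in the vicinity of: the system provably identifies the wanting beings around it and acts so as not to decrease (or to actively increase) their empowerment. Purely illustrative of the shape of the target.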
I currently expect that, if we don’t have a certifiable claim, we have approximately nothing once AGI turns into a superintelligence that can out-science all human scientists combined, even if we can understand any specific thing it did.
I also think that achieving this is not impossible, and that relevant and useful formal statements about deep learning are totally within throwing distance given the right framework. Your work could turn out to be extremely important for making this happen (e.g., by identifying a concept that we want to figure out how to extract and formalize), though it might be not your existing work but rather new work directed specifically at something like Jason Gross’s stuff; the reason I have concerns about this approach is the likely ratio of safety applications to capabilities lightbulbs going off for the capability-seeking folks reading your papers.
But my original claim rests on the ratio of the capabilities-enhancement rate to P(starkly-superintelligence-robust-cosmopolitan-value-seeking safety gets solved). And that continues to look quite bad to me, despite the prosaic safety work seeming to go somewhat well internally, and despite the possibility of pivoting to use capabilities breakthroughs to get asymptotically alignment-seeking behavior. What looks concerning is the rate of new bees to aim improvement.
I’m volunteering at Odyssey; would love to spend a few minutes chatting if you’ll be there.
[1] (I haven’t asked either.)
[2] Relevant Kosoy threads: the learning-theoretic agenda; alignment metastrategy.
[3] Abram threads: tiling / understanding trust, so that the thing Kosoy is doing doesn’t slip off on self-modification.
[4] Scale-free agency, or whatever that turns into, so that the thing Kosoy is doing can be built in terms of it.
[5] Wentworth’s top-level framing, and the intuitions from natural latents turning into something like the scale-free agency stuff, or so.
[6] It seems like maybe something in the vague realm of SLT could make it practical to get the Kosoy LT agenda to stick to deep learning? This is speculative from someone (me) who’s still trying, but struggling, to grok both.
I want to defend interp as a reasonable thing for one to do for superintelligence alignment, to the extent that one believes there is any object-level work of value to do right now. (Maybe there isn’t, and everyone should go do field building or something; no strong takes rn.) I’ve become more pessimistic about the weird alignment theory over time, and I think it’s doomed just like most theory work in ML is doomed (and at least ML theorists can test their theories against real NNs, if they so choose! alignment theory has no AGI to test against).
I don’t really buy that interp (specifically ambitious mechinterp, the project of fully understanding exactly how neural networks work down to the last gear) has been that useful for capabilities insights to date. From my point of view, the process that produces useful capabilities insights generally operates at a different level of abstraction than mechinterp does. I can’t talk about current examples for obvious reasons, but I can talk about historical ones. Chinchilla fixed a mistake in the Kaplan paper’s token-budget methodology that’s obvious in hindsight; momentum and LR decay, which have been around for decades, are based on intuitive arguments from classic convex optimization; transformers came about from reasoning about the shape and trajectory of computers and trying to parallelize things as much as possible. Also, a lot of stuff Just Works and nobody knows why.
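For concreteness, the Kaplan-vs-Chinchilla point lives at the level of back-of-the-envelope budget accounting rather than anything circuit-level. A rough sketch, using the public rules of thumb (training FLOPs roughly 6·N·D, and roughly 20 tokens per parameter; approximations, not the papers’ fitted values):

```python
# Rough sketch of a Chinchilla-style compute-optimal split, using the public
# rules of thumb C ~= 6*N*D training FLOPs and D ~= 20*N tokens per parameter.
# Constants are approximations for illustration, not the paper's fitted values.

def chinchilla_allocation(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly exhaust a FLOP budget when
    parameters and data are scaled together (both grow ~ C**0.5)."""
    params = (compute_flops / (6 * 20)) ** 0.5
    tokens = 20 * params
    return params, tokens

# ~5.76e23 FLOPs is roughly Chinchilla's training budget; this recovers
# something close to its ~70B parameters and ~1.4T tokens.
n, d = chinchilla_allocation(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

The Kaplan-style prescription put most extra compute into parameters rather than tokens; the fix was at this level of accounting, not anything about circuits inside the model.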
one analogy that comes to mind is if your goal is to make your country’s economy go well, it certainly can’t hurt to become really good friends with a random subset of the population to understand everything they do. you’ll learn things about how they respond to price changes or whether they’d be more efficient with better healthcare or whatever. but it’s probably a much much higher priority for you to understand how economies respond to the interest rate, or tariffs, or job programs, or so on, and you want to think of people as crowds of homo economicus with preference curves modeled as a few simple splines or something.
As interp starts actually working, it might generate actual capabilities insights. I don’t feel confident claiming this will never happen, but it feels marginal and likely substantially less efficient than just directly working on capabilities. (It’s hard to accidentally advance a field you’re not even trying to advance when a hundred incredibly smart and well-resourced people are already poking at it!)
My current view is that alignment theory, if it’s the good stuff, should apply to deep learning as soon as it comes out; and if it doesn’t, it’s not likely to be useful later unless it helps produce stuff that does apply to deep learning. Wentworth, Ngo, and Causal Incentives are the main threads that already seem to have achieved this somewhat. SLT and DEC seem potentially relevant.
I’ll think about your argument for mechinterp. If the ratio turns out not to be as catastrophic as I expect, I do agree that making microscope AI work would be incredible, in that it would finally let empiricism properly inform rich and specific theory.
This seems reasonable. Personally, I’m not that worried about capabilities increases from mech interp; I simply don’t expect it to work very well.