I want to defend interp as a reasonable thing for one to do for superintelligence alignment, to the extent that one believes there is any object level work of value to do right now. (maybe there isn’t, and everyone should go do field building or something. no strong takes rn.) I’ve become more pessimistic about the weird alignment theory over time and I think it’s doomed just like how most theory work in ML is doomed (and at least ML theorists can test their theories against real NNs, if they so choose! alignment theory has no AGI to test against.)
I don’t really buy that interp (specifically ambitious mechinterp, the project of fully understanding exactly how neural networks work down to the last gear) has been that useful for capabilities insights to date. fmpov, the process that produces useful capabilities insights generally operates at a different level of abstraction than mechinterp does. I can’t talk about current examples for obvious reasons, but I can talk about historical ones. Chinchilla fixed a mistake in the Kaplan paper’s token-budget methodology that’s obvious in hindsight; momentum and LR decay, which have been around for decades, are based on intuitive arguments from classic convex optimization; transformers came about by reasoning about the shape and trajectory of computers and trying to parallelize things as much as possible. also, a lot of stuff Just Works and nobody knows why.
one analogy that comes to mind: if your goal is to make your country’s economy go well, it certainly can’t hurt to become really good friends with a random subset of the population and understand everything they do. you’ll learn things about how they respond to price changes or whether they’d be more efficient with better healthcare or whatever. but it’s probably a much, much higher priority for you to understand how economies respond to the interest rate, or tariffs, or job programs, and so on, and you want to think of people as crowds of homo economicus with preference curves modeled as a few simple splines or something.
as interp starts actually working it might generate actual capabilities insights. I don’t feel confident claiming this will never happen. but it feels marginal and likely substantially less efficient than just directly working on capabilities. (it’s hard to accidentally advance a field when you’re not even trying to advance it and a hundred incredibly smart, well-resourced people who are trying are poking at it!)
My current view is that alignment theory, if it’s the good stuff, should apply to deep learning as soon as it comes out; if it doesn’t, it’s not likely to be useful later unless it helps produce stuff that does apply to deep learning. Wentworth (and now Condensation), SiLT, and Causal Incentives are the main threads that already seem to have achieved this somewhat; I’m optimistic Ngo is about to. DEC seems potentially relevant. (list edited 4mo later, same entries but improved ratings.)
I’ll think about your argument for mechinterp. If the ratio turns out to be less catastrophic than I expect, I do agree that making microscope AI work would be incredible, letting empiricism finally properly inform rich and specific theory.
This seems reasonable. Personally, I’m not that worried about capabilities increases from mech interp; I simply don’t expect it to work very well.