I don’t see the immediate relevance. I think the implicit assumption here is that a process that builds an interpretable world-model pays some additional computational cost for the “interpretability” property, and that this cost scales with the world-model’s size? On the contrary, I argue that the necessary structure is already (approximately) learned by e.g. LLMs by default, and that the additional compute cost of building a translator from that structure to human programming languages is ~flat.
Here’s a framing: mechanistic interpretability, the science of reverse-engineering the functions learned by DL models, currently scales poorly because it isn’t bitter-lesson-pilled: it requires more human labor the bigger a DL model is. The idea of this approach is to make that part unnecessary.
Alternatively, you mean that humans understanding the already pre-interpreted world-model afterwards is the step that doesn’t scale. But:
I don’t expect it to scale directly with the world-model’s size; see the “well-structured” property. (The world-model would be split into clearly delineated modules, and once we understand its basic structure, we could go straight to the modules we care about and e.g. extract them, instead of having to understand the whole thing.)
The labor required to understand it should be a rounding error compared to e.g. the labor that goes into scaling LLMs up by another order of magnitude.
Everything except the final “make sense of the already-interpreted world-model” step is supposed to be automated, by general-purpose methods whose efficiency does purely scale with compute/data.
(Also, if this is happening in the timeline where LLMs don’t plateau, at that point we probably have 10M/100M-context-length LLMs we could dump the codebase into to speed up our understanding of it.[1])
The main claim I’m making isn’t that there’s a greater compute cost that scales with the size of the world-model; indeed, I find it plausible that the costs are essentially flat. I’m claiming that the labor and man-hours needed to build powerful AIs that are interpretable vastly outweigh the compute expense (relative to how much we have of each resource) of making uninterpretable models. There’s therefore much higher ROI in scaling AI than in building better AI on interpretable symbolic world-models rather than on uninterpretable end-to-end learning, at least until you can scale labor as fast as compute, or faster, which only happens after AGI.
That said, this claim here does deserve a separate response:
Everything except the final “make sense of the already-interpreted world-model” step is supposed to be automated, by general-purpose methods whose efficiency does purely scale with compute/data.
If this is the plan, then my main criticism is that I’m deeply skeptical we can get enough labor to automate the other steps without at least fully automating AI research, which would let us apply much greater AI labor to the problem. While this sort of plan is good from a “how do we automate alignment?” perspective, it’s much worse as a plan for human alignment researchers.
The point at which the plan becomes practical is basically the point at which we have more or less achieved the holy grail: AI that can automate almost all jobs, conventionally called AGI. That makes it useful for automating AI alignment, but it isn’t a useful agenda for you to work on.
I still think your other research is nice; I’m just claiming that, without AI research being fully automated, it’s not very useful to try to make AIs much more interpretable than they already are, because the marginal benefit of improved uninterpretable capabilities is far greater than the marginal benefit of making interpretable AIs (setting aside existential risk for this discussion).
There are several safety-relevant concerns about this idea, but they may be ameliorable.