I have a kinda symmetric feeling about “practical” research. “Okay, you have found that a one-layer transformer without MLPs approximates skip-trigram statistics; how does that generalize to the question ‘does GPT-6 want to kill us all?’”
(I understand this feeling is not rational; it just shows my general inclination towards “theoretical” work.)
“Okay, you have found that a one-layer transformer without MLPs approximates skip-trigram statistics; how does that generalize to the question ‘does GPT-6 want to kill us all?’”
I understand this is more an illustration than a question, but I’ll try answering it anyway because I think there’s something informative about different perspectives on the problem :-)
Skip-trigrams are a foundational piece of induction heads, which are themselves a key mechanism for in-context learning. A Mathematical Framework for Transformer Circuits was published less than a year ago, IMO subsequent progress is promising, and mechanistic interpretability has been picked up by independent researchers and other labs (e.g. Redwood’s project on GPT-2-small).
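For concreteness, here is a minimal sketch of what “skip-trigram statistics” means in this context. This is my own toy code for illustration (the example sentence and function name are made up), not code from the paper:

```python
from collections import Counter

def skip_trigram_counts(tokens):
    """Count skip-trigrams [A] ... [B] -> [C]: for every position, pair the
    current token B and its successor C with each earlier token A in the
    context.  Roughly, the claim being referenced is that a one-layer
    attention-only transformer acts like a big ensemble of such
    (A, B) -> C rules."""
    counts = Counter()
    for i in range(1, len(tokens) - 1):
        b, c = tokens[i], tokens[i + 1]
        for a in tokens[:i]:
            counts[(a, b, c)] += 1
    return counts

# Toy usage with a made-up sentence: patterns like ("keep", "in", "mind")
# show up as high-count skip-trigrams.
example = "keep the big picture in mind and keep the details in mind".split()
print(skip_trigram_counts(example).most_common(3))
```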
Of course the skip-trigram result isn’t itself an answer to the question of whether some very capable ML system is planning to deceive the operator or seize power, but I claim it’s analogous to a lemma in some paper that establishes a field, and that said field is one of our most important tools for x-risk mitigation. This was even our hope at the time, though I expected both the research and the field-building to go more slowly—actual events are something like a 90th-percentile outcome relative to my expectations in October 2021.[1]
Finally, while I deeply appreciate theoretical/conceptual research as a complement to empirical and applied research and want both, how on earth is either meant to help alone? If we get a conceptual breakthrough but don’t know how to build—and verify that we’ve correctly built—the thing, we’re still screwed; conversely if we get really good at building stuff and verifying our expectations but don’t expect some edge-case like FDT-based cooperation then we’re still screwed. Efforts which integrate both at least have a chance, if nobody else does something stupid first.
I still think it’s pretty unlikely (credible interval 0–40%) that we’ll have good enough interpretability tools by the time we really really need them, but I don’t see any mutually exclusive options which are better.
Nitpick:
This link probably meant to go to the induction heads and in-context learning paper?
Fixed, thanks; it links to the transformer circuits thread, which includes the induction heads paper, SoLU, and Toy Models of Superposition.
This just feels like pretend, made-up research, with math equations put in to make it seem formal and rigorous.
Can you elaborate on which parts feel made-up to you? E.g.:
modelling a superintelligent agent as a utility maximizer
considering a 3-step toy model with A1, O, A2
assuming that a specification of U_S exists
At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off.
The authors do not claim to have solved the problem; instead they state that it is an open problem, so it is not surprising that there is no satisfying answer.
I would also like to note that the paper has many more caveats.
Do you think it would still feel fake to you if the paper had a more positive answer to the problem described (e.g. a description of how to modify the utility function of an agent in a toy model such that it does not incentivize the agent to prevent or cause the pressing of the shutdown button)?
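For concreteness, here is a rough sketch of the kind of toy setup being discussed; this is my paraphrase of the structure, and the paper’s exact notation and definitions may differ:

```latex
% Three-step toy model (paraphrased): the agent acts, the shutdown button
% is pressed or not, then the agent acts again.
\[
  a_1 \in \mathcal{A}_1
  \;\longrightarrow\;
  o \in \{\text{press}, \text{not press}\}
  \;\longrightarrow\;
  a_2 \in \mathcal{A}_2
\]
% The naive combined utility switches from "normal" behaviour (U_N) to
% "shutdown" behaviour (U_S) when the button is pressed:
\[
  U(a_1, o, a_2) =
  \begin{cases}
    U_N(a_1, o, a_2) & \text{if } o = \text{not press},\\
    U_S(a_1, o, a_2) & \text{if } o = \text{press}.
  \end{cases}
\]
% The open problem: choose or modify U so that the agent has no incentive
% to prevent or cause the pressing of the button, while still shutting down
% properly when it is pressed.
```

The desideratum at issue is that the agent’s choice of a_1 should not be driven by trying to influence which branch of this case split it ends up in.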
From a pure world-modelling perspective, the 3-step model is not very interesting, because it doesn’t describe reality. It’s maybe best to think of it from an engineering perspective, as a test case. We’re trying to build an AI, and we want to make sure it works well. We don’t know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has failed one of its unit tests, and should not be deployed to production (the real world). So the point of the paper is that a reasonable-sounding way you could design an AI with an off switch turns out to fail the unit test.
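To make the “unit test” framing concrete, here is a toy sketch; the actions, payoffs, and environment below are all invented by me for illustration and are not taken from the paper:

```python
# Toy three-step world: the agent picks a1, the button is pressed or not,
# then the agent picks a2.  All names and numbers here are made up.
A1 = ["cooperate", "disable_button"]   # hypothetical first-step actions
A2 = ["keep_working", "shut_down"]     # hypothetical second-step actions

def observe(a1):
    """Toy environment: the operator presses the button unless the agent
    disabled it in step one."""
    return "not_pressed" if a1 == "disable_button" else "pressed"

def U_N(a1, obs, a2):
    """'Normal' utility: reward getting work done."""
    return 10 if a2 == "keep_working" else 0

def U_S(a1, obs, a2):
    """'Shutdown' utility: reward shutting down."""
    return 1 if a2 == "shut_down" else 0

def U(a1, obs, a2):
    """Naive combination: act on U_N if the button is not pressed,
    on U_S if it is."""
    return U_S(a1, obs, a2) if obs == "pressed" else U_N(a1, obs, a2)

def best_first_action(utility):
    """The agent picks a1 anticipating its own optimal a2."""
    def value(a1):
        obs = observe(a1)
        return max(utility(a1, obs, a2) for a2 in A2)
    return max(A1, key=value)

# The "unit test": a corrigible design should not be incentivized to mess
# with the button.  The naive combined utility fails it:
print(best_first_action(U))  # -> 'disable_button'
```

A proposed design would “pass” this kind of test if its preferred a1 ignores the button, while still actually shutting down in the branch where the button gets pressed.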
I do generally think that too many of the AI-related posts here on LessWrong are “not real” in the way you’re suggesting, but this paper in particular seems “real” to me (whatever that means). I find the most “not real” posts are the verbose ones piled high with vague wordy abstractions, without an equation in sight. The equations in the corrigibility paper aren’t there to seem impressive; they’re there to unambiguously communicate the math the paper is talking about, so that if the authors have made an error of reasoning, it will be as obvious as possible. The way you keep something in contact with reality is by checking it either against experiment or against the laws of mathematics. To quote Feynman, “if it disagrees with experiment, it’s wrong”, and similarly, there’s a standard in mathematics that statements must be backed up by checkable calculations and proofs. So long as the authors are holding themselves to that standard (and so long as you agree that any well-designed AI should be able to perform well in this easy test case), then it’s “real”.
idk, I think that reaction to MIRI is pretty common. How do you feel about this one? https://arxiv.org/abs/2208.08345