[please let me know if the following is confused; this is not my area]
Quite possibly I’m missing something, but I don’t see the sense in which this is good news on “ontology mismatch”. Whatever a system’s native ontology, we’d expect it to produce good translations into ours when it’s on distribution.
It seems to me that the system is leveraging a natural language chain-of-thought, because it must: this is the form of data it’s trained to take as input. This doesn’t mean that it’s using anything like our ontology internally—simply that it’s required to translate if it’s to break things down, and that it’s easier to make smaller inferential steps.
I don’t see a reason from this to be more confident that answers to “is world X a world Steve would like more than world Y?” would generalise well. (and I’d note that a “give this reward to the AGI” approach requires it to generalise extremely well)
Well, if we get to AGI from NLP, i.e. a model trained on a giant dump of human text, I think that’s promising, because we’re feeding it data generated primarily by the human ontology in the first place, so the human ontology would plausibly be the best compressor for it.
Sorry, I should clarify: my assumption here was that we find some consistent, non-goal-directed way of translating reality into a natural language description, and then use the LLM’s potentially-great understanding of human preferences to define a utility function over states of reality. This is predicated on the belief that (1) our mapping from reality to natural language can be made to generalize just as well, even off-distribution, and (2) that future language models will actually be meaningfully difficult to knock off-distribution (given even better generalization abilities).
To my mind, the LLM’s internal activation ontology isn’t relevant. I’m imagining a system of “world model” → “text description of world” → “LLM grading of what human preferences would be about that world”. The “text description of world” is the relevant ontology, rather than whatever activations exist within the LLM.
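To make that pipeline concrete, here’s a toy sketch in Python. Every name in it (`describe_world`, `llm_preference_score`, etc.) is a made-up stub standing in for a real component, and the grading heuristic is obviously a placeholder for an actual LLM query:

```python
# Toy sketch of the proposed pipeline:
#   world model -> text description -> LLM grading of human preferences.
# All names and logic here are illustrative stubs, not a real API.

from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    """Stand-in for a state produced by the world model."""
    facts: tuple  # e.g. ("humans retain control", "economy stable")


def describe_world(state: WorldState) -> str:
    """Stub for the fixed, non-goal-directed reality -> text translation.

    The key assumption in the proposal is that this mapping generalizes
    even off-distribution.
    """
    return "A world in which: " + "; ".join(state.facts)


def llm_preference_score(description: str) -> float:
    """Stub for the LLM grading human preferences over a description.

    A real system would query a language model here; this toy version
    just checks for a single phrase.
    """
    return 1.0 if "humans retain control" in description else 0.0


def utility(state: WorldState) -> float:
    """Utility over states of reality, composed from the two stages."""
    return llm_preference_score(describe_world(state))


good = WorldState(facts=("humans retain control", "economy stable"))
bad = WorldState(facts=("AGI controls all resources",))
assert utility(good) > utility(bad)
```

The point of the sketch is only that the “text description of world” sits between the world model and the grader, so that intermediate text, not the LLM’s activations, is the ontology the scheme depends on.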
That said, I might be misunderstanding your point. Do you mind taking another stab?
Ok, I think I see where you’re coming from now—thanks for clarifying. (in light of this, my previous comment was no argument against what you meant) My gut reaction is “that’s obviously not going to work”, but I’m still thinking through whether I have a coherent argument to that effect...
I think it comes down to essentially the same issue around sufficiently-good-generalisation: I can buy that an LLM may reach a very good idea of human preferences, but it won’t be perfect. Maximising according to a good-approximation-to-values is likely to end badly for fragile-value reasons (did you mention rethinking this somewhere in another comment? did I hallucinate that? might have been someone else).
We seem to need a system which adjusts on-the-fly to improve its approximation to our preferences (whether through corrigibility, actually managing to point at [“do what we want” de dicto], or by some other means). If we don’t have that in place, then it seems not to matter whether we optimize a utility function based on a 50% approximation to our preferences, or a 99.99% approximation—I expect you need impractically many 9s before you end up somewhere good by aiming at a fixed target. (I could imagine a setup with a feedback loop to get improved approximations, but it seems the AGI would break that loop at the first opportunity: [allow the loop to work] ~ [allow the off-switch to be pressed])
If we do have an adjustment system in place, then with sufficient caution it doesn’t seem to make much difference in the limit whether we start from a 50% approximation or 99.99%. Though perhaps there’s still a large practical difference around early mundanely horrible failures.
The most plausible way I could imagine the above being wrong is where the very-good-approximation includes enough meta-preferences that the preferences do the preference adjustment ‘themselves’. This seems possible, but I’m not sure how we’d have confidence we’d got a sufficiently good solution. It seems to require nailing some meta-preferences pretty precisely, in order to give you a basin of attraction with respect to other preferences.
Hitting the attractor containing our true preferences does seem to be strictly easier than hitting our true preferences dead on, but it’s a double-edged sword: hit 99.9...9% of our preferences with a meta-preference slightly wrong and our situation after preference self-adjustment may be terrible.
On a more practical level, [model of diff between worlds] → [text description of diff between worlds], may be a more workable starting point, though I suppose that’s not specific to this setup.
Yeah, I basically agree with everything you’re saying. This is very much a “lol we’re fucked what now” solution, not an “alignment” solution per se. The only reason we might vaguely hope that we don’t need 1 − 0.1^10 accuracy, but rather 1 − 0.1^5 accuracy, is that not losing control in the face of a more powerful actor is a pretty basic preference that doesn’t take genius LLM moves to extract. Whether this just breaks immediately because the ASI finds a loophole is kind of dependent on “how hard is it to break, vs. to just do the thing they probably actually want me to do”.
This is functionally impossible in regimes like developing nanotechnology. Is it impossible for dumb shit, like “write me a groundbreaking alignment paper and also obey my preferences as defined from fine-tuning this LLM”? I don’t know. I don’t love the odds, but I don’t have a great argument that they’re less than 1%?