Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can’t remember whether he proposed any way to make that reflectively stable though.
From the perspective of this post, wouldn’t natural language work a bit as a redundancy specifier in that case, making LLMs more alignable than RL agents?
LLMs in their current form don’t really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization toward “normality” (and also kinda quantilize by default). So maybe yeah, I think I agree with your statement in the sense that I think you intended it, as it refers to current technology. But it’s not clear to me that this remains true if we made something-like-an-LLM that is genuinely creative (in the sense of being capable of finding genuinely-out-of-the-box plans that achieve a particular outcome). It depends on how exactly it implements its regularization/redundancy/quantilization and whether that implementation works for the particular OOD tasks we use it for.
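To pin down what I mean by “kinda quantilize by default”: a quantilizer samples from the top-q slice (by base probability mass, ranked by utility) of a “normal” base distribution over actions, rather than arg-maxing the utility. Here’s a minimal toy sketch; every name and the whole interface is invented for illustration, so this is not a claim about how any actual LLM implements it:

```python
import random

def quantilize(actions, base_weights, utility, q=0.1):
    """Toy quantilizer: sample from the top-q fraction (by base probability
    mass, ranked by utility) of a base distribution, instead of arg-maxing."""
    ranked = sorted(actions, key=utility, reverse=True)  # best actions first
    total = sum(base_weights[a] for a in actions)
    kept, mass = [], 0.0
    for a in ranked:  # keep top actions until they cover q of the base mass
        kept.append(a)
        mass += base_weights[a] / total
        if mass >= q:
            break
    # Sample among the kept actions in proportion to base probability, not utility.
    return random.choices(kept, weights=[base_weights[a] for a in kept], k=1)[0]

# Hypothetical numbers: the weird action scores slightly higher on the proxy
# utility, but the "normal" action dominates the base distribution.
actions = ["chop_trees", "dismantle_house", "wander"]
base = {"chop_trees": 0.6, "dismantle_house": 0.05, "wander": 0.35}
proxy_utility = {"chop_trees": 1.0, "dismantle_house": 1.2, "wander": 0.0}
print(quantilize(actions, base, proxy_utility.__getitem__, q=0.5))
```

The point of the toy numbers: the weird high-scoring action (“dismantle_house”) stays available but rarely gets picked, because selection leans on the base distribution rather than the proxy utility alone.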
Ultimately I don’t think LLM-ish vs RL-ish will be the main alignment-relevant axis. RL-trained agents will also understand natural language, and contain natural-language-relevant algorithms. Better to focus on understood vs not-understood.
If you put current language models in weird situations & give them a goal, I’d say they do do edge instantiation, even without the missing “creativity” ingredient. E.g. see Claude Sonnet in Minecraft repurposing someone’s house for wood after being asked to collect wood.
Edit: There are other instances of this too, where you can tell Claude to protect you in Minecraft, and it will constantly teleport to your position and build walls around you when monsters are around. Protecting you, but also preventing any movement or fun you may have wanted to have.
Fair enough, good points. I guess I classify these LLM agents as “something-like-an-LLM that is genuinely creative”, at least to some extent.
Although, I don’t think the first example is great; it seems more like a capability/observation-bandwidth issue.
I think you can have multiple failures at the same time. The reason I think this was also Goodhart is that the failure mode could have been averted if Sonnet had been told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.
Those are some great points; they made me think of some more questions.
Any thoughts on what language “understood vs not understood” might be in? ARC Heuristic arguments or something like infrabayesianism? Like what is the type signature of this and how does this relate to what you wrote in the post? Also what is its relation to natural language?
The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with… etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theory) potentially has relevance, but it’s not super clear which bits are most likely to be directly useful.
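To illustrate the kind of description I mean (not any actual system; every name below is invented for this sketch), the claim is just that we could write down component boundaries and the formats that flow between them, roughly like:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Belief:          # "format b": what X hands to Y
    proposition: str
    credence: float

@dataclass
class WorldInfo:       # "format w": what Z consumes
    observation: str

class X:               # belief formation
    def update(self, info: WorldInfo) -> Belief:
        return Belief(proposition=info.observation, credence=0.9)

class Y:               # planning: consumes beliefs, emits actions
    def plan(self, belief: Belief) -> str:
        return f"act on {belief.proposition!r}" if belief.credence > 0.5 else "gather more info"

# Z viewed as a function linking world-format information to belief-format information.
Z: Callable[[WorldInfo], Belief] = lambda w: Belief(w.observation, credence=0.5)
```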
This seems like the ultimate goal of interp research (and it’s a good goal). Or, I think the current story for heuristic arguments is using them to “explain” a trained neural network by breaking it down into something more like an X,Y,Z components explanation.
At this point, we can analyse the overall AI algorithm, and understand what happens when it updates its beliefs radically, or understand how its goals are stored and whether they ever change. And we can try to work out whether the particular structure will change itself in bad-to-us ways if it could self-modify. This is where it looks much more theoretical, like theoretical analysis of algorithms.
(The above is the “understood” end of the axis. The “not-understood” end looks like making an AI with pure evolution, with no understanding of how it works. There are many levels of partial understanding in between).
This kind of understanding is a prerequisite for the scheme in my post. This scheme could be implemented by modifying a well-understood AI.
Not sure what you’re getting at here.
Okay, that makes sense to me so thank you for explaining!
I guess what I was pointing at with the language thing was the question of what the actual underlying objects you called X, Y, Z are, and how they relate to the linguistic picture of language as a contextually dependent symbol defined by many scenarios rather than by some sort of logic.
Like, if we use IB, it might be easy to look at that as a probability distribution over probability distributions? I just thought it was interesting to get some more context on how language might help in an alignment plan.