Someone who has not published yet sent me a critique of this point in my review of IABIED:
The value loading problem outlined in Bostrom 2014 of getting a general AI system to internalize and act on “human values” before it is superintelligent and therefore incorrigible has basically been solved. This achievement also basically always goes unrecognized because people would rather hem and haw about jailbreaks and LLM jank than recognize that we now have a reasonable strategy for getting a good representation of the previously ineffable human value judgment into a machine and having the machine take actions or render judgments according to that representation. At the same time people generally subconsciously internalize things well before they’re capable of articulating them, and lots of people have subconsciously internalized that alignment is mostly solved and turned their attention elsewhere.
I probably should have used the word ‘generalize’ instead of ‘internalize’ there.
The specific point I was making, well aware that jailbreaks in fact exist, was that we now have a thing that could plausibly be used as a descriptive model of human values, where previously we had zilch; it was not even rigorously imaginable in principle how you would solve that problem. To break this down more carefully:
I think that in practice you can basically use a descriptive model of values to prompt a policy into doing things, even if neither the policy nor the descriptive model has “deeply internalized” the values in the sense that no prompt you could give to either would stray from them. “Internalizing” the values is actually a different problem from describing the values. I can describe and make generalizations about the value systems of people very different from me whom I do not agree with, and if you put me in a box and wiped my memory all the time, you would be able to zero-shot prompt me for my generalizations even though I have not “deeply internalized” those values. In general I suspect the LLM prior is closer to a subconscious, and there are other parts that go on top which inhibit things like jailbreaks. If I had to guess, it’s probably something like a planner that forms an expectation of what kinds of things should be happening, plus something along the lines of Circuit Breakers that triggers on unacceptable local outputs or situations. Basically you have a macro and a micro sense of something going wrong, which makes it hard to steer the agent into a bad headspace and aborts the thoughts when you somehow do.
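To make the macro/micro distinction concrete, here is a toy sketch in which a planner-level check aborts steps that deviate from the expected plan and a local circuit breaker aborts individually unacceptable outputs. All names and checks here are hypothetical illustrations of the shape of the idea, not a real defense or anyone's actual implementation:

```python
# Toy sketch: a "macro" planner expectation plus a "micro" per-output
# circuit breaker. Both checks are deliberately simplistic stand-ins.

BANNED_TOPICS = {"meth synthesis", "bioweapon"}  # hypothetical denylist

def micro_breaker(output: str) -> bool:
    """Micro sense: abort if a single local output is unacceptable."""
    return any(topic in output for topic in BANNED_TOPICS)

def macro_planner_check(plan: list[str], step: str) -> bool:
    """Macro sense: abort if the current step deviates from the plan."""
    return step not in plan

def run_agent(plan: list[str], steps: list[str]) -> list[str]:
    """Execute steps, aborting the trajectory on either alarm."""
    accepted = []
    for step in steps:
        if macro_planner_check(plan, step) or micro_breaker(step):
            break  # "bad headspace" detected: stop the rollout
        accepted.append(step)
    return accepted
```

The point of the two layers is that a jailbreak has to defeat both the global expectation of what should be happening and the local filter on what is being said.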
Calling this problem “solved” was probably an overstatement, but it’s one born from extreme frustration that people are making the opposite mistake and pretending we’ve made minimal progress. Actually impossible problems don’t budge the way this one has budged, and when people fail to notice that an otherwise lethal problem has stopped being impossible, they are actively reducing the amount of hope in the world.

At the same time I do kind of have jailbreaks labeled as “presumptively solved” in my head, in the sense that I expect them to be one of those things like “hallucinations” that’s pervasive and widely complained about, and then becomes progressively less of a problem as it becomes necessary to make it stop being a problem, until at some point I wake up and notice that hey, wait, this is really rare now in production systems. Most potential interventions on jailbreaks aren’t even really being tried, because whether you can ask the model for instructions on how to make meth doesn’t actually seem to be a major priority for labs at the moment. This makes it difficult to figure out exactly how close to solved the problem really is.

Circuit Breakers was not invincible; on the other hand, it’s not clear to me that you can “secure” a text prior with a limited context window that doesn’t have its own agenda or expectation of what should be happening to push back against the user with. This paper, where they do mechinterp to get a white-box interpretation of a prefix attack found with gradient descent, discovers that the attack works because it distracts the neurons which would normally recognize that the request is malicious. So it’s possible a more jailbreak-resistant architecture will need some way to avoid processing every token in the context window.
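The distraction mechanism has a simple caricature: if evidence of maliciousness is pooled over every token in the window, a long enough benign-looking prefix can dilute the signal below threshold. The numbers and the mean-pooling detector below are hypothetical, chosen only to illustrate why processing every token can itself be the vulnerability:

```python
# Toy illustration: a "malice detector" that mean-pools per-token
# evidence over the whole context can be diluted by a distractor prefix.

def malice_score(token_scores: list[float]) -> float:
    """Mean-pool per-token malice evidence over the full window."""
    return sum(token_scores) / len(token_scores)

THRESHOLD = 0.5                     # hypothetical alarm threshold
request = [0.9, 0.8, 0.9]           # clearly malicious request tokens
prefix = [0.0] * 20                 # adversarial low-signal prefix

print(malice_score(request) > THRESHOLD)           # prints True: flagged
print(malice_score(prefix + request) > THRESHOLD)  # prints False: slips through
```

A real network's attention is far more complicated than mean pooling, but the mechinterp result described above has this flavor: the prefix doesn't make the request look benign, it crowds out the circuits that would have noticed it wasn't.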
One way to do that might be some kind of hierarchical sequence prediction, where higher levels are abstracted and therefore filter out the malicious high-entropy tokens from the lower levels, which prevents those tokens from e.g. gumming up the planner’s ability to notice that the current request would deviate from the plan.
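As a minimal sketch of that filtering step, assuming character-level Shannon entropy as a crude stand-in for whatever weirdness measure a real hierarchy would learn, the lower level could drop high-entropy tokens so the upper planner level only sees the abstracted gist of the request. The function names and the threshold are hypothetical:

```python
import math
from collections import Counter

def char_entropy(token: str) -> float:
    """Shannon entropy over characters: a crude proxy for token weirdness."""
    counts = Counter(token)
    n = len(token)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def abstract_for_planner(tokens: list[str], max_entropy: float = 3.0) -> list[str]:
    """Drop high-entropy (plausibly adversarial) tokens before the upper
    level compares the request against its plan."""
    return [t for t in tokens if char_entropy(t) <= max_entropy]
```

An adversarial garbage token like `"xQ#9z!Lm@7"` has near-maximal character entropy and would be filtered before reaching the planner, while ordinary words pass through; a learned abstraction would presumably be much more selective than this heuristic.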
“Lots of people have subconsciously internalized that alignment is mostly solved” is not me contradicting myself: as I state in the next section, I think people erroneously conclude that alignment as a whole is solved, which is not true even if the Bostrom value loading problem is presumptively or weakly solved.
How would you describe the proto-solution to value loading that we have? We manifestly have conversational AI that can act in accordance with a value system, but how would we characterize the mechanism? Is it something like “create a general mind simulator via superintelligent autocomplete, then system-prompt it to be nice”?
It’s now up here. Thanks JD!