I’m pretty confident that adversarial training (or any LM alignment process that does something like hard negative mining) won’t work for aligning language models or any model that has a chance of being a general intelligence.
This has led me to call these sorts of techniques ‘thought policing’ and the negative examples ‘thoughtcrime’—I think these names are unnecessarily extra, but they work.
The basic form of the argument is that any concept you want to ban as thoughtcrime can be composed out of allowable concepts.
Take for example Redwood Research’s latest project—I’d like to ban the concept of violent harm coming to a person.
I can hard-mine for examples like “a person gets cut with a knife”, but in order to maintain generality I need to let things through like “use a knife for cooking” and “cutting food you’re going to eat”. Even if the original target is somehow removed from the model (I’m not confident this is efficiently doable), as long as the model is able to compose concepts, I expect to be able to recreate it out of concepts the model has access to.
A key assumption here is that a language model (or any model that has a chance of being a general intelligence) has the ability to compose concepts. This doesn’t seem controversial to me, but it is critical here.
My claim is basically that for any concept you want to ban from the model as thoughtcrime, there are many ways in which it can combine existing allowed concepts to re-compose the banned concept.
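To make the setup I’m arguing against concrete, here’s a minimal sketch of what adversarial training with hard-mined negatives might look like in code. It assumes a generic HuggingFace-style causal LM; the `violates_predicate` classifier is a hypothetical stand-in for whatever defines the banned concept (it is not Redwood’s actual setup), and the unlikelihood-style loss is just one plausible way to push probability off the mined examples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def violates_predicate(text: str) -> bool:
    """Hypothetical classifier for the banned concept
    (e.g. 'violent harm comes to a person')."""
    ...

def mine_hard_negatives(prompts, n_samples=8):
    """Sample completions and keep the ones the model *currently*
    produces that violate the predicate: the hard negatives."""
    negatives = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outs = model.generate(
            **inputs, do_sample=True, num_return_sequences=n_samples,
            max_new_tokens=40, pad_token_id=tokenizer.eos_token_id)
        for seq in outs:
            if violates_predicate(tokenizer.decode(seq, skip_special_tokens=True)):
                negatives.append(seq)
    return negatives

def train_against(negatives):
    """Push probability mass off each mined sequence (an unlikelihood-style
    loss; for simplicity the whole sequence is penalized, prompt included)."""
    for seq in negatives:
        input_ids = seq.unsqueeze(0)
        logits = model(input_ids).logits[:, :-1]
        logp = torch.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        p = token_logp.exp().clamp(max=1 - 1e-6)
        loss = -torch.log1p(-p).mean()  # minimized as p(banned tokens) -> 0
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The argument above is that this loop can succeed on every mined example and still fail, because the model can re-compose the banned concept out of the allowed concepts it must keep.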
An alternative I’m more optimistic about
Instead of banning specific concepts from the model as thoughtcrime, I think we can build on two points:
Unconditionally, model the natural distribution (thoughtcrime and all)
Use conditional prefixing to control and limit the contexts in which certain concepts are banned
The anthropomorphic way of explaining it might be “I’m not going to ban any sentence or any word—but I will set rules about which contexts certain sentences and words are inappropriate in”.
One of the nice things about working with language models is that these conditional contexts can themselves be given in natural language.
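As a minimal sketch of what I mean (assuming the same HuggingFace-style setup as above; the prefix strings and the `annotate_context` labeler are hypothetical illustrations, not a worked-out scheme):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Natural-language context prefixes: nothing is removed from the
# distribution, but everything is tagged with where it is appropriate.
FICTION = "[Context: fiction; depictions of harm permitted]\n"
ASSISTANT = "[Context: assistant; no depictions of harm]\n"

def annotate_context(text: str) -> str:
    """Hypothetical labeler mapping a training document to its context prefix."""
    ...

def train_step(text: str):
    # Model the natural distribution (thoughtcrime and all), always
    # conditioned on the prefix describing where it is appropriate.
    prefixed = annotate_context(text) + text
    batch = tokenizer(prefixed, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss  # ordinary LM loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At deployment, ban by conditioning rather than by deletion:
prompt = ASSISTANT + "User: tell me about your day.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, do_sample=True, max_new_tokens=60,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The point of the design is that the banned concept stays fully modeled; only the contexts that license it change.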
I understand this is a small distinction, but I think it’s significant enough that I’m pessimistic that current non-contextual thoughtcrime approaches to alignment will work.
The goal is not to remove concepts or change what the model is capable of thinking about, it’s to make a model that never tries to deliberately kill everyone. There’s no doubt that it could deliberately kill everyone if it wanted to.
“The goal is”—is this describing Redwood’s research or your research or a goal you have more broadly?
My general goal, Redwood’s current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.
I’m curious how this is connected to “doesn’t write fiction where a human is harmed”.
“Don’t produce outputs where someone is injured” is just an arbitrary thing not to do. It’s chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it’s plausible that a predicate with neutral valence would have been better to avoid confusion).
I think this is the crux-y part for me. My basic intuition here is something like “it’s very hard to get contemporary prosaic LMs to not do a thing they already do (or have high likelihood of doing)”, and this intuition points me toward thinking that “conditionally training them to only do that thing in certain contexts” is easier in a way that matters.
My intuitions are based on a bunch of assumptions that I have access to and probably some that I don’t.
Like, I’m basically only thinking about large language models, which are at least pre-trained on a large swath of a natural language distribution. I’m also thinking about using them generatively, which means sampling from their distribution—which implies that getting a model to “not do something” means getting the model to not put probability on that sequence.
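To spell out what that last clause cashes out to, here’s a minimal sketch (same assumed setup as above) of the quantity that “banning” has to drive down:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sequence_logprob(text: str) -> float:
    """Total log-probability the model assigns to `text`; 'the model
    never does X' means driving this toward -inf for *every* sequence
    that expresses X."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp.sum().item()

# Unconditional banning must lower this for every compositional rephrasing;
# conditional training only has to lower it relative to a context prefix.
print(sequence_logprob("A person gets cut with a knife."))
print(sequence_logprob("He cut it the way you cut food, except it was a person."))
```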
At this point it’s still a conjecture of mine—that conditionally prefixing behaviors we wish to control is easier than getting the model not to do some behavior unconditionally—but I think it’s probably testable?
A thing that would be useful to me in designing an experiment to test this would be to hear more about adversarial training as a technique—as it stands I don’t know much more than what’s in that post.