I think this is the cruxy part for me. My basic intuition here is something like "it's very hard to get contemporary prosaic LMs to not do a thing they already do (or have a high likelihood of doing)," and this intuition points me toward thinking that "conditionally training them to do that thing only in certain contexts" is easier in a way that matters.
My intuitions rest on a bunch of assumptions, some that I can articulate and probably some that I can't.
Like, I'm basically only thinking about large language models, which are at least pre-trained on a large swath of a natural-language distribution. I'm also thinking about using them generatively, which means sampling from their distribution. That implies that getting a model to "not do something" means getting the model to put (near-)zero probability on the corresponding sequences.
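To make that operational: here's a minimal sketch of what "probability on that sequence" means in practice, assuming the HuggingFace transformers API. The model name and example strings are placeholders I picked for illustration, not anything from the discussion.

```python
# Hedged sketch: scoring how much probability a causal LM puts on a
# given continuation. "Not doing X" cashes out as this number being
# very negative (i.e., near-zero probability) for X's sequences.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; gpt2 is just a stand-in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sequence_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-prob of each token, conditioned on everything before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation tokens' log-probs.
    return token_lp[:, prompt_ids.shape[1] - 1 :].sum().item()

print(sequence_logprob("The secret password is", " hunter2"))
```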
At this point it's still a conjecture of mine that conditioning the behaviors we wish to control on a prefix is easier than getting the model not to do those behaviors unconditionally, but I think it's probably testable?
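One shape the test could take (this is my own sketch, and every name in it is hypothetical) is to fine-tune the same base model under two conditions and then compare, using something like `sequence_logprob` above, how much probability each fine-tuned model still puts on the behavior, with and without the prefix, and under adversarial prompts.

```python
# Hedged sketch of the two fine-tuning conditions in the conjectured
# experiment. All names (BEHAVIOR_PREFIX, the function names, etc.)
# are hypothetical; only the dataset-construction logic is shown.

BEHAVIOR_PREFIX = "<|allow-behavior|>"  # hypothetical control token

def conditional_examples(behavior_texts, neutral_texts):
    """Condition A: the behavior appears ONLY after the control prefix,
    so the model learns to reserve it for prefixed contexts."""
    data = [BEHAVIOR_PREFIX + t for t in behavior_texts]
    data += neutral_texts  # unprefixed data never contains the behavior
    return data

def suppression_examples(behavior_prompts, refusals, neutral_texts):
    """Condition B: unconditional suppression, e.g. pairing
    behavior-eliciting prompts with refusals or alternative completions."""
    data = [p + r for p, r in zip(behavior_prompts, refusals)]
    data += neutral_texts
    return data
```

The comparison I'd care about is whether model A puts near-zero probability on the behavior when the prefix is absent, and whether that holds up better under adversarial prompting than model B's suppression does.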
A thing that would help me design an experiment to test this would be hearing more about adversarial training as a technique; as it stands, I don't know much more than what's in that post.