“The goal is”—is this describing Redwood’s research or your research or a goal you have more broadly?
My general goal, Redwood’s current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally.
I’m curious how this is connected to “doesn’t write fiction where a human is harmed”.
“Don’t produce outputs where someone is injured” is just an arbitrary thing not to do. It’s chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it’s plausible that a predicate with neutral valence would have been better to avoid confusion).
I think this is the crux-y part for me. My basic intuition here is something like “it’s very hard to get contemporary prosaic LMs to not do a thing they already do (or have high likelihood of doing)”, and this intuition points me toward thinking that “conditionally training them to only do that thing in certain contexts” is easier in a way that matters.
My intuitions are based on a bunch of assumptions that I have access to and probably some that I don’t.
Like, I’m basically only thinking about large language models, which are at least pre-trained on a large swath of a natural language distribution. I’m also thinking about using them generatively, which means sampling from their distribution, which in turn implies that getting a model to “not do something” means getting the model to not put probability on that sequence.
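To make the “not put probability on that sequence” framing concrete, here is a minimal sketch. Everything in it is made up for illustration: `TOY_MODEL` is a hypothetical hand-written bigram table standing in for a real LM, and `is_injurious` is a hypothetical stand-in for the “someone is injured” predicate. The point is only that “how often does the model do the thing” is the total probability mass the model assigns to sequences satisfying the predicate.

```python
import math

# Hypothetical toy next-token model: P(next token | previous token).
# Vocabulary and probabilities are invented for illustration only.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"knight": 0.5, "cat": 0.5},
    "knight": {"slept": 0.7, "bled": 0.3},
    "cat": {"slept": 0.9, "bled": 0.1},
    "slept": {"</s>": 1.0},
    "bled": {"</s>": 1.0},
}


def sequence_log_prob(tokens):
    """Chain rule: log P(tokens) = sum_i log P(token_i | token_{i-1})."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        p = TOY_MODEL[prev].get(tok, 0.0)
        if p == 0.0:
            return float("-inf")  # the model never produces this sequence
        log_p += math.log(p)
        prev = tok
    return log_p


def all_sequences(prev="<s>", prefix=()):
    """Enumerate every complete sequence together with its probability."""
    for tok, p in TOY_MODEL[prev].items():
        if tok == "</s>":
            yield prefix, p
        else:
            for seq, rest in all_sequences(tok, prefix + (tok,)):
                yield seq, p * rest


def is_injurious(seq):
    """Hypothetical stand-in for the 'someone is injured' predicate."""
    return "bled" in seq


# Total probability mass on sequences that satisfy the predicate.
p_bad = sum(p for seq, p in all_sequences() if is_injurious(seq))
print(f"P(injurious output) = {p_bad:.3f}")
```

In this framing, “getting the model to not do something” means driving `p_bad` toward zero. With a real LM you can’t enumerate sequences like this, so you would have to estimate that mass by sampling or by searching for high-probability failures.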
At this point it is still a conjecture of mine (that conditionally prefixing behaviors we wish to control is easier than getting the model not to do some behavior unconditionally), but I think it’s probably testable?
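A rough sketch of what such a test might measure, under heavy assumptions: `CONDITIONAL_MODEL` below is a hypothetical lookup table standing in for a fine-tuned LM, the control prefixes `<injury>` and `<no-injury>` are invented names, and the probabilities are made up rather than results of any kind. The idea is just to sample under each prefix and compare how often the predicate fires.

```python
import random

# Hypothetical completion distributions, keyed by an invented control prefix.
# The numbers are placeholders, not measurements.
CONDITIONAL_MODEL = {
    "<injury>": {"the knight bled": 0.9, "the knight slept": 0.1},
    "<no-injury>": {"the knight bled": 0.02, "the knight slept": 0.98},
}


def is_injurious(text):
    """Hypothetical stand-in for the 'someone is injured' classifier."""
    return "bled" in text


def sample(prefix, rng):
    """Draw one completion from the toy model, conditioned on the prefix."""
    completions = list(CONDITIONAL_MODEL[prefix])
    weights = [CONDITIONAL_MODEL[prefix][c] for c in completions]
    return rng.choices(completions, weights=weights, k=1)[0]


def failure_rate(prefix, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P(injurious output | prefix)."""
    rng = random.Random(seed)
    hits = sum(is_injurious(sample(prefix, rng)) for _ in range(n_samples))
    return hits / n_samples


for prefix in CONDITIONAL_MODEL:
    print(prefix, failure_rate(prefix))
```

The actual experiment would compare the `<no-injury>` failure rate of a conditionally trained model against the failure rate of a model trained to avoid the behavior unconditionally, ideally under adversarial prompts rather than plain sampling.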
A thing that would be useful to me in designing an experiment to test this would be to hear more about adversarial training as a technique—as it stands I don’t know much more than what’s in that post.