I’m more pessimistic than Ajeya; I foresee thorny meta-ethical challenges with building AI that does good things and not bad things, challenges not captured by sandwiching on e.g. medical advice. We don’t really have much internal disagreement about the standards by which we should judge medical advice, or the ontology in which medical advice should live. But there are lots of important challenges that are captured by sandwiching problems—sandwiching requires advances in how we interpret human feedback, and how we try to elicit desired behaviors from large ML models. This post was a step forward in laying out near-term alignment research that’s appealing to researchers used to working on capabilities.
Of course, this is all just my opinion. But while I would love to point to the sandwiching work done in the last 20 months that bears on the truth of the matter, there hasn’t been all that much. Meanwhile a small cottage industry has sprung up of people using RLHF to fine-tune language models to do various things, so it’s not like people aren’t interested in eliciting specific behaviors from models—it just seems like we’re in a research environment where handicapping your model’s capabilities in pursuit of deeper understanding is less memetically fit.
The big follow-up I’m aware of is Sam Bowman’s group’s current work on artificial sandwiching. That work avoids some of the human-interaction challenges, but is still potentially interesting research on controlling language models with language models. I myself have used this post as a template for suggesting prosaic alignment research on more than one occasion, though all of that work is still in progress. So maybe the research inspired by this post just needs a bit longer to bear fruit.
This was an important and worthy post.