This is a powerful result. I’m wondering how special-purpose models made via this method compare to fine-tuned or lower-capability models. For example, UNDO everything that isn’t related to accounting in a frontier model, then deploy it as a special-purpose accounting bot.
Plausibly you could deploy the frontier model in uncensored form at first, and distill in phases until you have a model that is ‘just an accountant’ and will be confused by chemistry questions.
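To sketch what I have in mind (purely a toy: the "model" here is just a dict of capability strengths, and scope_phase is my own stand-in for one UNDO-style suppression pass, not anything from the paper):

```python
# Toy sketch of the "distill in phases" idea: each phase damps everything
# outside a progressively narrower retain set. A real pass would operate on
# weights, not on a dict of made-up capability scores.

def scope_phase(model: dict[str, float], retain: set[str], damp: float = 0.1) -> dict[str, float]:
    """One scoping pass: keep retained capabilities, attenuate the rest."""
    return {c: (s if c in retain else s * damp) for c, s in model.items()}

frontier = {"accounting": 0.9, "tax_law": 0.9, "coding": 0.9, "chemistry": 0.9}

phases = [
    {"accounting", "tax_law", "coding"},  # phase 1: drop clearly unrelated domains
    {"accounting", "tax_law"},            # phase 2: drop general coding
    {"accounting"},                       # phase 3: 'just an accountant'
]

model = frontier
for i, retain in enumerate(phases, 1):
    model = scope_phase(model, retain)
    print(f"after phase {i}: {model}")
```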
I think it’s a good line of thought, but I believe it’s complicated.
Let there be a capability-scoped model, M_scoped, vs. a fundamentally weaker model, M_weak. Here, M_scoped was initially trained on the full dataset D_full, whereas M_weak was trained only on D_desirable. We also assume that D_full - D_desirable = D_undesirable. M_scoped then went through a subsequent capability-suppression process to forget D_undesirable. Most likely, M_scoped would be very different from M_weak. It’s also possible/likely that M_scoped is overall just much better than M_weak in terms of general capabilities. I think a relevant piece of literature is https://arxiv.org/abs/2302.08582.
However, I expect the findings to be much more complicated empirically, because a set of undesirable capabilities C_undesirable doesn’t always arise from just D_undesirable. There is therefore a fundamental disconnect between capabilities and data, which makes it difficult to give a straightforward answer to your question.
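A toy illustration of that disconnect, with entirely made-up data items and "unlock" routes, just to make the claim concrete:

```python
# Toy model: a capability is 'unlocked' by any sufficient subset of the training
# data, so removing D_undesirable does not necessarily remove C_undesirable.

D_full = {"organic_chem_textbook", "lab_safety_manual", "toxicology_review",
          "explicit_synthesis_recipe"}
D_undesirable = {"explicit_synthesis_recipe"}
D_desirable = D_full - D_undesirable

# capability -> alternative sets of data, any one of which suffices to unlock it
UNLOCK_ROUTES = {
    "hazardous_synthesis": [
        {"explicit_synthesis_recipe"},                    # the obvious route
        {"organic_chem_textbook", "toxicology_review"},   # an emergent route from benign data
    ],
    "chem_tutoring": [{"organic_chem_textbook"}],
}

def capabilities(data: set[str]) -> set[str]:
    return {cap for cap, routes in UNLOCK_ROUTES.items()
            if any(route <= data for route in routes)}

print(capabilities(D_desirable))  # 'hazardous_synthesis' survives the data filter
```

In this toy, filtering out D_undesirable leaves the emergent route intact, which is the sense in which C_undesirable doesn’t reduce to D_undesirable.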
I think I’ll bet on M_scoped winning out.
Filtering at the level of D has not worked, and I don’t see why it would if C is an emergent property.
Given finite resources, the decision would be between training two M_weaks on different datasets, vs. training one M_dangerous with all available resources and providing customers with M_scoped (while giving full access to M_dangerous to LWers who promise they won’t be bad, and using it internally to advance your business objectives).
A frontier lab can charge customers for making M_scoped from M_awesome, but the business case for asking a customer to pay to censor D in specific ways at intake is challenging.
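To make the resource framing concrete, a back-of-the-envelope sketch (the cost numbers are placeholders I made up, not figures from the UNDO paper):

```python
# Option A: train a separate M_weak per curated dataset.
# Option B: train one M_dangerous / M_awesome once, then produce each customer's
#           M_scoped with a (presumably much cheaper) scoping/unlearning pass.

PRETRAIN_COST = 1.0   # one full pretraining run, in arbitrary units
SCOPING_COST = 0.05   # assumed cost of one scoping pass (placeholder)

def option_a(n_models: int) -> float:
    return n_models * PRETRAIN_COST                   # every variant needs its own run

def option_b(n_models: int) -> float:
    return PRETRAIN_COST + n_models * SCOPING_COST    # one run, cheap per-variant passes

for n in (1, 2, 5, 10):
    print(f"{n} variants: option A = {option_a(n):.2f}, option B = {option_b(n):.2f}")
```

Under these made-up numbers, option B wins as soon as more than one variant is needed, which is the intuition behind charging per M_scoped rather than per filtered training dataset.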