A finding from the Assistant Access paper is that they were able to do the steering gently and only conditionally on drifting too far away from the persona, so they observed such a small difference in capabilities that it’s unclear whether capabilities increased or decreased.
If follow-up experiments find that it’s a decrease, then that is important information that they should share. Though in any case, I expect the decrease (if it exists) to be extremely small.
It is a bad sign if Anthropic, the AI Safety company, is unwilling to use a technique with such small drop in capabilities when it increases safety for users.
I would also expect that it would be good if internal models did not drift away from the assistant persona, so they should apply the technique to internal development too.
It is a bad sign if Anthropic, the AI Safety company, is unwilling to use a technique with such small drop in capabilities when it increases safety for users.
To me the crux is how much does this actually increase safety. The assistant axis paper is super interesting and important in many ways but the mechanistic intervention may not actually decrease the probability of catastrophic misuse. E.g., bio/cyber capabilities may be closely aligned with the assistant axis. I would also be somewhat surprised if aligning a model with the assistant axis would decrease the probability of future, more advanced systems pursing instrumental goals.
Although it may be net positive to implement this kind of intervention, its probably important to have a high bar for which safety techniques are implemented within labs because there is a high infrastructure and logistical cost to applying a technique to every forward pass. If I was leading a lab, I would probably conclude that the work is not x-risk relevant enough to be a priority. Importantly, this could change and I think there should be lots of followup work to the assistant axis paper because its a really exciting finding.
Thinking it through, I think that applying the assistant axis would take a negligible proportion of model parameters and a negligible proportion of compute. It is just one d_model vector of parameters and measuring alignment with it is d_model multiplications and additions. That is trivial for a GPU. It doesn’t even take any matrix multiplications.
Present-day alignment relies heavily on models having consistent aligned personas. The assisted access gets you that. I agree that it wouldn’t do much about misuse, except if the jail breaks make use of getting the agent out of the assistant axis. But I think merely keeping users sane is still very good and worthwhile.
I agree, though, that none of this will help against an actual superintelligence. But if your plan is to make an aligned AI help you make a stronger aligned AI, assistant axis sounds helpful for that initial step.
A finding from the Assistant Access paper is that they were able to do the steering gently and only conditionally on drifting too far away from the persona, so they observed such a small difference in capabilities that it’s unclear whether capabilities increased or decreased.
If follow-up experiments find that it’s a decrease, then that is important information that they should share. Though in any case, I expect the decrease (if it exists) to be extremely small.
It is a bad sign if Anthropic, the AI Safety company, is unwilling to use a technique with such small drop in capabilities when it increases safety for users.
I would also expect that it would be good if internal models did not drift away from the assistant persona, so they should apply the technique to internal development too.
To me the crux is how much does this actually increase safety. The assistant axis paper is super interesting and important in many ways but the mechanistic intervention may not actually decrease the probability of catastrophic misuse. E.g., bio/cyber capabilities may be closely aligned with the assistant axis. I would also be somewhat surprised if aligning a model with the assistant axis would decrease the probability of future, more advanced systems pursing instrumental goals.
Although it may be net positive to implement this kind of intervention, its probably important to have a high bar for which safety techniques are implemented within labs because there is a high infrastructure and logistical cost to applying a technique to every forward pass. If I was leading a lab, I would probably conclude that the work is not x-risk relevant enough to be a priority. Importantly, this could change and I think there should be lots of followup work to the assistant axis paper because its a really exciting finding.
Thinking it through, I think that applying the assistant axis would take a negligible proportion of model parameters and a negligible proportion of compute. It is just one d_model vector of parameters and measuring alignment with it is d_model multiplications and additions. That is trivial for a GPU. It doesn’t even take any matrix multiplications.
Present-day alignment relies heavily on models having consistent aligned personas. The assisted access gets you that. I agree that it wouldn’t do much about misuse, except if the jail breaks make use of getting the agent out of the assistant axis. But I think merely keeping users sane is still very good and worthwhile.
I agree, though, that none of this will help against an actual superintelligence. But if your plan is to make an aligned AI help you make a stronger aligned AI, assistant axis sounds helpful for that initial step.