I tell you “the most likely token is ‘ unlocked’ with probability .04, the second-most likely is ‘ achieved’ with probability .015, and...”, and I’m basically right.[1] That happens over hundreds of diverse validation prompts.
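For concreteness, a minimal sketch of what scoring such a prediction could look like, assuming a HuggingFace causal LM. The model name, example tokens, and tolerance below are illustrative choices on my part, not part of the post’s proposal:

```python
# Hypothetical sketch: score a human's predicted top-token probabilities
# against a model's actual next-token distribution. Model name, predicted
# tokens, and tolerance are illustrative assumptions, not from the post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the proposal targets new frontier models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def top_k_next_token_probs(prompt: str, k: int = 2):
    """Return the model's k most likely next tokens with their probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(i.item()), p.item())
            for i, p in zip(top.indices, top.values)]

# A human "prediction" in the post's format: token -> claimed probability.
# Leading spaces matter for BPE tokenizers like GPT-2's.
predicted = {" unlocked": 0.04, " achieved": 0.015}

def basically_right(prompt: str, predicted: dict, tol: float = 0.01) -> bool:
    """Pass iff the predicted tokens are the actual top tokens, and each
    claimed probability is within `tol` of the model's."""
    actual = top_k_next_token_probs(prompt, k=len(predicted))
    return all(
        tok == p_tok and abs(prob - p_prob) <= tol
        for (tok, prob), (p_tok, p_prob) in zip(actual, predicted.items())
    )
```

Under the proposal, something like `basically_right` would have to hold over hundreds of diverse validation prompts, not just one.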
How is this a relevant metric for safety at all? Can you, for example, do this for obviously safe models like a new fine-tune of Llama 7B? Wouldn’t computational irreducibility imply this is basically impossible? If banning all new models is your goal, why not just do that instead of making up a fake, impossible-to-achieve metric?
As a modest proposal:
Before suggesting a new regulation regime for AI models, you must first show that it doesn’t ban obviously beneficial technologies (e.g. printing press, GPT 3.5, nuclear energy).
Can you, for example, do this for obviously safe models like a new fine-tune of Llama 7B?
Did you read the “anticipated questions” section?
If we pass this, no one will be able to train new frontier models for a long time.
Good.
But maybe “a long time” is too long. It’s not clear that this criterion can be passed, even after deeply, qualitatively understanding the model.
I share this concern. That’s one reason I’m not lobbying to implement this as-is. Even solu-2l (6.3M params, 2-layer) is probably out of reach absent serious effort and a solution to superposition.
Maybe there are useful relaxations which are both more attainable and require deep understanding of the model. We can, of course, set the acceptable misprediction rate to be higher at first, and decrease it over time. Another relaxation would be “only predict e.g. the algorithm written in a coding task, not the per-token probabilities”, but I think that seems way too easy.
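As a rough sketch of how that first relaxation could be operationalized (the starting rate and tightening schedule here are made-up placeholders, not anything proposed in the post):

```python
# Illustrative sketch of the relaxed criterion: across a validation prompt
# set, the fraction of mispredicted prompts must stay under a threshold
# that tightens over time. All schedule values below are assumptions.
def allowed_misprediction_rate(years_since_enactment: int) -> float:
    """Start lenient, then halve the allowed failure rate every two years."""
    return 0.5 * (0.5 ** (years_since_enactment // 2))

def passes_eval(failures: int, total_prompts: int,
                years_since_enactment: int) -> bool:
    return failures / total_prompts <= allowed_misprediction_rate(
        years_since_enactment)

# e.g. in year 0 a lab may miss up to 50% of prompts; by year 6, only 6.25%.
assert passes_eval(failures=40, total_prompts=100, years_since_enactment=0)
assert not passes_eval(failures=40, total_prompts=100, years_since_enactment=6)
```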
Before suggesting a new regulation regime for AI models, you must first show that it doesn’t ban obviously beneficial technologies (e.g. printing press, GPT 3.5, nuclear energy).
I was trying to communicate that I already share the concern around excess strictness. So, I don’t understand why this (apparently condescendingly phrased) point is being repeated back to me again. The point of this post is to explore the pros and cons of this eval, and see if there are relaxations which capture most of the pros without most of the cons.
From the original comment:
How is this a relevant metric for safety at all?
If you don’t know what I think the pros are, maybe try asking more specific questions about more specific claims I make in the post?
FWIW it’s fairly obvious to me that the latter two technologies have significant downsides, so calling them obviously beneficial feels like a stretch.
Do you have an actual choice to ban either technology if you want to remain sovereign? Wealthy countries without their own nuclear weapons are generally shielded by the arsenals of ones that do. If you are not using at least GPT-3.5, welcome to the unemployment line within a foreseeable timespan.
Ditto for countries that use and expand on GPT-3.5.
Absolutely. I will go with a counterfactual and assume we don’t mean literally GPT-3.5, but a model using the same architecture, level of compute, and scale that has been cleaned up and fine-tuned for productive tasks.