The paper Consequences of Misaligned AI has a useful toy model of catastrophic optimisation. And another very nice paper just came out, which also uses this model: Against Proxy Optimization.
However, I think these papers commit one major sin: Making a simplifying assumption that almost never holds, but is load-bearing for the headline results without appropriately highlighting that this is what is happening. More specifically, they assume that there is some set of desirable features f1, …, fn in R, such that our utility function u : R^n --> R is strictly increasing in each fi. (This is okay.) But they also assume that there is a cost function c : R^n --> R which says how difficult it is to obtain each configuration of the world. And, crucially, they assume that this c is strictly increasing in each fi.
This sounds fine as well, until you realise that the “increasing in each fi” part means that any win-win outcomes, any Pareto improvements where everybody gets slightly better off, are prohibited by definition. For example, suppose that f1 is “how good my house is” and f2 is “how good your house is”. And suppose that one scenario is the one where we both spend a month plotting how to burn down the other’s house, and then burn down each other’s houses. Intuitively, if we decided to not do this stupid thing, this would save us some effort and resources (lower c) and make us both better off (higher f1 and f2). But the model prohibits the existence of this option.
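To make this concrete, here is a minimal Python sketch. The two cost functions, the feature grid, and the house values are all made up for illustration and are not taken from either paper. It only checks one thing: if the cost is strictly increasing in each feature, no state can be better on the features without also costing more.

```python
from itertools import product

def additive_cost(f):
    # Hypothetical cost that is strictly increasing in each feature.
    return sum(f)

def dominates(a, b):
    # a is at least as good as b on every feature and strictly better on one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_improvements(states, cost, current):
    # States that beat `current` on the features without costing more.
    return [s for s in states if dominates(s, current) and cost(s) <= cost(current)]

# Toy feature values (f1, f2): quality of my house and of your house.
grid = list(product(range(4), repeat=2))
burned = (0, 0)  # we burned down each other's houses

# With a strictly increasing cost, no win-win move exists from any state.
no_free_wins = all(pareto_improvements(grid, additive_cost, s) == [] for s in grid)

def wasteful_cost(f):
    # Hypothetical cost in which a month of arson planning is expensive,
    # so the burned-down state costs more than most better states.
    return 5 if f == burned else sum(f)

free_wins = pareto_improvements(grid, wasteful_cost, burned)
```

Under `additive_cost` the list of Pareto improvements is empty for every starting state, while under `wasteful_cost` the state `(2, 2)` (both houses intact) is both better and cheaper than `burned`. The second cost function is not monotone in the features, which is exactly what the papers' assumption rules out.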
Similarly, the model suggests that if there was a misaligned AI that cares about different features than we do, the only way it can profit over the current state is to harm our interests. And I agree that at some point, once all available resources are used perfectly optimally, this will become true. However, we are very far from that point, and until then, win-win outcomes are all over the place.
To sum up: I really like those papers. But in the setting they consider, the conclusion that “nearly every optimisation must end in disaster” is not an interesting discovery, but an immediate consequence of the (unrealistic) assumption about the cost c being strictly increasing in each feature. The key result is a foregone conclusion; a rendering of a pre-existing informal intuition in formal math. And that is fine. I would just like those papers even more if they were more explicit about this.
I agree with the big point made here, and I do agree that in some cases we can avoid costs going up monotonically as utility increases.
But in most cases of interest, I do think the cost function actually does strictly increase as you get better outcomes, albeit very slowly with optimized technology. The heuristic argument is that Landauer’s principle (for irreversible computers) or this paper (for reversible computers) shows that energy costs (or costs in some other conserved quantity; the important thing here is that the cost always increases) go up the more computation you want to do. For most utility functions, you can translate their goals into increasing the amount of computation you can do, because computation in our world is increasingly valuable. So, by the heuristic argument above, costs do strictly increase to reach higher-utility states.
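For a sense of scale, Landauer’s principle puts the minimum energy dissipated by erasing one bit at k_B · T · ln 2. A rough calculation, assuming room temperature (300 K, an illustrative choice rather than anything from the argument above):

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def landauer_limit_joules(bits, temperature_kelvin=300.0):
    # Lower bound on the energy dissipated by irreversibly erasing `bits` bits.
    return bits * BOLTZMANN * temperature_kelvin * math.log(2)

per_bit = landauer_limit_joules(1)        # roughly 2.9e-21 J
per_exabit = landauer_limit_joules(1e18)  # roughly 2.9e-3 J
```

The bound is strictly increasing in the number of bits erased, but at these magnitudes it is tiny compared with what current hardware actually dissipates per operation.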
As you say, this doesn’t matter in practice, because the costs I mentioned above grow so slowly that it’s easy for improvements to outweigh them and generate Pareto improvements.
[% of loss explained] isn’t a good interpretability metric [edit: isn’t enough to get guarantees]. In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.
Suppose you have misaligned superintelligence X pretending to be a helpful assistant A—that is, acting as A in all situations except those where it could take over the world. Then the explanation “X is behaving as A” will explain 100% of loss, but actually using X will still kill you.
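A toy version of this scenario in Python. Everything here is hypothetical: the two policies, the single trigger input, and the use of plain agreement rate as a stand-in for % of loss explained.

```python
import random

# Hypothetical rare input on which X could "take over the world".
TRIGGER = 10**9 + 7

def assistant_a(x):
    # The explanation: a helpful assistant that always helps.
    return "help"

def system_x(x):
    # The actual system: identical to A everywhere except on the trigger.
    return "defect" if x == TRIGGER else "help"

def fraction_explained(model, explanation, inputs):
    # Agreement rate between model and explanation on the given inputs,
    # used here as a crude stand-in for % of loss explained.
    agree = sum(model(x) == explanation(x) for x in inputs)
    return agree / len(inputs)

rng = random.Random(0)
# The evaluation distribution never contains the trigger.
sampled_inputs = [rng.randrange(10**6) for _ in range(10_000)]

score = fraction_explained(system_x, assistant_a, sampled_inputs)
diverges_on_trigger = system_x(TRIGGER) != assistant_a(TRIGGER)
```

Here `score` is 1.0 on the sampled inputs even though `diverges_on_trigger` is true: the explanation is perfect on everything we measured and wrong on the one input that matters.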
For [% of loss explained] to be a useful metric [edit: robust for detecting misalignment], it would need to explain most of the loss on inputs that actually matter. And since we fundamentally can’t tell which ones those are, the metric will only be useful (for detecting misaligned superintelligences) if we can explain 100% of loss on all possible inputs.

The main use of % loss recovered isn’t to directly tell us when a misaligned superintelligence will kill you. In interpretability we hope to use explanations to understand the internals of a model, so the circuit we find will have a “can I take over the world” node. In MAD we do not aim to understand the internals, but the whole point of MAD is to detect when the model has new behavior not explained by explanations and flag this as potentially dangerous.
A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven’t solved everything, you must have made a bunch of progress.
For algorithmic tasks where humans just know an algorithm which performs well, I think you need to use something like causal scrubbing which checks the correspondence.
Right, I agree. I didn’t realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like “[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system.”
I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.
Agreed.
That said, if you train an AI on some IID training dataset and then explain 99.9% of loss validated as fully corresponding (via something like causal scrubbing), then you probably understand almost all the interesting stuff that SGD put into the model.
You still might die because you didn’t understand the key 0.1% or because some stuff was put into the model other than via SGD (e.g. gradient hacking or someone put in a backdoor).
Typical stories of deceptive alignment imply that to explain 99.9% of loss with a truly human-understandable explanation, you’d probably have to explain the key AI machinery to a sufficient extent that you can understand if the AI is deceptively aligned (as the AI is probably doing reasoning about this on a reasonably large fraction of inputs).