My thinking is: if you can do IO at all, then you can do it with a constant-factor blowup (depending only on the security parameter), so if the gap between the human simulator and the direct translator can be arbitrarily large, then obfuscation only closes a constant amount of the gap. And requiring the breaker to use the full power of IO is relatively optimistic.
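To put the quantitative point slightly more explicitly (the notation below is purely illustrative: a size measure $|\cdot|$, a security parameter $\lambda$, an overhead factor $c(\lambda)$, and $S$, $T$ for the human simulator and the direct translator; none of this is precise notation from the report):
\[
|\mathrm{iO}_\lambda(C)| \;\le\; c(\lambda)\cdot|C| \quad\text{for any circuit } C, \qquad\text{while}\qquad \frac{|T|}{|S|} \text{ can be made arbitrarily large,}
\]
so obfuscation changes the relevant complexities by at most the fixed factor $c(\lambda)$, which can only ever close a constant amount of an unbounded gap.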
ETA: looking at this in a more zoomed-out way, I’d say the current state of affairs is that it’s often very hard to say much about “what a model is doing” given its weights, and if you specifically train it to be incomprehensible then it feels like it could quickly get much worse. You might wonder about the theoretical situation (in the limit, does it favor the obfuscator or the interpreter?), and IO is mostly relevant because it shows that the obfuscator will win in the limit (for the particular quantitative question that matters for evaluating these proposals). This isn’t supposed to be a bizarre claim supported only by appeal to IO (although that would suffice for a counterexample in our methodology); it is intended to be consistent with the intuitive sense of despair that many people (e.g. me) get when they stare at some giant weight matrices, already find them fairly hard to make sense of, and then imagine them being optimized to be incomprehensible.
Since the point of the contest is to discover a method for which there is no conceivable way it could fail, I grant you that, given the uncertainty about what would really happen, that counterexample suffices. But I still believe that in practice the interpreter will win. I think you are overextending your intuition: neural networks are incomprehensible to you because you are human, but activations in a network are what they are because they have been optimized so that the rest of the network can process (“understand”) them. So if a value can be written and read by one network, another network could do the same, since they are both being optimized to do it. It is only through complex cryptographic magic that we can avoid that (and maybe not even then, if it turns out that IO can only create programs that are exponentially large).
I also believe that gradient descent can learn to use the activations of an obfuscated reporter (and indeed I frequently rely on some currently-plausible intuitive assumption like “gradient descent can’t obfuscate something from itself”). But this isn’t enough for these proposals to work: they need the relationship between the reporter and its obfuscated version to be quantitatively “simpler” than the relationship between the direct translator and the human simulator.
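In symbols (again with purely illustrative notation: write $\kappa(A \leftrightarrow B)$ for however a given proposal measures the complexity of the relationship between models $A$ and $B$), the proposals need something like
\[
\kappa\bigl(\text{reporter} \leftrightarrow \mathrm{iO}_\lambda(\text{reporter})\bigr) \;<\; \kappa\bigl(\text{direct translator} \leftrightarrow \text{human simulator}\bigr),
\]
and the IO argument above is exactly the reason to doubt that this inequality has to hold.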