When you say “which yields a solution of the form ”, are you saying the equality yields that, or the inequality? Because, for the former, that seems wrong: the equality should yield only things of the form .
I meant the former (which, you’re right, only has the solution with ). I only added the extra term to make it satisfy the inequality, so it gives only a subset of the inequality’s solutions. The (quite complicated!) expression you provided also works.
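(The equality-vs-inequality pattern here is easy to illustrate with a toy stand-in; the simple linear ODE below is my own example, not the actual expression under discussion. The equality pins down a one-parameter family, and tacking on an extra term yields some, but not all, solutions of the corresponding inequality.)

```latex
% Toy stand-in, not the original equation from this thread.
% The equality has exactly one family of solutions:
%   \dot{y} = k y    =>    y(t) = C e^{k t}.
% Adding a term gives solutions of the inequality \dot{y} >= k y:
%   y(t) = C e^{k t} + a t e^{k t},  with  a >= 0,
% since then \dot{y} = k y + a e^{k t} >= k y.
% But this is only a subset of the inequality's solutions:
% e.g. y(t) = e^{2 k t} also satisfies \dot{y} >= k y for k > 0,
% yet is not of the form above.
\[
  \dot{y} = k y \;\Rightarrow\; y(t) = C e^{k t},
  \qquad
  y(t) = C e^{k t} + a\,t\,e^{k t} \;\Rightarrow\; \dot{y} - k y = a e^{k t} \ge 0 .
\]
```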
Interesting! If I understand correctly, one assumption this rests on is that deception is usually fairly continuous: to become really deceptive (and obfuscate your latent space), you first have to be a little deceptive (at which point small changes in your latent space are still revealing). I wonder if more capable models might break that assumption? Accordingly, why not train the Surgeon in lockstep with the model (or give it more runtime, if it’s a fixed algorithm)? You’d probably tune the Surgeon’s training to be proportional to how quickly the model itself is learning.
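To be concrete about what “lockstep” could mean, here is a minimal sketch, assuming a gradient-trained Surgeon that perturbs the model’s latent activations. The `Model`/`Surgeon` classes, the toy task, and the proportionality heuristic are all my own hypothetical stand-ins, not the setup from the original post:

```python
# Hypothetical sketch of training an adversary ("Surgeon") in lockstep
# with the model: the Surgeon perturbs the model's latent activations,
# and gets more gradient steps when the model itself is improving fast.
import torch
import torch.nn as nn

torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self, d_in=8, d_hidden=16, d_out=1):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x, perturb=None):
        h = torch.relu(self.encoder(x))
        if perturb is not None:
            h = h + perturb(h)  # Surgeon edits the latent space
        return self.head(h)

class Surgeon(nn.Module):
    """Adversary proposing small, bounded latent-space edits."""
    def __init__(self, d_hidden=16, budget=0.1):
        super().__init__()
        self.net = nn.Linear(d_hidden, d_hidden)
        self.budget = budget

    def forward(self, h):
        return self.budget * torch.tanh(self.net(h))

model, surgeon = Model(), Surgeon()
opt_m = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_s = torch.optim.Adam(surgeon.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
prev_loss = None

for step in range(1000):
    x = torch.randn(32, 8)
    y = x.sum(dim=1, keepdim=True)  # toy regression task

    # Model step: do well even under the Surgeon's perturbation.
    loss = loss_fn(model(x, perturb=surgeon), y)
    opt_m.zero_grad()
    loss.backward()
    opt_m.step()

    # Lockstep schedule: scale Surgeon effort by how fast the model
    # is improving (a crude proportionality heuristic, capped).
    improvement = 0.0 if prev_loss is None else max(prev_loss - loss.item(), 0.0)
    prev_loss = loss.item()
    n_surgeon_steps = min(1 + int(100 * improvement), 10)

    for _ in range(n_surgeon_steps):
        # Surgeon maximizes the model's loss (ascent via negation).
        adv_loss = -loss_fn(model(x, perturb=surgeon), y)
        opt_s.zero_grad()
        adv_loss.backward()
        opt_s.step()
```

The point of the `n_surgeon_steps` line is just the proportionality idea from above: a fixed-algorithm Surgeon would instead get more runtime (e.g. more search iterations) rather than more gradient steps.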