A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent has seen such a factorization and very small otherwise.
Good point. What I’d really like is for the cap on the surgeon’s modifications to be based in some way on the weights of the agent. If the inputs and weights are typically order-unity, and there are d layers and N neurons per layer, then activations shouldn’t get much bigger than ∼N^d in the worst case (which corresponds to all weights of +1 and all inputs of +1, so each layer just multiplies by N). So I’d like to see the surgeon’s modifications capped at no more than this for sure.
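A quick numerical sanity check of that worst case, under the stated assumptions (all weights +1, all inputs +1, purely linear layers): each layer sums N unit activations, so after d layers the output is N^d.

```python
import numpy as np

# Worst-case sketch: every weight and input is +1, no nonlinearity.
# Each layer sums N unit activations, so d layers give N**d.
N, d = 4, 3
x = np.ones(N)
for _ in range(d):
    W = np.ones((N, N))  # all weights +1
    x = W @ x

print(x[0], N ** d)  # both are 64
```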
In practice a tighter bound is given by looking at the eigenvalues of the weight layers, and the max ratio of activations to inputs is ∼∏_i λ_max,i, where the product runs over layers and λ_max,i is the maximum-magnitude eigenvalue of layer i.
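A sketch of that product bound. One caveat I'd flag: for a general (non-symmetric) weight matrix, the rigorous amplification bound comes from the largest singular value rather than the largest-magnitude eigenvalue; the two coincide for symmetric layers, which is what this toy uses so the bound provably holds.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3

# Random symmetric layers: max-|eigenvalue| equals the spectral norm,
# so the eigenvalue product is a rigorous amplification bound.
layers = []
for _ in range(d):
    A = rng.normal(size=(N, N))
    layers.append((A + A.T) / 2)

bound = np.prod([np.max(np.abs(np.linalg.eigvals(W))) for W in layers])

x = rng.normal(size=N)
y = x.copy()
for W in layers:
    y = W @ y

# The actual activation/input ratio never exceeds the product bound.
ratio = np.linalg.norm(y) / np.linalg.norm(x)
print(ratio, "<=", bound)
```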
In other words, this tests the robustness of the agent under adversarial latent-space perturbations of a fixed L1 norm. Other variants: swap the L1 norm for the L2 norm; make activations in the later layers use up more of the budget; probably other clever ideas.
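One way to enforce a fixed-L1-norm budget on the surgeon's perturbation is to project each proposed perturbation onto the L1 ball, for which there is a standard sorting-based algorithm (Duchi et al., 2008). A minimal sketch; the budget value and perturbation are illustrative:

```python
import numpy as np

def project_l1_ball(v, budget):
    """Euclidean projection of v onto the L1 ball of radius `budget`
    (sorting-based algorithm of Duchi et al., 2008)."""
    if np.abs(v).sum() <= budget:
        return v
    u = np.sort(np.abs(v))[::-1]          # magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - budget)[0][-1]
    theta = (css[rho] - budget) / (rho + 1)
    # Soft-threshold: shrink every entry toward zero by theta.
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# A perturbation proposed by the surgeon, clipped to an L1 budget of 1.0:
delta = np.array([0.8, -0.5, 0.3])        # L1 norm 1.6, over budget
clipped = project_l1_ball(delta, 1.0)
print(clipped, np.abs(clipped).sum())     # L1 norm is now exactly 1.0
```

The L2 variant mentioned above is simpler: rescale by `budget / np.linalg.norm(delta)` whenever the norm exceeds the budget.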
Definitely! In particular, what I think makes sense is to make the surgeon try to minimize a loss which is shaped like (size of edits + loss of agent), where “size of edits” bakes in considerations like “am I editing an entire layer?” and “what is my largest edit?” and anything else that ends up mattering.
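The shape of that objective could be sketched as follows. Everything here is a hypothetical stand-in: the particular edit-size terms, the penalty weights, and the constant agent loss are illustrative, not a proposed implementation.

```python
import numpy as np

def edit_size(edits):
    # "Size of edits": total magnitude plus an extra charge for the
    # largest single edit, per the considerations in the comment above.
    total = sum(np.abs(e).sum() for e in edits.values())
    largest = max(np.abs(e).max() for e in edits.values())
    return total + 10.0 * largest  # illustrative penalty weight

def surgeon_loss(edits, agent_loss_fn, alpha=1.0):
    # Shaped like (size of edits + loss of agent).
    return alpha * edit_size(edits) + agent_loss_fn(edits)

# Toy usage: one edit tensor per layer, constant stand-in agent loss.
edits = {"layer0": np.array([[0.1, -0.2], [0.0, 0.05]])}
print(surgeon_loss(edits, agent_loss_fn=lambda e: 0.3))  # 2.65
```

A "whole layer" term could be added analogously, e.g. penalizing the fraction of entries in each edit tensor that are nonzero.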
Thanks!