We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because no computation remains after that layer to turn the activation into a text output.
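A minimal sketch of the architectural point, assuming a GPT-2-style model loaded via Hugging Face `transformers` (the model, hook placement, and neuron index are illustrative): the only modules downstream of the final block are the final layer norm and the unembedding, so no further layers can read a final-layer activation and route it into the output text.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

captured = {}

def hook(module, inputs, output):
    # output[0] is the hidden state leaving the final transformer block
    captured["final_hidden"] = output[0].detach()

# Hook the last block: everything after it is just ln_f and lm_head,
# so nothing can compute on this activation before tokens are emitted.
model.transformer.h[-1].register_forward_hook(hook)

ids = tokenizer("Hello", return_tensors="pt").input_ids
with torch.no_grad():
    model(ids)

# One (arbitrary) neuron at the last token position.
print(captured["final_hidden"][0, -1, 0].item())
```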
Surely there exist correct fixed points, though? (Although probably not that useful, even if feasible)
You mean a fixed point where the model changes its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping the base model fixed.
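A hypothetical sketch of that fixed-base-model setup: freeze the base model so that training on the reporting task cannot move the activations being reported, ruling out fixed points reached by the model shifting its own activations. The linear probe here is an illustrative stand-in for whatever reporter actually gets trained.

```python
import torch
from transformers import GPT2LMHeadModel

base = GPT2LMHeadModel.from_pretrained("gpt2")
for p in base.parameters():
    p.requires_grad_(False)  # activations become a fixed target, not a moving one
base.eval()

# Only a separate reporter is optimized against the frozen activations;
# a linear probe stands in for the fine-tuned reporting model (hypothetical).
probe = torch.nn.Linear(base.config.n_embd, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
```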