We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because no computation remains after that layer to turn the activation into a text output.
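A minimal sketch of the architectural point, assuming a GPT-2-style model loaded via Hugging Face `transformers` (the model, hook placement, and neuron index are illustrative): the only modules downstream of the final block are the final layer norm and the unembedding, so no further layers can read a final-layer activation and route it into the output text.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

captured = {}

def hook(module, inputs, output):
    # output[0] is the hidden state leaving the final transformer block
    captured["final_hidden"] = output[0].detach()

# Hook the last block: everything after it is just ln_f and lm_head,
# so nothing can compute on this activation before tokens are emitted.
model.transformer.h[-1].register_forward_hook(hook)

ids = tokenizer("Hello", return_tensors="pt").input_ids
with torch.no_grad():
    model(ids)

# One (arbitrary) neuron at the last token position.
print(captured["final_hidden"][0, -1, 0].item())
```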
Surely there exist correct fixed points, though? (Although probably not that useful, even if feasible)
You mean a fixed point where the model changes its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping the base model fixed.
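A hypothetical sketch of that fixed-base-model setup: freeze the base model so that training on the reporting task cannot move the activations being reported, ruling out fixed points reached by the model shifting its own activations. The linear probe here is an illustrative stand-in for whatever reporter actually gets trained.

```python
import torch
from transformers import GPT2LMHeadModel

base = GPT2LMHeadModel.from_pretrained("gpt2")
for p in base.parameters():
    p.requires_grad_(False)  # activations become a fixed target, not a moving one
base.eval()

# Only a separate reporter is optimized against the frozen activations;
# a linear probe stands in for the fine-tuned reporting model (hypothetical).
probe = torch.nn.Linear(base.config.n_embd, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
```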