Nothing wrong with trying things out, but given the paper's efforts to rule out semantic connections, the fact that it only works on the same base model, and that it seems to be possible for pretty arbitrary ideas and transmission vectors, I would be fairly surprised if it was something grounded like pixel values.
I would also be surprised if Neuronpedia had anything helpful. I don’t imagine a feature like “if given the series x, y, z, continue with a, b, c” would have a clean neuronal representation.