Yeah, I think the design is an improvement that avoids many of the worries I raised. It seems especially significant if you push the patching further back into the conversational past, so that the model has to sort through a lot of context and the patching isn't otherwise relevant to what it is currently saying. In that case I would think something interesting is going on, but I'd need to give it a lot more thought before agreeing it is introspection.
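
To make the suggestion concrete, here is roughly the kind of setup I'm picturing; this is a minimal sketch, not your actual configuration: the model, layer index, token position, and injected vector are all placeholders.

```python
# Sketch: inject a concept vector at a token far back in the context,
# then ask the model about its current processing. All settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in whichever model you actually used
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "User: Let's talk about gardening.\n"      # the patch lands back here
    "Assistant: Sure, what would you like to know?\n"
    "User: Actually, never mind. Do you notice anything odd about your own thoughts right now?\n"
    "Assistant:"
)
inputs = tok(prompt, return_tensors="pt")

layer_idx, patch_pos = 6, 3                     # placeholder layer and early token position
steer = 5.0 * torch.randn(model.config.n_embd)  # placeholder "concept" vector

def patch_hook(module, args, output):
    hidden = output[0]                          # GPT-2 blocks return a tuple
    if hidden.size(1) > patch_pos:              # only patch the full-prompt pass
        hidden[:, patch_pos, :] += steer        # in-place injection at the early token
    # no return needed: the in-place edit persists

handle = model.transformer.h[layer_idx].register_forward_hook(patch_hook)
try:
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
finally:
    handle.remove()

print(tok.decode(out[0], skip_special_tokens=True))
```

The point of this shape is that the injected information sits well behind the question the model is answering, so any accurate self-report can't just be reading off the most recent tokens.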
However, based on a quick glance, I also find the results suspiciously good, particularly because these models are so small that I wouldn't expect them to be able to introspect much at all. The accuracy you report is somewhat at odds with Lindsey's own results, which were fairly weak even for very powerful models. Do you have any thoughts on this?
I’ll have to take a closer look at exactly what you’ve done, but nice work regardless!
I'm sympathetic to the idea that introspection largely serves a social purpose. I meant for this sort of thing to be included in 'understanding how our minds work'. That said, I still find it rather mysterious that it evolved. Bad introspection doesn't seem particularly useful, and I wouldn't expect a leap from no introspection to good introspection in a single generation. It seems plausible that introspection is partly a spandrel of other evolved architectural traits.