I was skeptical about LLM introspection and didn’t expect this to work in such small models. But it did, and this has updated me quite a bit. I encourage you and others to run the experiment, or variations of it, and see for yourself; it is pretty lightweight.
One possible explanation for the difference from Lindsey’s results: my setup isolates the introspection (localization) ability, while Lindsey’s detection experiment requires both introspection and the willingness/capability to verbalize it. The fact that localization works so well suggests the bottleneck may be verbalization rather than introspection itself: the model appears unwilling to admit that it can introspect. Some of vgel’s experiments on a 32B model also point in this direction.