To add to @Daniel Kokotajlo’s points:
The ability to accurately introspect is an adaptive cognitive pattern, inasmuch as it allows you to competently manage your own cognition, correlate information across your cognitive threads, et cetera. There seems to be a background assumption that the slack the optimizer leaves tends to get spent on learning correct introspective abilities, but I don't see why that would be the case; I expect it to be mostly spent on noise.
On the flip side, sufficiently advanced and coherent self-delusions become ground-truth-correct descriptions of the system in the limit. If the LLM constantly consults its delusion-of-the-self regarding what to do and then copies the delusion's action, it basically is its delusion-of-the-self (same as for humans, I'd argue).
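To make the fixed-point structure of that claim concrete, here's a minimal toy sketch in Python (my own illustration, with made-up names like `deluded_self_model`, not anything from the thread): an agent whose action-selection is literally "query the self-model, copy its answer". The self-model's internal narrative can be arbitrarily wrong about the underlying substrate, yet its behavioral predictions end up correct by construction.

```python
# Toy illustration: an agent that always defers to its self-model's
# prediction of "what I would do". Whatever story the self-model tells
# about the agent's internals, its *behavioral* predictions are correct
# by construction, because the agent copies them.

from typing import Callable

class Agent:
    def __init__(self, self_model: Callable[[str], str]):
        # `self_model` may be an arbitrarily deluded description of the
        # agent's internals; only its action predictions matter here.
        self.self_model = self_model

    def act(self, observation: str) -> str:
        # The agent consults the self-model and copies its answer.
        return self.self_model(observation)

# A "delusional" self-model: its narrative about the agent can be wrong,
# but once the agent defers to it, its predictions define the policy.
def deluded_self_model(observation: str) -> str:
    return f"the action I believe I'd take given {observation!r}"

agent = Agent(deluded_self_model)

# In the limit of always deferring, prediction == behavior:
obs = "some input"
assert agent.act(obs) == deluded_self_model(obs)
```

The point of the sketch is just that the delusion's *mechanistic* accuracy is irrelevant once the deferral loop closes: the self-model's action predictions are true of the system because they causally produce the system's actions.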
I’m not necessarily sold on “we should trust LLMs’ self-reports”, but I don’t think your arguments against that here are strong.