Independent researcher theorizing about superintelligence-robust training stories.
If you disagree with me for reasons you expect I’m not aware of, please tell me!
If you have/find an idea that’s genuinely novel/out-of-human-distribution while remaining analytical, you’re welcome to send it to me to ‘introduce chaos into my system’.
Contact: {discord: quilalove, matrix: @quilauwu:matrix.org, email: quila1<at>protonmail.com}
“some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.”
-----BEGIN PGP PUBLIC KEY BLOCK-----
mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
This ability (a model inferring true facts about an author from their text) has been observed more prominently in base models. Cyborgs have termed it ‘truesight’. Two cases of this are mentioned at the top of the linked post.
---
One of my first experiences with the GPT-4 base model also involved being truesighted by it. Below is a short summary of how that went.
I had spent some hours writing and {refining, optimizing word choices, etc}[1] a more personal/expressive text. I then formatted it as a blog post and requested multiple completions via the API, to see how the model would continue it. (It may be important that I wasn’t in a state of mind of ‘writing for the model to continue’, and was instead ‘writing very genuinely’, since the latter probably embeds more information.)
One of those completions happened to be a (simulated) second post titled ‘ideas i endorse’. Its contents were very surprising to then-me because some of the included beliefs were all of the following: {ones I’d endorse}, {statistically rare}, and {not ones I thought were indicated by the text}.[2]

I also tried conditioning the model to continue my text with:
- other kinds of blog posts, about different things. The resulting character didn’t feel quite like me, but possibly like an alternate-timeline version of me who I would want to be friends with.
- text that was more directly ‘about the author’, i.e. an ‘about me’ post, which gave demographic-like info similar to, but not quite matching, my own (age, trans status).
Notably, the most important thing the outputs failed to truesight was my current focus on AI and longtermism. (My text was not about those, but neither was it about the other beliefs mentioned.)
The sum of those word choices probably encoded a lot of information about my mind, just not information that humans are attuned to detecting. Base models learn to detect information about authors because doing so is useful for next-token prediction.
Also note that using base models for this kind of experiment avoids the issue of the RLHF persona being unwilling to speculate, or being decoupled from the true beliefs of the underlying simulator.
To be clear, it also included {some beliefs that I don’t have}, and {some that I hadn’t considered before and probably wouldn’t otherwise have spent cognition on, but would agree with on reflection (e.g. about some common topics with little long-term relevance)}.