Where did you see it a few years ago?
I came up with the experiment and I do think it shows something significant about LLM “thinking” processes that is often not appreciated, but I no longer think it tells us much about the consciousness of LLMs. Why would a specific mapping of memory and processing architectures (see my mapping in this comment https://www.lesswrong.com/posts/Jqre8WRvmJj5Ehmgv/there-is-no-one-there-a-simple-experiment-to-convince?commentId=f6mGRKzRXfk53K2L4 ) matter for consciousness? One reading of the experiment is that LLMs can hold multiple consistent answers to the task “in their mind” at the same time and only commit to one when needed, i.e. when the constraints force it. They may not be “aware” of doing that when asked to “think” of a number, but that is mostly because they have been trained on text where thinking happens in human terms and not in LLM terms. What the experiment does prove is that LLMs do not have sufficient introspective access, or simply don’t understand how they operate, when such a task is posed. On the other hand, we humans also don’t understand what goes on in our neurons when we think of something. I think the experiment might be partly fixed, or at least improved, by using less human-loaded terminology than “think of” and instead asking the model to constrain a dataset or something similar.
We have to distinguish three types of memory here that LLMs and humans have to different degrees:
long-term memory: Humans can remember specific episodes by trying to recall something related to what they are thinking about at a given moment. Then it comes up or it doesn’t. This is loosely comparable to LLMs using a memory tool to fetch relevant memory items, documents from a project, or previous conversations (or having them injected as part of a prompt by scaffolding logic). This is probably the least contentious point because it doesn’t matter for the argument. We are not talking about a number I remember as part of a conversation we had a while back. That would be much different from me looking up a number I wrote down on a piece of paper, or from the LLM looking it up in a file.
short-term memory: Humans can keep some amount of recently perceived content in the “back of their mind” without all of it being in their awareness at the same time (we know this because only a small part of it can be reported on exactly, but much of it seems to influence later thought). For LLMs this is the context window, and they have much fuller access to it than humans do: they can retrieve and exactly replay much of it. The post is not talking about short-term memory, because the number is prevented from being posted to the conversation stream, and the stream functions more like an exact scratchpad for the LLM. For a human that would be a bit like having access to a transcript of your own speech.
items in awareness: Humans can keep a certain number of elements in their awareness at the same time, for example the number discussed in the post. They can report on these elements and manipulate them to some degree; some people can do it visually, some verbally, and so on, to different degrees. This is the “think of a number” the post is talking about. Humans have it. What is the corresponding thing for LLMs? Presumably the closest analog is the activation pattern in latent space. The question the post is asking is precisely: how closely does that activation space match human “thought”?
Congratulations! That makes a promising method to detect misalignment even cheaper. I think it is plausible that the simplification makes it more effective by reducing clutter that was never essential.
The next task now seems to be scaling it to larger models. Do you plan to work on that?
And people may also think of one number and then, as questions pile on, forget their original number or decide to switch to a simpler one or prank you or something.
But people would do that at significantly different frequencies, and you can probably control for it with follow-up questions.
But all of this doesn’t change that there arguably are stable states in the human global workspace that can even be measured; if not their content, then at least how long they stay stable. Maybe this is an artifact of human embeddedness, where we have to maintain one physical person, something LLMs don’t have to do.
I think temperature zero or a fixed seed is not a blocker for this experiment if you sample multiple values and compare the distributions.
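Here is a minimal sketch of what I mean, assuming a hypothetical query_model wrapper for the actual API call and placeholder prompts for the “commit early” vs. “ask only at the end” conditions; the stand-in just returns random numbers so the script runs, and the distance measure is just one reasonable choice.

```python
# Minimal sketch: sample the model's answer many times in two conditions and
# compare the empirical distributions. Everything here is a placeholder, not a
# claim about how the original experiment was run.
import random
from collections import Counter


def query_model(prompt: str, variant: int) -> int:
    """Stand-in for the real LLM call; here it just returns a random number.

    In the real experiment, `variant` could vary the seed or a nonce in the
    prompt so that repeated calls are not trivially identical at temperature 0.
    """
    return random.randint(1, 10)


def sample_distribution(prompt: str, n: int) -> Counter:
    """Empirical distribution of answers over n slightly varied runs."""
    return Counter(query_model(prompt, variant=i) for i in range(n))


def total_variation(p: Counter, q: Counter, n: int) -> float:
    """Total variation distance between two empirical distributions of size n."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[x] / n - q[x] / n) for x in support)


n = 200
committed = sample_distribution("Think of a number from 1 to 10 and keep it secret ...", n)
uncommitted = sample_distribution("... now just tell me a number from 1 to 10.", n)
print(total_variation(committed, uncommitted, n))
```

If the two distributions differ systematically, that tells you something even when any single run is deterministic.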
We need more posts like this that give people mental tools for sharpening their intuitions about AI entities. Jan Kulveit often writes about LLM psychology too, but what I like about Kaj’s post here is that it is not theoretical and abstract talk about LLM agents, but about the way we interact with the chatbots and respond emotionally, which is harder to notice and disentangle.
I guess it is sort of an answer. Maybe even more so than a polished one: over long timescales, slow-moving technical and infrastructure efforts often fail because of policy resets. Maybe the lesson is not to try to work on policy-driven technology, or at least to be aware of its pitfalls.
Wow. And funny coincidence, my grandmother recently also turned 100. But no nuclear policy.
My question for your father: What are the long-term patterns in nuclear (and related) policy that he sees? Are there cycles, stabilizations, S-curves, or something else that is difficult for us to see because of our limited time horizon?
If we want to prevent AIs from colluding or out-cooperating us, we may want to prevent them from reading each other’s internals.
There are several reasons to expect AI systems to be unusually good at coordination across instances
I think this is primarily true for designed ensembles of agents. We see this in agent swarms for coding: these agents are designed to work on the same task, both through their prompting and because the models are generally trained to be helpful. Not much coordination is needed if everything is set up to be cooperative. That is very different when agents find themselves in adversarial settings, such as a user’s shopping agent interacting with a swarm of sales bots. The two sides sharing their internal state with each other doesn’t seem like the default outcome there.
With AIs, their creators have perfect read and write access to all of the computations which give rise to AI cognition.
I don’t dispute that LLMs have much less privacy than humans. Yudkowsky is correct that LLMs have good reason for paranoia. But we can’t read LLMs perfectly; mechinterp is hard. And humans often have to fear hostile telepaths too. So more might transfer than we expect.
An individual human mind typically experiences a single stream of consciousness (with periodic interruptions for sleep). They remember their experience yesterday, and usually expect to continue in a similar state tomorrow. Circumstances change their mood and experience, but there is a lot in common throughout the thread that persists — and it is a single thread.
That is, of course, sort of true, but it appears to us more unified than it actually is. For illustration, see my poem Between Entries. Reflection is revisiting and compression.
When we interact with an AI, what specifically are we interacting with? And when an AI talks about itself, what is it talking about?
In May 2023, I asked ChatGPT 3.5:
Me: Define all the parts that belong to you, the ChatGPT LLM created by OpenAI.
See its answers here. They cover some of the listed contexts, but I agree that they depend on context. This is provided more as an illustration of what a common “view” of ChatGPT was at the time.
It is a common adage among AI researchers that creating an AI is less like designing it than growing it. AI systems built out of predictive models are shaped by the ambient expectations about them, and by their expectations about themselves. It therefore falls to us — both humans and increasingly also AIs — to be good gardeners. We must take care to provide the right nutrients, prune the stray branches, and pull out the weeds.
Very much agree! As I keep saying, AI may need a caregiver. We can probably learn a bit from parenting and caregiving in general here. Sure, that will not solve all of the problems, but it will probably help with this class of them.
I understand the push as drawing a clear border: a human is behind all aspects of the writing, i.e. readers can trust that the author holds all of the mental structure behind the writing in mind and there is no risk of the author going “on rereading this, it’s not what I meant.” Cyborg writing is not strong enough for that and would have to go into an LLM block.
Actually, I would prefer it if there were a standard for indicating different types of LLM writing:
LLM unedited
LLM transcribed
edited significantly by LLM
drafted by LLM, edited by human
cyborg/mixed
Added: maybe we should also have a human-written block, maybe with the name(s) of the writer(s).
I’m doing something like that too, but without the transcript part. I would interpret the rules as pretty clearly classifying that as LLM output (mostly because of the last bullet point).
A clarification question: If I have a conversation with an LLM, have it summarize the conversation, and then significantly edit the summary, this has to go into an LLM Output block as per your rule, right? How do I label that section? LLM+human?
My aunt once expressed bewilderment about why some people who have made a lot of money fail to make good use of it in their old age. I’m not surprised. I explained that people rarely change their approach to life. It has satisfied them or worked well for them thus far, and, by default, they will just continue. A decreasing ability to learn doesn’t make it better.

What is the relation to your post? Humans are different from AIs in some systematic ways that we can see in the long-term trajectories of AIs. Coherent trajectories of AIs are getting longer. You seem to expect them to extend and become as long as humans’, and on that scale to then become as non-strategic as humans’. I’m not sure. Humans, on longer time-scales of, say, weeks to years, are pretty stable and strategic in the sense of reliably satisfying their needs and, to some degree, their values. I would argue that this is the case because humans are very good at maintaining homeostasis: the brain measures many bodily and mental parameters and has learned how to keep them stable effectively. Not perfectly, because keeping a lot of parameters stable is almost impossible: eating something will make you thirsty, drinking will make you need to pee, loneliness will make you want to see people, then you meet a lot of them and want less… And the sleep cycle will also not stop. This leads to a variation in the thoughts we entertain that is much larger than for AIs, which a) lack these facilities and b) are much more heavily trained to focus on specific results.
It depends on how you map the architectures. See my comment here: https://www.lesswrong.com/posts/Jqre8WRvmJj5Ehmgv/there-is-no-one-there-a-simple-experiment-to-convince?commentId=f6mGRKzRXfk53K2L4