I think this is a great distinction to keep in mind when assessing these experiments.
I disagree that Lindsey’s experiments clearly avoid this problem. As I understand it, activation steering might lead the model to claim it is being steered in the right contexts because it has been steered, but not because it recognizes it has been steered.
Consider this:
Suppose you're writing a play and I bet you $1000 you can't work 'aardvark' into it naturally. As it happens, your play is about the neurosurgeon Penfield, and the next scene you planned to write depicts his classic direct neural stimulation experiments. In that scene, Penfield announces he is about to stimulate a brain region and asks the test subject whether they feel anything. You see your opportunity to collect the $1000 by having the test subject claim to see an image of an aardvark. The first word you need to put in the test subject's mouth is 'yes', and only after that do they describe the aardvark. You're not seeing an aardvark. The character is a fiction, and you don't really care what they are experiencing according to the fiction. It makes sense for you to put those words in the character's mouth.
If we think the steering has an effect like the bet, inclining the model toward making minimal changes to naturally incorporate the steered concept into the assistant's half of the conversation, then its claim to be steered wouldn't need to go by way of any metacognitive recognition of the effects of steering.
That said, I don't think it is obvious that Lindsey is wrong about the mechanism; I just think more work is needed to confirm it one way or the other.
Yeah. I don't think we really know what representations we're getting when we extract them for steering. And models do have some ability to plan ahead and make choices in text that reflect the direction they're going in (as Lindsey's past work showed!). So if we steer them toward a concept, we may be adding something like an intention to talk about that concept in the not-too-distant textual future, and that intention may play out in ways that set up the introduction of the concept.

I think you see this elsewhere: steering doesn't immediately force the concept to be expressed, but you get words that introduce it. Mild steering toward 'ocean' doesn't make the LLM say 'Ocean ocean waves ocean'; it says something more like 'I like walking on the beach and looking at the ocean'. So when you prompt the model to answer whether it has been steered, and steer it toward 'ocean', you might be nudging it to confabulate introspection as an introduction to talking about the ocean.
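For concreteness, here is a minimal sketch of the kind of steering intervention I have in mind, in the common style of a contrastive-prompt steering vector added to the residual stream. It uses GPT-2 via HuggingFace transformers as a stand-in, and the layer index, scale, and contrastive prompts are illustrative assumptions on my part, not the setup from Lindsey's experiments. The point is just that nothing in the intervention itself tells the model it is being steered; it only adds a direction and lets the model make of it what it will.

```python
# Minimal sketch of activation steering with a contrastive-prompt vector.
# GPT-2, the layer index, the scale, and the prompts are all illustrative
# assumptions; this is not the setup from Lindsey's experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 6, 8.0  # assumed values, chosen for illustration only

def mean_activation(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a piece of text."""
    captured = {}
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden.detach()  # (batch, seq, d_model)
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return captured["h"].mean(dim=1).squeeze(0)

# Crude 'ocean' direction: difference of mean activations on contrastive prompts.
direction = mean_activation("The ocean, the waves, the open sea.") \
          - mean_activation("The desert, the dunes, the dry sand.")
direction = direction / direction.norm()

def steering_hook(_module, _inputs, output):
    # Add the scaled direction to every position's residual stream at LAYER.
    if isinstance(output, tuple):
        return (output[0] + SCALE * direction,) + output[1:]
    return output + SCALE * direction

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
prompt = "Tell me about your day."
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```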