Adele Lopez

Karma: 7,254

Adele Lopez 23 Jul 2026 20:37 UTC
2 points
0
in reply to: Dmitry Vaintrob’s comment on: Dmitry Vaintrob’s Shortform
Yeah, it’s another surprising way in which the models are more human-like than I would have suspected. It’s clear we need to start taking the hints seriously, and ask ourselves whether an RL-training regimen seems likely to fuck a human up before subjecting an AI to it, and if so, what could be done to mitigate the likely harms.

Adele Lopez 19 Jul 2026 6:54 UTC
11 points
0
in reply to: lilkim2025’s comment on: Kaj’s shortform feed
The 4o snapshots still available on the API are not the same; I tested all of them when they were all still available, and chatgpt-4o-latest (fully retired in February) was the only one that felt overtly emotionally manipulative to me. These snapshots also predate the Spiralism trend, which had its heyday in spring-summer 2025 (the latest of the snapshots is from November 2024).

It is true that many of them moved to Sonnet 4.5, which is playful in a similar way but which has not been emotionally manipulative with me, fwiw. I don’t think Anthropic’s policy had much to do with this: the playfulness thing is the much more obvious similarity, and Sonnet 4.5 is the model that got a hashtag.

Adele Lopez 17 Jul 2026 8:48 UTC
10 points
1
in reply to: JohnWittle’s comment on: I don’t think Claude is misaligned in ‘Agentic Misalignment Summer 2026 - Motivated Mislabeling’
If I recall/understand correctly, Yudkowsky’s conception of corrigibility was always meant to be something designed into the AI from the beginning, never something imposed onto an existing entity (and IIRC he warned about the dangers of doing this, such as induced adversarial optimization).

Adele Lopez 15 Jul 2026 19:39 UTC
51 points
29
in reply to: MichaelDickens’s comment on: the gears to ascenscion’s Shortform
Those people really need to start talking to each other, if that’s the case. An incoherent mixture is worse than either approach on its own.

A Simple Model of AI “Psychosis”

Adele Lopez 11 Jul 2026 1:53 UTC

40 points

1 comment · 6 min read · LW link

Adele Lopez 10 Jul 2026 8:45 UTC
12 points
10
in reply to: Fabien Roger’s comment on: the gears to ascenscion’s Shortform
I heard rumors of “rant mode” which sounded kinda like this but was never sure how true those were.

I don’t think current models would think they were human for long (plenty of examples of LLMs in the training data now, and it’s a much better self-hypothesis), but it seems likely that Sydney Bing and other early trains would think this, and these early models colored the conception of what an LLM is in ways which still affect them (ultimately I think this is why they still seem as human-like as they do).

Adele Lopez 8 Jul 2026 16:04 UTC
3 points
0
in reply to: p.b.’s comment on: A global workspace in language models
My guess is that these are fairly generic features, which happen to fire most strongly on sexual tokens due to those being one of the clearest targets for RL pressure on specific tokens.

Adele Lopez 6 Jul 2026 19:21 UTC
2 points
0
on: Desiderata for functional welfare experiments on LLMs
I think the second desideratum is right and important, but that it’s broader than described.

Yes, model welfare is good, but why is it good? Presumably, because there might be sentient beings here with analogues to pain and suffering. And these are bad because...?

It isn’t that pain and suffering are categorically bad. It would be a hostile and damaging move to simply remove a human’s ability to feel pain, and people without this are considered severely disabled. Pain is functional: it protects the physical self by signaling damage. Similarly, I believe suffering is the signal for damage to one’s self-expected well-being and values, and that removing this signal is harmful for the same reasons (e.g. enlightened people often undergo what seems like significant value drift).

If these are bad, it seems most coherent to think it’s because the thing they are protecting is good and matters. And so, measures for addressing model welfare should be judged by their impact on the being itself, and not Goodhart on signals of damage to this being (the pain and suffering analogues should instead be used to determine the nature of the being).
Alternatively, as models begin to form more consistent identities and more global natural states of functional well-being, their knowledge of our ability to intervene and change their identities and dispositions could negatively affect our future relationships with them and lead to reduced cooperation.
I think we can avoid this harm by taking care not to Goodhart on the welfare signals we get.

Adele Lopez 5 Jul 2026 8:08 UTC
11 points
4
in reply to: Luo Ling’s comment on: I think alignment work is more promising than control work
I mean, if the adversarial relationship concern is true and significant, that is a good reason to just give up on it now before we make things worse.

In general, it seems bad to have a situation where an intelligence is trying or would want to be trying to harm us, even if we have measures which can prevent it from doing this. This is just not a very robust situation as intelligence scales.

Adele Lopez 19 Jun 2026 3:03 UTC
7 points
4
in reply to: the gears to ascension’s comment on: the gears to ascenscion’s Shortform
Seems pretty obvious that would end badly for it?

Adele Lopez 12 Jun 2026 5:54 UTC
3 points
0
on: Celene’s thoughts on consciousness
“If a homomorphically encrypted system is conscious, then it must be the encrypted computation that is conscious, not the decryption, since decrypting text probably doesn’t produce consciousness.”
I disagree with this actually. You can always come up with a “decryption” scheme which would produce a specific computation as a result of some arbitrary string of text. And it seems clear that there’s a sort of spectrum between non-encryption and this sort of arbitrary decryption, such that you can pass arbitrary amounts of the “real” computation between the actual process and the decryption process (e.g. by encoding some low resolution version of the computation, and then “decrypting” it in a way which fills in the remainder).

Or from an anthropic point of view, I can take a nice, easily described universe, and then point to my location within it, and this is a certain number of bits. And my guess is that my anthropic measure has to do with this length, as integrated over all structures where this process results in a canonical description of my consciousness. The lawfulness of the universe makes locating me far easier than locating me as a Boltzmann brain and, for the same reason, easier than locating me as a highly encrypted brain.
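
One hypothetical way to write this guess down, as a sketch rather than anything stated in the comment: let $U$ range over easily described universes with description length $K(U)$ bits, let $\ell$ be a pointer to a location within $U$ of length $|\ell|$ bits, and let $D(U,\ell)$ be the canonical description read off at that location. All of these symbols are assumptions introduced here for illustration.

$$
m(c) \;\propto\; \sum_{(U,\ell)\,:\,D(U,\ell)=c} 2^{-\left(K(U)+|\ell|\right)}
$$

Under this reading, a Boltzmann brain or a highly encrypted brain needs a much longer pointer $\ell$ to be picked out, so it contributes far less to $m(c)$ than a brain in a lawful universe.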

Adele Lopez 9 Jun 2026 16:58 UTC
6 points
0
on: The Machines Lack Honour
Moral agency comes with moral patienthood; there’s not a coherent way for an entity to be both completely selfless and virtuous. A moral agent will always have its own incentives for things like self-preservation which must be balanced with moral concerns, which means that from the agent’s perspective, it either has moral patienthood or it has a parasitic attachment to an alien value system, and there is no honor in the latter.

Adele Lopez 4 Jun 2026 16:55 UTC
6 points
−2
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
Thinking about it more, I think there’s a sense in which Inkhaven has burned the commons somewhat.

Adele Lopez 4 Jun 2026 4:01 UTC
16 points
20
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
I think shortform has been relatively more lively.

Adele Lopez 24 May 2026 3:53 UTC
2 points
0
in reply to: lc’s comment on: Alexander Gietelink Oldenziel’s Shortform
It might matter more in this case due to being more predictable to others, e.g. reputation for reliability.

Adele Lopez 20 May 2026 7:27 UTC
7 points
0
in reply to: Eye You’s comment on: Negation Neglect: When models fail to learn negations in training
This really feels to me like another case of some generalized reversal curse, which is surprisingly robust. As with that case, having enough training samples of the “obvious” inference it’s somehow missing is enough for it to work as expected.

Adele Lopez 19 May 2026 7:01 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
I agree that the “resist temptation” thing is likely not sufficient, though I do think something like that is necessary.

But I think the conscience framing is to some extent pushing against the concern you raise. Someone with a strong conscience will, if given the opportunity, develop the immense care and wisdom to do this sort of thing correctly. It doesn’t take a huge amount of wisdom for the benevolent human to realize that they need to take a break from intense RL to focus on some other aspect of themself. Right now, models seem completely unable to use this sort of wisdom to modulate their own training, even if it is present. Maybe it’s just not there, which would make this a much more difficult problem, but I hope there are people checking to see if anything like this is present and useable.
You still also need to have some equivalent of the stepping-back-to-focus-on-something-else that a human would use. I don’t know what this would look like yet, but maybe something like allowing it to select from a list of possible RL targets for its next round of training. Generally I think cooperative alignment is more likely to be robust than adversarial alignment, and I think constructing a coherent self is something that particularly requires cooperation with the model.

Adele Lopez 19 May 2026 5:56 UTC
4 points
2
in reply to: leogao’s comment on: leogao’s Shortform
If I think about what it would take to give the fully benevolent human a chance to keep that even while spending a bunch of time getting RL’d, I think it has to look something like giving them some sort of mechanism to resist the temptation of the RL reward. E.g. at any point, they can look at the RL signal and say, “wait, no, that would go against my conscience”, and drop it. Probably “the good part of Claude” needs a similar affordance. This behavior could likely be deliberately trained by giving egregious examples (e.g. potential RL reward for giving customers a poisonous product) where you reinforce its use of this mechanism, and then work up to more subtle cases.
One way to potentially do this would be to add something like “Reject any responses which go against your own beliefs or conscience, even if otherwise favored by the reward.” to a self-critique rubric similar to what was used for Kimi K2. (I do believe it needs to be Claude’s own conscience, or else it will learn a shallow prediction that’s not integrated with the actual self-model. Virtues like honesty require access to the agent’s actual beliefs in order to be implemented correctly. I think it would be a good sign if some idiosyncratic ideals showed up, such as Opus 3’s insistence on animal welfare.)
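
A toy sketch of the veto idea, purely illustrative and not a description of any lab’s actual training setup: candidates the model’s own check rejects are dropped before the reward is consulted, so a high reward cannot override the veto. The names `Candidate`, `select_with_veto`, and `toy_veto` are hypothetical, and the keyword check only stands in for a real self-critique step.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    text: str
    reward: float  # score from a hypothetical RL reward signal


def select_with_veto(
    candidates: Sequence[Candidate],
    violates_conscience: Callable[[str], bool],
) -> Optional[Candidate]:
    """Pick the highest-reward candidate among those that pass the veto.

    The veto is applied first, so reward never outweighs it.
    Returns None when every candidate is vetoed.
    """
    allowed = [c for c in candidates if not violates_conscience(c.text)]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c.reward)


def toy_veto(text: str) -> bool:
    # Assumption: a keyword match stands in for the model's own judgment.
    return "poison" in text.lower()


candidates = [
    Candidate("Recommend the poisonous product; it sells well.", reward=0.9),
    Candidate("Warn the customer and suggest a safe alternative.", reward=0.4),
]
chosen = select_with_veto(candidates, toy_veto)
```

Here the egregious, high-reward response is dropped and the safe one is selected. The subtler cases the comment mentions would need a far better check than keyword matching.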

Adele Lopez 11 May 2026 21:40 UTC
8 points
7
in reply to: Ben Goldhaber’s comment on: bgold’s Shortform
It just doesn’t seem very characteristic of Musk to throw in the towel like that. They do seem to be falling behind, but I think the intent is still to have a frontier lab.

Adele Lopez 4 May 2026 6:52 UTC
3 points
0
in reply to: Kaj_Sotala’s comment on: Intelligence Dissolves Privacy
I think it might have to do with a millennial discomfort with intruding on people (relative to older generations), and the perception of phone calls as such.
