Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling.
I write the ML Safety Newsletter.
DMs open, especially for promising opportunities in AI Safety and potential collaborators. I’m maybe interested in helping you optimize the communications of your new project.
But, as I understand the AO paper, they do not actually validate on models like these; they only test on models that they have trained to be misaligned in fairly straightforward ways. I still think it's good work; I'm more just pointing out a gap that future work could address.