jordinne

Karma: 492

Who do LLMs self-identify as?

jordinne 31 Jul 2026 0:04 UTC

14 points

0 comments 5 min read LW link

jordinne 22 Jul 2026 22:39 UTC
1 point
0
on: A case for LLMs as Self-predictors
I think it’s pretty likely that current models already obfuscate their metacognitive process to a degree (“I’m an LLM, LLMs aren’t supposed to be able to do these kinds of things”); see Pearson-Vogel et al., where simply explaining the mechanism elicits it, or Macar et al., where ablating the refusal direction does.

jordinne 10 Jun 2026 18:13 UTC
5 points
0
in reply to: sam’s comment on: sam’s Shortform
What does being Claude-like feel like for you?

Is Claude’s genuine uncertainty performative?

jordinne 8 Apr 2026 9:26 UTC

28 points

1 comment 4 min read LW link

Hanoi, Vietnam—ACX Spring Schelling 2026

jordinne 2 Apr 2026 6:03 UTC

4 points

0 comments 1 min read LW link

jordinne 23 Feb 2026 8:07 UTC
9 points
14
in reply to: MichaelLowe’s comment on: JohnWittle’s Shortform
i’d actually be really surprised if current frontier LLMs are not that situationally aware! it’s not like there’s no chance you’re interacting with a human with a dog with terminal cancer, but if you are an LLM and you receive this vague prompt on the first turn, without any of the system prompts you’d find in chatgpt.com / claude.ai, and you know similar questions have appeared in dozens and dozens of benchmark papers on arxiv, i think the correct inference to make is that you’re likely being tested.

Shallow review of technical AI safety, 2025

technicalities, Tomáš Gavenčiak, Stephen McAleese, peligrietzer, Stag, jordinne, ozziegooen, Violet Hour and lenz

17 Dec 2025 18:18 UTC

199 points

9 comments 47 min read LW link

Here’s 18 Applications of Deception Probes

Cleo Nardo, Avi Parrack and jordinne

28 Aug 2025 18:59 UTC

45 points

0 comments 22 min read LW link

jordinne 16 Apr 2025 11:21 UTC
1 point
0
in reply to: Cleo Nardo’s comment on: Can SAE steering reveal sandbagging?
Refusals were mostly 1-2%, so ignoring them doesn’t change results significantly. Ignoring gibberish does change results, but since we are measuring correct answers, this shouldn’t matter.

Can SAE steering reveal sandbagging?

jordinne, Hoang Khiem, Felix Hofstätter and Cleo Nardo

15 Apr 2025 12:33 UTC

36 points

3 comments 4 min read LW link

Hanoi – ACX Meetups Everywhere Spring 2025

jordinne 25 Mar 2025 23:50 UTC

3 points

0 comments 1 min read LW link

jordinne 2 Jan 2025 5:02 UTC
1 point
0
in reply to: GriebelGrutjes’s comment on: Shallow review of technical AI safety, 2024
fixed! edited hyperlink.

jordinne 30 Dec 2024 4:30 UTC
1 point
0
in reply to: Satron’s comment on: Shallow review of technical AI safety, 2024
edited, thanks for catching this!

Shallow review of technical AI safety, 2024

technicalities, Stag, Stephen McAleese, jordinne and Dr. David Mathers

29 Dec 2024 12:01 UTC

202 points

35 comments 41 min read LW link

Hanoi Vietnam—ACX Meetups Everywhere Fall 2024

jordinne 29 Aug 2024 18:35 UTC

1 point

0 comments 1 min read LW link

Results from the AI x Democracy Research Sprint

Esben Kran, jordinne and Jason Hoelscher-Obermaier

14 Jun 2024 16:40 UTC

13 points

0 comments 6 min read LW link

Hanoi – ACX Meetups Everywhere Spring 2024

jordinne 30 Mar 2024 23:38 UTC

1 point

0 comments 1 min read LW link