
SoerenMind

Karma: 1,450

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

28 Apr 2026 19:02 UTC
29 points
0 comments · 12 min read · LW link
(alignment.anthropic.com)

Learned pain as a leading cause of chronic pain

SoerenMind · 9 Apr 2025 11:57 UTC
223 points
39 comments · 9 min read · LW link

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

28 Sep 2023 18:53 UTC
187 points
39 comments · 3 min read · LW link · 1 review

Wikipedia as an introduction to the alignment problem

SoerenMind · 29 May 2023 18:43 UTC
83 points
10 comments · 1 min read · LW link
(en.wikipedia.org)