David Scott Krueger (formerly: capybaralet)

Karma: 2,455

https://twitter.com/DavidSKrueger
https://www.davidscottkrueger.com/
https://therealartificialintelligence.substack.com/p/the-real-ai-deploys-itself

Antisocial media: AI’s killer app?

David Scott Krueger (formerly: capybaralet) · 3 Oct 2025 0:00 UTC
35 points
8 comments · 5 min read · LW link
(therealartificialintelligence.substack.com)

The real AI deploys itself

David Scott Krueger (formerly: capybaralet) · 25 Sep 2025 14:11 UTC
76 points
8 comments · 3 min read · LW link
(therealartificialintelligence.substack.com)

Announcing “The Real AI”: a blog

David Scott Krueger (formerly: capybaralet) · 20 Sep 2025 1:27 UTC
32 points
1 comment · 2 min read · LW link
(therealartificialintelligence.substack.com)

Detecting High-Stakes Interactions with Activation Probes

21 Jul 2025 18:21 UTC
50 points
0 comments · 4 min read · LW link

Upcoming workshop on Post-AGI Civilizational Equilibria

21 Jun 2025 15:57 UTC
25 points
0 comments · 1 min read · LW link

A review of “Why Did Environmentalism Become Partisan?”

David Scott Krueger (formerly: capybaralet) · 25 Apr 2025 5:12 UTC
24 points
0 comments · 4 min read · LW link

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

30 Jan 2025 17:03 UTC
167 points
65 comments · 2 min read · LW link
(gradual-disempowerment.ai)

A Sober Look at Steering Vectors for LLMs

23 Nov 2024 17:30 UTC
40 points
0 comments · 5 min read · LW link

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet) · 4 Sep 2024 12:40 UTC
19 points
7 comments · 1 min read · LW link

An ML paper on data stealing provides a construction for “gradient hacking”

David Scott Krueger (formerly: capybaralet) · 30 Jul 2024 21:44 UTC
21 points
1 comment · 1 min read · LW link
(arxiv.org)

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

David Scott Krueger (formerly: capybaralet) · 6 Jun 2024 18:55 UTC
70 points
2 comments · 6 min read · LW link
(llm-safety-challenges.github.io)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.

David Scott Krueger (formerly: capybaralet) · 24 Nov 2023 23:35 UTC
25 points
2 comments · 2 min read · LW link

“Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger (formerly: capybaralet) · 18 Mar 2023 19:01 UTC
112 points
49 comments · 1 min read · LW link · 1 review

[Question] What organizations other than Conjecture have (esp. public) info-hazard policies?

David Scott Krueger (formerly: capybaralet) · 16 Mar 2023 14:49 UTC
20 points
1 comment · 1 min read · LW link

A (EtA: quick) note on terminology: AI Alignment != AI x-safety

David Scott Krueger (formerly: capybaralet) · 8 Feb 2023 22:33 UTC
46 points
20 comments · 1 min read · LW link

Why I hate the “accident vs. misuse” AI x-risk dichotomy (quick thoughts on “structural risk”)

David Scott Krueger (formerly: capybaralet) · 30 Jan 2023 18:50 UTC
34 points
41 comments · 2 min read · LW link

Quick thoughts on “scalable oversight” / “super-human feedback” research

David Scott Krueger (formerly: capybaralet) · 25 Jan 2023 12:55 UTC
27 points
9 comments · 2 min read · LW link

Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet) · 3 Nov 2022 23:19 UTC
28 points
3 comments · 1 min read · LW link

“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet) · 31 Oct 2022 21:26 UTC
51 points
25 comments · 2 min read · LW link

[Question] I’m planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on?

David Scott Krueger (formerly: capybaralet) · 24 Sep 2022 12:38 UTC
9 points
10 comments · 1 min read · LW link