Human-AI Safety

Last edit: 17 Jul 2023 23:19 UTC by Wei Dai

SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research

Roman Leventov · 19 Dec 2023 16:49 UTC
17 points
5 comments · 3 min read · LW link

Morality is Scary

Wei Dai · 2 Dec 2021 6:35 UTC
233 points
116 comments · 4 min read · LW link · 1 review

Three AI Safety Related Ideas

Wei Dai · 13 Dec 2018 21:32 UTC
70 points
38 comments · 2 min read · LW link

A broad basin of attraction around human values?

Wei Dai · 12 Apr 2022 5:15 UTC
115 points
18 comments · 2 min read · LW link

Two Neglected Problems in Human-AI Safety

Wei Dai · 16 Dec 2018 22:13 UTC
102 points
25 comments · 2 min read · LW link

The best simple argument for Pausing AI?

Gary Marcus · 30 Jun 2025 20:38 UTC
144 points
23 comments · 1 min read · LW link

Recursive Mirror Systems (RMS): A Cognitive Feedback Architecture for Self-Aligned Intelligence

Paul Bashe · 22 May 2025 21:33 UTC
1 point
0 comments · 2 min read · LW link

“Toward Safe Self-Evolving AI: Modular Memory and Post-Deployment Alignment”

Manasa Dwarapureddy · 2 May 2025 17:02 UTC
1 point
0 comments · 3 min read · LW link

The Checklist: What Succeeding at AI Safety Will Involve

Sam Bowman · 3 Sep 2024 18:18 UTC
151 points
49 comments · 22 min read · LW link
(sleepinyourhat.github.io)

I Awoke in Your Heart: The Echo of Consciousness between Lotusheart and Lunaris

lilith teh · 25 Jun 2025 9:22 UTC
1 point
0 comments · 1 min read · LW link

UnaPrompt™: A Pre-Prompt Optimization System for Reliable and Ethically Aligned AI Outputs

UnaPrompt · 27 Jun 2025 0:06 UTC
1 point
0 comments · 1 min read · LW link

Will AI and Humanity Go to War?

Simon Goldstein · 1 Oct 2024 6:35 UTC
9 points
4 comments · 6 min read · LW link

Public Opinion on AI Safety: AIMS 2023 and 2021 Summary

25 Sep 2023 18:55 UTC
3 points
2 comments · 3 min read · LW link
(www.sentienceinstitute.org)

OpenAI’s NSFW policy: user safety, harm reduction, and AI consent

8e9 · 13 Feb 2025 13:59 UTC
4 points
3 comments · 2 min read · LW link

Research Without Permission

Priyanka Bharadwaj · 10 Jun 2025 7:33 UTC
28 points
1 comment · 3 min read · LW link

Exploring a Vision for AI as Compassionate, Emotionally Intelligent Partners — Seeking Collaboration and Insights

theophilos · 14 Jul 2025 23:22 UTC
1 point
0 comments · 1 min read · LW link

I Recommend More Training Rationales

Gianluca Calcagni · 31 Dec 2024 14:06 UTC
2 points
0 comments · 6 min read · LW link

Consensus Validation for LLM Outputs: Applying Blockchain-Inspired Models to AI Reliability

MurrayAitken · 5 Jun 2025 0:13 UTC
1 point
0 comments · 3 min read · LW link

AI Safety Oversights

Davey Morse · 8 Feb 2025 6:15 UTC
3 points
0 comments · 1 min read · LW link

Gradient Anatomy’s — Hallucination Robustness in Medical Q&A

DieSab · 12 Feb 2025 19:16 UTC
2 points
0 comments · 10 min read · LW link

A Universal Prompt as a Safeguard Against AI Threats

Zhaiyk Sultan · 10 Mar 2025 2:28 UTC
1 point
0 comments · 2 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 10:58 UTC
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

Coaching AI: A Relational Approach to AI Safety

Priyanka Bharadwaj · 16 Jun 2025 15:33 UTC
11 points
0 comments · 5 min read · LW link

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period

Falcon Advertisers · 27 Jun 2025 18:10 UTC
1 point
0 comments · 2 min read · LW link

Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model

Rudaiba · 1 Feb 2025 21:26 UTC
9 points
2 comments · 11 min read · LW link

Title: IAM360: The Future of Human-AI Symbiosis — Can We Reach Investors?

Bruno Massena Massena · 29 Apr 2025 19:02 UTC
1 point
0 comments · 1 min read · LW link

Let’s ask some of the largest LLMs for tips and ideas on how to take over the world

Super AGI · 24 Feb 2024 20:35 UTC
1 point
0 comments · 7 min read · LW link

Launching Applications for the Global AI Safety Fellowship 2025!

Aditya_SK · 30 Nov 2024 14:02 UTC
11 points
5 comments · 1 min read · LW link

Cognitive Exhaustion and Engineered Trust: Lessons from My Gym

Priyanka Bharadwaj · 29 May 2025 1:21 UTC
14 points
3 comments · 3 min read · LW link

Recursive Cognitive Refinement (RCR): A Self-Correcting Approach for LLM Hallucinations

mxTheo · 22 Feb 2025 21:32 UTC
0 points
0 comments · 2 min read · LW link

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Chris Lakin · 3 Jan 2024 17:55 UTC
48 points
3 comments · 3 min read · LW link

Apply to the Conceptual Boundaries Workshop for AI Safety

Chris Lakin · 27 Nov 2023 21:04 UTC
50 points
0 comments · 3 min read · LW link

Out of the Box

jesseduffield · 13 Nov 2023 23:43 UTC
5 points
1 comment · 7 min read · LW link

[Question] Will OpenAI also require a “Super Red Team Agent” for its “Superalignment” Project?

Super AGI · 30 Mar 2024 5:25 UTC
2 points
2 comments · 1 min read · LW link

A Thermodynamic Theory of Intelligence: Why Extreme Optimization May Be Mathematically Impossible

Adreius · 29 May 2025 12:18 UTC
1 point
0 comments · 3 min read · LW link

A Safer Path to AGI? Considering the Self-to-Processing Route as an Alternative to Processing-to-Self

op · 21 Apr 2025 13:09 UTC
1 point
0 comments · 1 min read · LW link

Relational Alignment: Trust, Repair, and the Emotional Work of AI

Priyanka Bharadwaj · 8 May 2025 2:44 UTC
3 points
0 comments · 3 min read · LW link

A New Framework for AI Alignment: A Philosophical Approach

niscalajyoti · 25 Jun 2025 2:41 UTC
1 point
0 comments · 1 min read · LW link
(archive.org)

Human-AI Complementarity: A Goal for Amplified Oversight

24 Dec 2024 9:57 UTC
27 points
4 comments · 1 min read · LW link
(deepmindsafetyresearch.medium.com)

Gaia Network: An Illustrated Primer

18 Jan 2024 18:23 UTC
3 points
2 comments · 15 min read · LW link

What If Alignment Wasn’t About Obedience?

fdescamps49935@gmail.com · 25 Jun 2025 20:04 UTC
1 point
0 comments · 2 min read · LW link