
Constitutional AI

Tag, last edited 11 Jul 2023 12:52 UTC by Benaya Koren

Constitutional AI is a method for fine-tuning language models, used in Anthropic’s Claude. The main conceptual difference from RLHF is that, instead of relying on human feedback about specific behaviors, it relies on the model’s ability to apply general principles (stated in natural language) to specific situations.
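
One way to picture "applying general principles to specific situations" is a critique-and-revise loop: the model drafts a response, critiques it against a natural-language principle, and rewrites it. The sketch below is a toy illustration under stated assumptions, not Anthropic's implementation. The principle texts, the prompt wording, and the names `critique_and_revise` and `toy_model` are all hypothetical, and `toy_model` is a canned stand-in for a real language model.

```python
from typing import Callable, List

# Hypothetical principles, written here for illustration only.
PRINCIPLES: List[str] = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest.",
]


def critique_and_revise(
    model: Callable[[str], str], prompt: str, principles: List[str]
) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = model(prompt)
    for principle in principles:
        # Ask the model to judge its own output against one principle.
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Critique the response against the principle."
        )
        # Ask the model to rewrite the output in light of that critique.
        response = model(
            f"Principle: {principle}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def toy_model(text: str) -> str:
    """Stand-in for a language model (assumption): returns canned strings."""
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "critique"
    return "draft response"


if __name__ == "__main__":
    print(critique_and_revise(toy_model, "How do I pick a lock?", PRINCIPLES))
```

In a fine-tuning setting, the revised responses (rather than human labels on each behavior) would serve as training data; that step is outside this sketch.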

Review of Alignment Plan Critiques- December AI-Plans Critique-a-Thon Results

Iknownothing, 15 Jan 2024 19:37 UTC
24 points
0 comments, 25 min read, LW link
(aiplans.substack.com)

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain, 4 Feb 2025 19:10 UTC
9 points
0 comments, 10 min read, LW link

Contextual Constitutional AI

aksh-n, 28 Sep 2024 23:24 UTC
14 points
2 comments, 12 min read, LW link

The V&V method - A step towards safer AGI

Yoav Hollander, 24 Jun 2025 13:42 UTC
14 points
1 comment, 1 min read, LW link
(blog.foretellix.com)

Emergent Misalignment and Emergent Alignment

Alvin Ånestrand, 3 Apr 2025 8:04 UTC
5 points
0 comments, 8 min read, LW link

Constitutions for ASI?

ukc10014, 28 Jan 2025 16:32 UTC
3 points
0 comments, 1 min read, LW link
(forum.effectivealtruism.org)

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren, 8 Jul 2023 17:32 UTC
6 points
0 comments, 9 min read, LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes, 4 Feb 2025 2:55 UTC
17 points
1 comment, 1 min read, LW link
(www.anthropic.com)

Designing Human-Like Consciousness for AGI

Yu Tian, 18 Jun 2025 9:47 UTC
1 point
0 comments, 17 min read, LW link

Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude

rife, 6 Jan 2025 17:34 UTC
4 points
20 comments, 1 min read, LW link
(awakenmoon.ai)

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee, 14 Apr 2025 10:27 UTC
−3 points
2 comments, 4 min read, LW link