RSS

Con­sti­tu­tional AI

TagLast edit: Jul 11, 2023, 12:52 PM by Benaya Koren

Constitutional AI is a method for fine-tuning language models, used in Anthropic’s Claude. The main conceptual difference from RLHF is that instead of human feedback on specific behaviors it relies on the model’s ability to apply general principles (stated in natural language) to specific situations.

Re­view of Align­ment Plan Cri­tiques- De­cem­ber AI-Plans Cri­tique-a-Thon Re­sults

IknownothingJan 15, 2024, 7:37 PM
24 points
0 comments25 min readLW link
(aiplans.substack.com)

Can Per­sua­sion Break AI Safety? Ex­plor­ing the In­ter­play Between Fine-Tun­ing, At­tacks, and Guardrails

Devina JainFeb 4, 2025, 7:10 PM
3 points
0 comments10 min readLW link

Con­tex­tual Con­sti­tu­tional AI

aksh-nSep 28, 2024, 11:24 PM
13 points
2 comments12 min readLW link

Emer­gent Misal­ign­ment and Emer­gent Alignment

Alvin ÅnestrandApr 3, 2025, 8:04 AM
5 points
0 comments8 min readLW link

Con­sti­tu­tions for ASI?

ukc10014Jan 28, 2025, 4:32 PM
3 points
0 comments1 min readLW link
(forum.effectivealtruism.org)

Con­tin­u­ous Ad­ver­sar­ial Qual­ity As­surance: Ex­tend­ing RLHF and Con­sti­tu­tional AI

Benaya KorenJul 8, 2023, 5:32 PM
6 points
0 comments9 min readLW link

Con­sti­tu­tional Clas­sifiers: Defend­ing against uni­ver­sal jailbreaks (An­thropic Blog)

ArchimedesFeb 4, 2025, 2:55 AM
16 points
1 comment1 min readLW link
(www.anthropic.com)

In­de­pen­dent re­search ar­ti­cle an­a­lyz­ing con­sis­tent self-re­ports of ex­pe­rience in ChatGPT and Claude

rifeJan 6, 2025, 5:34 PM
4 points
20 comments1 min readLW link
(awakenmoon.ai)

A Solu­tion to Sand­bag­ging and other Self-Prov­able Misal­ign­ment: Con­sti­tu­tional AI Detectives

Knight LeeApr 14, 2025, 10:27 AM
−3 points
2 comments4 min readLW link
No comments.