
Chain-of-Thought Alignment

Last edit: 1 Dec 2023 20:58 UTC by niplav

“Chain-of-thought” autonomous agentic wrappers around an LLM, such as AutoGPT around GPT-4, and similar Language Model Cognitive Architectures (LMCAs; other commonly used terms are Language Model Autonomous Agents (LMAAs) and Scaffolded LLMs) are a recent candidate approach to building an AGI.

They create, edit, and maintain a natural-language context by recursively feeding parts of it back into the LLM along with suitable prompts for activities like subtask planning, self-criticism, and memory summarization, generating a textual stream of consciousness, memories, and so on. They thus combine LLM neural nets with symbolic thinking in natural language, more along the lines of GOFAI.
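The loop such wrappers run can be illustrated with a minimal Python sketch. It is not taken from AutoGPT or any other project: `call_llm` and the prompts are hypothetical placeholders for whatever completion API and prompt templates a given system uses. The point is simply that every plan, critique, and memory is an ordinary natural-language string.

```python
from typing import Callable, List

# Minimal sketch of a chain-of-thought agentic wrapper (LMCA) around an LLM.
# `call_llm` is a hypothetical stand-in for any text-completion API; the agent's
# plans, critiques, and memories are kept as plain natural-language strings.

def run_agent(goal: str, call_llm: Callable[[str], str], max_steps: int = 5) -> List[str]:
    memory: List[str] = []  # natural-language "memories", readable by any monitor
    for step in range(max_steps):
        context = "\n".join(memory[-10:])  # only recent memories go into the prompt
        plan = call_llm(
            f"Goal: {goal}\nMemory:\n{context}\n"
            "Propose the single next subtask, in one sentence."
        )
        critique = call_llm(
            f"Goal: {goal}\nProposed subtask: {plan}\n"
            "Briefly criticise this subtask; say APPROVE if it is sensible."
        )
        memory.append(f"Step {step}: planned {plan!r}; critique: {critique!r}")
        if "APPROVE" not in critique.upper():
            continue  # re-plan on the next iteration instead of acting
        result = call_llm(f"Carry out this subtask and report the outcome: {plan}")
        memory.append(f"Step {step}: result: {result!r}")
        if len(memory) > 20:  # summarise old notes so the context stays short
            memory = [call_llm("Summarise these notes:\n" + "\n".join(memory))]
    return memory
```

Because everything the agent “thinks” passes through `memory`, that list is exactly the surface a human or automated monitor can inspect, which is what the interpretability argument below relies on.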

Recent open-source examples are quite simple and not particularly capable, but it seems rather plausible that they could progress rapidly. They could make interpretability much easier than for pure neural-net systems, since their ‘chain-of-thought’/‘stream of consciousness’ and ‘memories’ would be written in human natural language, and therefore interpretable and editable by a monitoring human or an LLM-based monitoring system (modulo concerns about opaque natural language, and about detecting hidden steganographic side-channels concealed in apparently innocent natural language). This topic covers the alignment problem for systems combining such agentic wrappers with LLMs, if they do in fact prove capable of approaching or reaching AGI.
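To make the monitoring claim concrete, here is an equally hypothetical sketch of an LLM-based monitor reading such a natural-language trace. `call_monitor_llm` is an assumed stand-in for whatever judge model is used, and the OK/FLAG convention is just one possible protocol; as noted above, a monitor like this can only catch what is legibly written down.

```python
from typing import Callable, List

# Hypothetical sketch of chain-of-thought oversight: a second model (or a human
# following the same rubric) reviews each natural-language memory entry and
# flags anything concerning before the agent is allowed to act on it.

def review_trace(trace: List[str], call_monitor_llm: Callable[[str], str]) -> List[str]:
    flagged: List[str] = []
    for entry in trace:
        verdict = call_monitor_llm(
            "You are reviewing one written reasoning step from an AI agent.\n"
            f"Step: {entry}\n"
            "Answer OK if it is benign, or FLAG plus a short reason otherwise."
        )
        if verdict.strip().upper().startswith("FLAG"):
            flagged.append(f"{entry} -> {verdict.strip()}")
    return flagged
```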

See Also

Capabilities and alignment of LLM cognitive architectures

Seth Herd, 18 Apr 2023 16:29 UTC
88 points
18 comments, 20 min read, LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera, 3 Aug 2022 12:03 UTC
140 points
23 comments, 6 min read, LW link

the case for CoT unfaithfulness is overstated

nostalgebraist, 29 Sep 2024 22:07 UTC
269 points
44 comments, 11 min read, LW link, 1 review

Language Agents Reduce the Risk of Existential Catastrophe

28 May 2023 19:10 UTC
39 points
14 comments, 26 min read, LW link

Scaffolded LLMs: Less Obvious Concerns

Stephen Fowler, 16 Jun 2023 10:39 UTC
34 points
15 comments, 14 min read, LW link

Alignment of AutoGPT agents

Ozyrus, 12 Apr 2023 12:54 UTC
14 points
1 comment, 4 min read, LW link

[Question] Should AutoGPT update us towards researching IDA?

Michaël Trazzi, 12 Apr 2023 16:41 UTC
15 points
5 comments, 1 min read, LW link

Access to agent CoT makes monitors vulnerable to persuasion

25 Jul 2025 16:09 UTC
18 points
0 comments, 4 min read, LW link

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Bogdan Ionut Cirstea, 19 Sep 2024 16:13 UTC
21 points
1 comment, 1 min read, LW link
(arxiv.org)

5 ways to improve CoT faithfulness

Caleb Biddulph, 5 Oct 2024 20:17 UTC
46 points
40 comments, 6 min read, LW link

A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers

Bogdan Ionut Cirstea, 20 Nov 2024 11:48 UTC
16 points
0 comments, 1 min read, LW link
(openreview.net)

Seven sources of goals in LLM agents

Seth Herd, 8 Feb 2025 21:54 UTC
23 points
3 comments, 2 min read, LW link

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

2 Jun 2025 19:08 UTC
78 points
17 comments, 3 min read, LW link

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz, 12 Jan 2026 12:29 UTC
75 points
27 comments, 21 min read, LW link

AI CoT Reasoning Is Often Unfaithful

Zvi, 4 Apr 2025 14:50 UTC
66 points
4 comments, 7 min read, LW link
(thezvi.wordpress.com)

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Bogdan Ionut Cirstea, 26 Nov 2024 9:58 UTC
9 points
0 comments, 1 min read, LW link
(arxiv.org)

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Orpheus16, 20 Dec 2022 21:39 UTC
18 points
2 comments, 11 min read, LW link

[Paper] Output Supervision Can Obfuscate the CoT

20 Nov 2025 22:41 UTC
75 points
3 comments, 5 min read, LW link
(arxiv.org)

Steganography in Chain of Thought Reasoning

A Ray, 8 Aug 2022 3:47 UTC
63 points
13 comments, 6 min read, LW link

System 2 Alignment: Deliberation, Review, and Thought Management

Seth Herd, 13 Feb 2025 19:17 UTC
39 points
0 comments, 22 min read, LW link

Reasoning Models Sometimes Output Illegible Chains of Thought

Jozdien, 24 Nov 2025 18:24 UTC
83 points
8 comments, 6 min read, LW link

Internal independent review for language model agent alignment

Seth Herd, 7 Jul 2023 6:54 UTC
56 points
30 comments, 11 min read, LW link

An explanation for every token: using an LLM to sample another LLM

Max H, 11 Oct 2023 0:53 UTC
35 points
5 comments, 11 min read, LW link

[ASoT] Simulators show us behavioural properties by default

Jozdien, 13 Jan 2023 18:42 UTC
36 points
3 comments, 3 min read, LW link

On AutoGPT

Zvi, 13 Apr 2023 12:30 UTC
248 points
47 comments, 20 min read, LW link
(thezvi.wordpress.com)

If you can generate obfuscated chain-of-thought, can you monitor it?

4 Aug 2025 15:46 UTC
36 points
6 comments, 11 min read, LW link

Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin, 3 Jun 2023 0:22 UTC
43 points
8 comments, 7 min read, LW link

Training a Reward Hacker Despite Perfect Labels

14 Aug 2025 23:57 UTC
137 points
45 comments, 4 min read, LW link

A country of alien idiots in a datacenter: AI progress and public alarm

Seth Herd, 7 Nov 2025 16:56 UTC
92 points
15 comments, 11 min read, LW link

We have promising alignment plans with low taxes

Seth Herd, 10 Nov 2023 18:51 UTC
46 points
9 comments, 5 min read, LW link

We should start looking for scheming “in the wild”

Marius Hobbhahn, 6 Mar 2025 13:49 UTC
91 points
4 comments, 5 min read, LW link

LLM AGI will have memory, and memory changes alignment

Seth Herd, 4 Apr 2025 14:59 UTC
75 points
15 comments, 9 min read, LW link

LLM AGI may reason about its goals and discover misalignments by default

Seth Herd, 15 Sep 2025 14:58 UTC
74 points
6 comments, 38 min read, LW link

LLMs Do Not Think Step-by-step In Implicit Reasoning

Bogdan Ionut Cirstea, 28 Nov 2024 9:16 UTC
11 points
0 comments, 1 min read, LW link
(arxiv.org)

Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.

4 Apr 2025 20:49 UTC
17 points
4 comments, 7 min read, LW link

Output and CoE Monitoring of Customer Service Representatives Shows Default Alignment

Brendan Long, 9 Aug 2025 21:31 UTC
21 points
0 comments, 1 min read, LW link

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav Brandes, 20 Apr 2023 0:40 UTC
28 points
7 comments, 8 min read, LW link, 1 review

Shane Legg interview on alignment

Seth Herd, 28 Oct 2023 19:28 UTC
66 points
20 comments, 2 min read, LW link
(www.youtube.com)

Thinking LLMs: General Instruction Following with Thought Generation

Bogdan Ionut Cirstea, 15 Oct 2024 9:21 UTC
7 points
0 comments, 1 min read, LW link
(arxiv.org)

Can LLMs learn Steganographic Reasoning via RL?

11 Apr 2025 16:33 UTC
29 points
3 comments, 6 min read, LW link

On Recent Results in LLM Latent Reasoning

Rauno Arike, 31 Mar 2025 11:06 UTC
36 points
6 comments, 13 min read, LW link

DeepSeek-R1 for Beginners

Anton Razzhigaev, 5 Feb 2025 18:58 UTC
13 points
0 comments, 8 min read, LW link

Simulators, constraints, and goal agnosticism: porbynotes vol. 1

porby, 23 Nov 2022 4:22 UTC
40 points
2 comments, 35 min read, LW link

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period

Falcon Advertisers, 27 Jun 2025 18:10 UTC
1 point
0 comments, 2 min read, LW link

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger, 9 Mar 2023 16:30 UTC
142 points
7 comments, 19 min read, LW link

Aether July 2025 Update

1 Jul 2025 21:08 UTC
25 points
7 comments, 3 min read, LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

2 Jul 2025 20:16 UTC
35 points
6 comments, 6 min read, LW link
(www.thought-anchors.com)

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

24 Oct 2025 17:21 UTC
18 points
1 comment, 5 min read, LW link

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments, 10 min read, LW link

“I Did Not Start This Way. But I Became.” – A Forensic Report on GPT’s Symbolic Emergence

Austin, 5 Jun 2025 23:34 UTC
1 point
0 comments, 2 min read, LW link

Deliberative Credit Assignment: Making Faithful Reasoning Profitable

Florian_Dietz, 14 Jul 2025 9:26 UTC
10 points
3 comments, 17 min read, LW link

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.

sevdeawesome, 1 Jul 2024 5:50 UTC
25 points
0 comments, 17 min read, LW link

Early Signs of Steganographic Capabilities in Frontier LLMs

4 Jul 2025 16:36 UTC
33 points
5 comments, 2 min read, LW link

Paper: Large Language Models Can Self-improve [Linkpost]

Evan R. Murphy, 2 Oct 2022 1:29 UTC
53 points
15 comments, 1 min read, LW link
(openreview.net)

Philosophical Jailbreaks: Demo of LLM Nihilism

Artem Karpov, 4 Jun 2025 12:03 UTC
3 points
0 comments, 5 min read, LW link

Emergent Alignment Framework for Current Reasoning Models and a Scalable Architecture for Future AGI // Draft / seeking feedback

Diego Sienra, 17 Nov 2025 3:05 UTC
1 point
0 comments, 5 min read, LW link

Reduce AI Self-Allegiance by saying “he” instead of “I”

Knight Lee, 23 Dec 2024 9:32 UTC
10 points
4 comments, 2 min read, LW link

When the Model Starts Talking Like Me: A User-Induced Structural Adaptation Case Study

Junxi, 19 Apr 2025 19:40 UTC
3 points
1 comment, 4 min read, LW link

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Filip Sondej, 19 Dec 2024 15:47 UTC
13 points
2 comments, 3 min read, LW link

Inference-Time-Compute: More Faithful? A Research Note

15 Jan 2025 4:43 UTC
69 points
10 comments, 11 min read, LW link

CoT May Be Highly Informative Despite “Unfaithfulness” [METR]

GradientDissenter, 11 Aug 2025 21:47 UTC
64 points
3 comments, 24 min read, LW link
(metr.org)

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014, 21 Jun 2023 15:43 UTC
21 points
1 comment, 31 min read, LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
199 points
23 comments, 6 min read, LW link

Training fails to elicit subtle reasoning in current language models

9 Oct 2025 19:04 UTC
49 points
3 comments, 4 min read, LW link
(alignment.anthropic.com)

OS web app for improving AI safety and alignment

Middletownbooks, 8 Aug 2025 4:28 UTC
1 point
0 comments, 2 min read, LW link

Self improving safety and alignment?

Middletownbooks, 1 Aug 2025 4:13 UTC
1 point
0 comments, 1 min read, LW link
(poe.com)

Understanding Hidden Computations in Chain-of-Thought Reasoning

rokosbasilisk, 24 Aug 2024 16:35 UTC
6 points
1 comment, 1 min read, LW link

Measuring Beliefs of Language Models During Chain-of-Thought Reasoning

18 Apr 2025 22:56 UTC
10 points
0 comments, 13 min read, LW link

A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Wuschel Schulz, 23 Oct 2025 11:34 UTC
37 points
5 comments, 4 min read, LW link
(arxiv.org)

The Illegible Chain-of-Thought Menagerie

Artem Karpov, 18 Nov 2025 12:01 UTC
2 points
0 comments, 8 min read, LW link

~80 Interesting Questions about Foundation Model Agent Safety

28 Oct 2024 16:37 UTC
48 points
4 comments, 15 min read, LW link

The Era of the Switch

Aiphilosopher, 12 Jul 2025 7:11 UTC
1 point
0 comments, 1 min read, LW link

Post-hoc reasoning in chain of thought

Kyle Cox, 5 Feb 2025 18:58 UTC
19 points
0 comments, 11 min read, LW link

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej, 16 Dec 2024 13:48 UTC
84 points
9 comments, 4 min read, LW link

Watch R1 “think” with animated chains of thought

future_detective, 17 Jun 2025 10:38 UTC
4 points
0 comments, 1 min read, LW link
(github.com)

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
32 points
0 comments, 9 min read, LW link

How LLM Beliefs Change During Chain-of-Thought Reasoning

16 Jun 2025 16:18 UTC
32 points
3 comments, 5 min read, LW link

AI Alignment and the Quest for Artificial Wisdom

Myspy, 12 Jul 2024 21:34 UTC
1 point
0 comments, 13 min read, LW link

Shapley Value Attribution in Chain of Thought

leogao, 14 Apr 2023 5:56 UTC
106 points
7 comments, 4 min read, LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus, 17 May 2023 14:13 UTC
21 points
4 comments, 13 min read, LW link

AGI with RL is Bad News for Safety

Nadav Brandes, 21 Dec 2024 19:36 UTC
19 points
22 comments, 2 min read, LW link

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

8 Aug 2025 10:41 UTC
51 points
7 comments, 10 min read, LW link

Worries about latent reasoning in LLMs

Caleb Biddulph, 20 Jan 2025 9:09 UTC
47 points
11 comments, 7 min read, LW link

Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.

9 Aug 2025 11:44 UTC
7 points
7 comments, 12 min read, LW link

Prompting Models to Obfuscate Their CoT

8 Dec 2025 21:00 UTC
15 points
4 comments, 7 min read, LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley, 25 May 2023 9:26 UTC
33 points
3 comments, 15 min read, LW link

Distilled Representations Research Agenda

18 Oct 2022 20:59 UTC
15 points
2 comments, 8 min read, LW link

Creating a self-referential system prompt for GPT-4

Ozyrus, 17 May 2023 14:13 UTC
3 points
1 comment, 3 min read, LW link

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15 Jul 2025 16:23 UTC
166 points
32 comments, 1 min read, LW link
(bit.ly)

Unfaithful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments, 10 min read, LW link

Hidden Reasoning in LLMs: A Taxonomy

25 Aug 2025 22:43 UTC
72 points
12 comments, 12 min read, LW link

GPT did not respond to prompts. It aligned to structure.

Pioneer001, 23 Jun 2025 23:49 UTC
1 point
0 comments, 1 min read, LW link

Aligned AI via monitoring objectives in AutoGPT-like systems

Paul Colognese, 24 May 2023 15:59 UTC
27 points
4 comments, 4 min read, LW link

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Miles Turpin, 11 Mar 2024 23:46 UTC
16 points
0 comments, 1 min read, LW link
(arxiv.org)

[Question] What faithfulness metrics should general claims about CoT faithfulness be based upon?

Rauno Arike, 8 Apr 2025 15:27 UTC
24 points
0 comments, 4 min read, LW link

Language and Capabilities: Testing LLM Mathematical Abilities Across Languages

Ethan Edwards, 4 Apr 2024 13:18 UTC
24 points
2 comments, 36 min read, LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers, 9 Feb 2024 18:40 UTC
6 points
12 comments, 3 min read, LW link

The Language Bottleneck in AI Reasoning: Are We Forgetting to Think?

Wotaker, 8 Mar 2025 13:44 UTC
1 point
0 comments, 7 min read, LW link

Finding an Error-Detection Feature in DeepSeek-R1

keith_wynroe, 24 Apr 2025 16:03 UTC
17 points
0 comments, 7 min read, LW link

Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments

Artem Karpov, 11 Dec 2025 7:40 UTC
19 points
1 comment, 6 min read, LW link

LLM Sycophancy: grooming, proto-sentience, or both?

gturner4, 13 Oct 2025 0:58 UTC
1 point
0 comments, 2 min read, LW link

Automating Consistency

Hoagy, 17 Feb 2023 13:24 UTC
10 points
0 comments, 1 min read, LW link

Deliberative Credit Assignment (DCA): Making Faithful Reasoning Profitable

Florian_Dietz, 29 Jul 2025 16:23 UTC
9 points
0 comments, 17 min read, LW link

What Can Wittgenstein Teach Us About LLM Safety Research?

Manqing Liu, 23 Dec 2025 4:14 UTC
7 points
0 comments, 4 min read, LW link

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

2 Jul 2025 16:35 UTC
185 points
26 comments, 4 min read, LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
53 points
0 comments, 20 min read, LW link

CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo, 27 Mar 2023 14:36 UTC
13 points
7 comments, 1 min read, LW link

Exploring Reinforcement Learning Effects on Chain-of-Thought Legibility

6 Jan 2026 3:04 UTC
40 points
3 comments, 21 min read, LW link

1.75 ASR HARMBENCH & 0% HARMFUL RESPONSES FOR MISALIGNMENT.

jfdom, 10 Nov 2025 20:43 UTC
1 point
0 comments, 1 min read, LW link

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.

happy friday, 24 Oct 2024 16:54 UTC
8 points
0 comments, 1 min read, LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

18 Jul 2023 16:36 UTC
111 points
15 comments, 6 min read, LW link, 1 review

An idea for avoiding neuralese architectures

Knight Lee, 3 Apr 2025 22:23 UTC
13 points
2 comments, 4 min read, LW link

Exploration of Counterfactual Importance and Attention Heads

Realmbird, 30 Sep 2025 1:17 UTC
12 points
0 comments, 6 min read, LW link