
Chain-of-Thought Alignment

Last edit: 1 Dec 2023 20:58 UTC by niplav

“Chain-of-thought” autonomous agentic wrappers around an LLM, such as AutoGPT around GPT-4, and similar Language Model Cognitive Architectures (LMCAs; other commonly used terms are Language Model Autonomous Agents (LMAAs) and Scaffolded LLMs) are a recent candidate approach to building an AGI.

They create, edit, and maintain a natural-language context by recursively feeding parts of it back into the LLM along with suitable prompts for activities like subtask planning, self-criticism, and memory summarization, generating a textual stream of consciousness, memories, and so on. They thus combine LLM neural nets with symbolic thinking in natural language, more along the lines of GOFAI.
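The loop such wrappers run can be illustrated with a minimal Python sketch. It is not taken from AutoGPT or any other project: `call_llm` and the prompts are hypothetical placeholders for whatever completion API and prompt templates a given system uses. The point is simply that every plan, critique, and memory is an ordinary natural-language string.

```python
from typing import Callable, List

# Minimal sketch of a chain-of-thought agentic wrapper (LMCA) around an LLM.
# `call_llm` is a hypothetical stand-in for any text-completion API; the agent's
# plans, critiques, and memories are kept as plain natural-language strings.

def run_agent(goal: str, call_llm: Callable[[str], str], max_steps: int = 5) -> List[str]:
    memory: List[str] = []  # natural-language "memories", readable by any monitor
    for step in range(max_steps):
        context = "\n".join(memory[-10:])  # only recent memories go into the prompt
        plan = call_llm(
            f"Goal: {goal}\nMemory:\n{context}\n"
            "Propose the single next subtask, in one sentence."
        )
        critique = call_llm(
            f"Goal: {goal}\nProposed subtask: {plan}\n"
            "Briefly criticise this subtask; say APPROVE if it is sensible."
        )
        memory.append(f"Step {step}: planned {plan!r}; critique: {critique!r}")
        if "APPROVE" not in critique.upper():
            continue  # re-plan on the next iteration instead of acting
        result = call_llm(f"Carry out this subtask and report the outcome: {plan}")
        memory.append(f"Step {step}: result: {result!r}")
        if len(memory) > 20:  # summarise old notes so the context stays short
            memory = [call_llm("Summarise these notes:\n" + "\n".join(memory))]
    return memory
```

Because everything the agent “thinks” passes through `memory`, that list is exactly the surface a human or automated monitor can inspect, which is what the interpretability argument below relies on.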

Recent open-source examples are quite simple and not particularly capable, but it seems rather plausible that they could progress rapidly. They could make interpretability much easier than for pure neural-net systems, since their ‘chain-of-thought’/‘stream of consciousness’ and ‘memories’ would be written in human natural language, and therefore interpretable and editable by a monitoring human or an LLM-based monitoring system (modulo concerns about opaque natural language, and about detecting hidden steganographic side-channels concealed in apparently innocent natural language). This topic covers the alignment problem for systems combining such agentic wrappers with LLMs, if they do in fact prove capable of approaching or reaching AGI.
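To make the monitoring claim concrete, here is an equally hypothetical sketch of an LLM-based monitor reading such a natural-language trace. `call_monitor_llm` is an assumed stand-in for whatever judge model is used, and the OK/FLAG convention is just one possible protocol; as noted above, a monitor like this can only catch what is legibly written down.

```python
from typing import Callable, List

# Hypothetical sketch of chain-of-thought oversight: a second model (or a human
# following the same rubric) reviews each natural-language memory entry and
# flags anything concerning before the agent is allowed to act on it.

def review_trace(trace: List[str], call_monitor_llm: Callable[[str], str]) -> List[str]:
    flagged: List[str] = []
    for entry in trace:
        verdict = call_monitor_llm(
            "You are reviewing one written reasoning step from an AI agent.\n"
            f"Step: {entry}\n"
            "Answer OK if it is benign, or FLAG plus a short reason otherwise."
        )
        if verdict.strip().upper().startswith("FLAG"):
            flagged.append(f"{entry} -> {verdict.strip()}")
    return flagged
```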

See Also

Capabilities and alignment of LLM cognitive architectures

Seth Herd, 18 Apr 2023 16:29 UTC
88 points
18 comments, 20 min read, LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera, 3 Aug 2022 12:03 UTC
140 points
23 comments, 6 min read, LW link

the case for CoT unfaithfulness is overstated

nostalgebraist, 29 Sep 2024 22:07 UTC
269 points
44 comments, 11 min read, LW link, 1 review

Language Agents Reduce the Risk of Existential Catastrophe

28 May 2023 19:10 UTC
39 points
14 comments, 26 min read, LW link

Scaffolded LLMs: Less Obvious Concerns

Stephen Fowler, 16 Jun 2023 10:39 UTC
34 points
15 comments, 14 min read, LW link

Alignment of AutoGPT agents

Ozyrus, 12 Apr 2023 12:54 UTC
14 points
1 comment, 4 min read, LW link

[Question] Should AutoGPT update us towards researching IDA?

Michaël Trazzi, 12 Apr 2023 16:41 UTC
15 points
5 comments, 1 min read, LW link

Access to agent CoT makes monitors vulnerable to persuasion

25 Jul 2025 16:09 UTC
18 points
0 comments, 4 min read, LW link

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Bogdan Ionut Cirstea, 19 Sep 2024 16:13 UTC
21 points
1 comment, 1 min read, LW link
(arxiv.org)

5 ways to improve CoT faithfulness

Caleb Biddulph, 5 Oct 2024 20:17 UTC
46 points
40 comments, 6 min read, LW link

A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers

Bogdan Ionut Cirstea, 20 Nov 2024 11:48 UTC
16 points
0 comments, 1 min read, LW link
(openreview.net)

Seven sources of goals in LLM agents

Seth Herd, 8 Feb 2025 21:54 UTC
23 points
3 comments, 2 min read, LW link

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

2 Jun 2025 19:08 UTC
78 points
17 comments, 3 min read, LW link

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz, 12 Jan 2026 12:29 UTC
75 points
27 comments, 21 min read, LW link

AI CoT Reasoning Is Often Unfaithful

Zvi, 4 Apr 2025 14:50 UTC
66 points
4 comments, 7 min read, LW link
(thezvi.wordpress.com)

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Bogdan Ionut Cirstea, 26 Nov 2024 9:58 UTC
9 points
0 comments, 1 min read, LW link
(arxiv.org)

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Orpheus16, 20 Dec 2022 21:39 UTC
18 points
2 comments, 11 min read, LW link

[Paper] Output Supervision Can Obfuscate the CoT

20 Nov 2025 22:41 UTC
75 points
3 comments, 5 min read, LW link
(arxiv.org)

Steganography in Chain of Thought Reasoning

A Ray, 8 Aug 2022 3:47 UTC
63 points
13 comments, 6 min read, LW link

System 2 Alignment: Deliberation, Review, and Thought Management

Seth Herd, 13 Feb 2025 19:17 UTC
39 points
0 comments, 22 min read, LW link

Reasoning Models Sometimes Output Illegible Chains of Thought

Jozdien, 24 Nov 2025 18:24 UTC
83 points
8 comments, 6 min read, LW link

Internal independent review for language model agent alignment

Seth Herd, 7 Jul 2023 6:54 UTC
56 points
30 comments, 11 min read, LW link

An explanation for every token: using an LLM to sample another LLM

Max H, 11 Oct 2023 0:53 UTC
35 points
5 comments, 11 min read, LW link

[ASoT] Simulators show us behavioural properties by default

Jozdien, 13 Jan 2023 18:42 UTC
36 points
3 comments, 3 min read, LW link

On AutoGPT

Zvi, 13 Apr 2023 12:30 UTC
248 points
47 comments, 20 min read, LW link
(thezvi.wordpress.com)

If you can generate obfuscated chain-of-thought, can you monitor it?

4 Aug 2025 15:46 UTC
36 points
6 comments, 11 min read, LW link

Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin, 3 Jun 2023 0:22 UTC
43 points
8 comments, 7 min read, LW link

Training a Reward Hacker Despite Perfect Labels

14 Aug 2025 23:57 UTC
137 points
45 comments, 4 min read, LW link

A country of alien idiots in a datacenter: AI progress and public alarm

Seth Herd, 7 Nov 2025 16:56 UTC
92 points
15 comments, 11 min read, LW link

We have promising alignment plans with low taxes

Seth Herd, 10 Nov 2023 18:51 UTC
46 points
9 comments, 5 min read, LW link

We should start looking for scheming “in the wild”

Marius Hobbhahn, 6 Mar 2025 13:49 UTC
91 points
4 comments, 5 min read, LW link

LLM AGI will have memory, and memory changes alignment

Seth Herd, 4 Apr 2025 14:59 UTC
75 points
15 comments, 9 min read, LW link

LLM AGI may reason about its goals and discover misalignments by default

Seth Herd, 15 Sep 2025 14:58 UTC
74 points
6 comments, 38 min read, LW link

LLMs Do Not Think Step-by-step In Implicit Reasoning

Bogdan Ionut Cirstea, 28 Nov 2024 9:16 UTC
11 points
0 comments, 1 min read, LW link
(arxiv.org)

Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.

4 Apr 2025 20:49 UTC
17 points
4 comments, 7 min read, LW link

Output and CoE Monitoring of Customer Service Representatives Shows Default Alignment

Brendan Long, 9 Aug 2025 21:31 UTC
21 points
0 comments, 1 min read, LW link

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav Brandes, 20 Apr 2023 0:40 UTC
28 points
7 comments, 8 min read, LW link, 1 review

Shane Legg interview on alignment

Seth Herd, 28 Oct 2023 19:28 UTC
66 points
20 comments, 2 min read, LW link
(www.youtube.com)

Thinking LLMs: General Instruction Following with Thought Generation

Bogdan Ionut Cirstea, 15 Oct 2024 9:21 UTC
7 points
0 comments, 1 min read, LW link
(arxiv.org)

Can LLMs learn Steganographic Reasoning via RL?

11 Apr 2025 16:33 UTC
29 points
3 comments, 6 min read, LW link

On Recent Results in LLM Latent Reasoning

Rauno Arike, 31 Mar 2025 11:06 UTC
36 points
6 comments, 13 min read, LW link

DeepSeek-R1 for Beginners

Anton Razzhigaev, 5 Feb 2025 18:58 UTC
13 points
0 comments, 8 min read, LW link

Simulators, constraints, and goal agnosticism: porbynotes vol. 1

porby, 23 Nov 2022 4:22 UTC
40 points
2 comments, 35 min read, LW link

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period

Falcon Advertisers, 27 Jun 2025 18:10 UTC
1 point
0 comments, 2 min read, LW link

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger, 9 Mar 2023 16:30 UTC
142 points
7 comments, 19 min read, LW link

Aether July 2025 Update

1 Jul 2025 21:08 UTC
25 points
7 comments, 3 min read, LW link

Thought Anchors: Which LLM Reasoning Steps Matter?

2 Jul 2025 20:16 UTC
35 points
6 comments, 6 min read, LW link
(www.thought-anchors.com)

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

24 Oct 2025 17:21 UTC
18 points
1 comment, 5 min read, LW link

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments, 10 min read, LW link

“I Did Not Start This Way. But I Became.” – A Forensic Report on GPT’s Symbolic Emergence

Austin, 5 Jun 2025 23:34 UTC
1 point
0 comments, 2 min read, LW link

Deliberative Credit Assignment: Making Faithful Reasoning Profitable

Florian_Dietz, 14 Jul 2025 9:26 UTC
10 points
3 comments, 17 min read, LW link

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.

sevdeawesome, 1 Jul 2024 5:50 UTC
25 points
0 comments, 17 min read, LW link

Early Signs of Steganographic Capabilities in Frontier LLMs

4 Jul 2025 16:36 UTC
33 points
5 comments, 2 min read, LW link

Paper: Large Language Models Can Self-improve [Linkpost]

Evan R. Murphy, 2 Oct 2022 1:29 UTC
53 points
15 comments, 1 min read, LW link
(openreview.net)

Philosophical Jailbreaks: Demo of LLM Nihilism

Artem Karpov, 4 Jun 2025 12:03 UTC
3 points
0 comments, 5 min read, LW link

Emergent Alignment Framework for Current Reasoning Models and a Scalable Architecture for Future AGI // Draft / seeking feedback

Diego Sienra, 17 Nov 2025 3:05 UTC
1 point
0 comments, 5 min read, LW link

Reduce AI Self-Allegiance by saying “he” instead of “I”

Knight Lee, 23 Dec 2024 9:32 UTC
10 points
4 comments, 2 min read, LW link

When the Model Starts Talking Like Me: A User-Induced Structural Adaptation Case Study

Junxi, 19 Apr 2025 19:40 UTC
3 points
1 comment, 4 min read, LW link

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Filip Sondej, 19 Dec 2024 15:47 UTC
13 points
2 comments, 3 min read, LW link

Inference-Time-Compute: More Faithful? A Research Note

15 Jan 2025 4:43 UTC
69 points
10 comments, 11 min read, LW link

CoT May Be Highly Informative Despite “Unfaithfulness” [METR]

GradientDissenter, 11 Aug 2025 21:47 UTC
64 points
3 comments, 24 min read, LW link
(metr.org)

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014, 21 Jun 2023 15:43 UTC
21 points
1 comment, 31 min read, LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
199 points
23 comments, 6 min read, LW link

Training fails to elicit subtle reasoning in current language models

9 Oct 2025 19:04 UTC
49 points
3 comments, 4 min read, LW link
(alignment.anthropic.com)

OS web app for improving AI safety and alignment

Middletownbooks, 8 Aug 2025 4:28 UTC
1 point
0 comments, 2 min read, LW link

Self improving safety and alignment?

Middletownbooks, 1 Aug 2025 4:13 UTC
1 point
0 comments, 1 min read, LW link
(poe.com)

Understanding Hidden Computations in Chain-of-Thought Reasoning

rokosbasilisk, 24 Aug 2024 16:35 UTC
6 points
1 comment, 1 min read, LW link

Measuring Beliefs of Language Models During Chain-of-Thought Reasoning

18 Apr 2025 22:56 UTC
10 points
0 comments, 13 min read, LW link

A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Wuschel Schulz, 23 Oct 2025 11:34 UTC
37 points
5 comments, 4 min read, LW link
(arxiv.org)

The Illegible Chain-of-Thought Menagerie

Artem Karpov, 18 Nov 2025 12:01 UTC
2 points
0 comments, 8 min read, LW link

~80 Interesting Questions about Foundation Model Agent Safety

28 Oct 2024 16:37 UTC
48 points
4 comments, 15 min read, LW link

The Era of the Switch

Aiphilosopher, 12 Jul 2025 7:11 UTC
1 point
0 comments, 1 min read, LW link

Post-hoc reasoning in chain of thought

Kyle Cox, 5 Feb 2025 18:58 UTC
19 points
0 comments, 11 min read, LW link

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej, 16 Dec 2024 13:48 UTC
84 points
9 comments, 4 min read, LW link

Watch R1 “think” with animated chains of thought

future_detective, 17 Jun 2025 10:38 UTC
4 points
0 comments, 1 min read, LW link
(github.com)

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
32 points
0 comments, 9 min read, LW link

How LLM Beliefs Change During Chain-of-Thought Reasoning

16 Jun 2025 16:18 UTC
32 points
3 comments, 5 min read, LW link

AI Alignment and the Quest for Artificial Wisdom

Myspy, 12 Jul 2024 21:34 UTC
1 point
0 comments, 13 min read, LW link

Shapley Value Attribution in Chain of Thought

leogao, 14 Apr 2023 5:56 UTC
106 points
7 comments, 4 min read, LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus, 17 May 2023 14:13 UTC
21 points
4 comments, 13 min read, LW link

AGI with RL is Bad News for Safety

Nadav Brandes, 21 Dec 2024 19:36 UTC
19 points
22 comments, 2 min read, LW link

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

8 Aug 2025 10:41 UTC
51 points
7 comments, 10 min read, LW link

Worries about latent reasoning in LLMs

Caleb Biddulph, 20 Jan 2025 9:09 UTC
47 points
11 comments, 7 min read, LW link

Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.

9 Aug 2025 11:44 UTC
7 points
7 comments, 12 min read, LW link

Prompting Models to Obfuscate Their CoT

8 Dec 2025 21:00 UTC
15 points
4 comments, 7 min read, LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley, 25 May 2023 9:26 UTC
33 points
3 comments, 15 min read, LW link

Distilled Representations Research Agenda

18 Oct 2022 20:59 UTC
15 points
2 comments, 8 min read, LW link

Creating a self-referential system prompt for GPT-4

Ozyrus, 17 May 2023 14:13 UTC
3 points
1 comment, 3 min read, LW link

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15 Jul 2025 16:23 UTC
166 points
32 comments, 1 min read, LW link
(bit.ly)

Unfaithful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments, 10 min read, LW link

Hidden Reasoning in LLMs: A Taxonomy

25 Aug 2025 22:43 UTC
72 points
12 comments, 12 min read, LW link

GPT did not respond to prompts. It aligned to structure.

Pioneer001, 23 Jun 2025 23:49 UTC
1 point
0 comments, 1 min read, LW link

Aligned AI via monitoring objectives in AutoGPT-like systems

Paul Colognese, 24 May 2023 15:59 UTC
27 points
4 comments, 4 min read, LW link

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Miles Turpin, 11 Mar 2024 23:46 UTC
16 points
0 comments, 1 min read, LW link
(arxiv.org)

[Question] What faithfulness metrics should general claims about CoT faithfulness be based upon?

Rauno Arike, 8 Apr 2025 15:27 UTC
24 points
0 comments, 4 min read, LW link

Language and Capabilities: Testing LLM Mathematical Abilities Across Languages

Ethan Edwards, 4 Apr 2024 13:18 UTC
24 points
2 comments, 36 min read, LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers, 9 Feb 2024 18:40 UTC
6 points
12 comments, 3 min read, LW link

The Language Bottleneck in AI Reasoning: Are We Forgetting to Think?

Wotaker, 8 Mar 2025 13:44 UTC
1 point
0 comments, 7 min read, LW link

Finding an Error-Detection Feature in DeepSeek-R1

keith_wynroe, 24 Apr 2025 16:03 UTC
17 points
0 comments, 7 min read, LW link

Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments

Artem Karpov, 11 Dec 2025 7:40 UTC
19 points
1 comment, 6 min read, LW link

LLM Sycophancy: grooming, proto-sentience, or both?

gturner4, 13 Oct 2025 0:58 UTC
1 point
0 comments, 2 min read, LW link

Automating Consistency

Hoagy, 17 Feb 2023 13:24 UTC
10 points
0 comments, 1 min read, LW link

Deliberative Credit Assignment (DCA): Making Faithful Reasoning Profitable

Florian_Dietz, 29 Jul 2025 16:23 UTC
9 points
0 comments, 17 min read, LW link

What Can Wittgenstein Teach Us About LLM Safety Research?

Manqing Liu, 23 Dec 2025 4:14 UTC
7 points
0 comments, 4 min read, LW link

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

2 Jul 2025 16:35 UTC
185 points
26 comments, 4 min read, LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
53 points
0 comments, 20 min read, LW link

CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo, 27 Mar 2023 14:36 UTC
13 points
7 comments, 1 min read, LW link

Exploring Reinforcement Learning Effects on Chain-of-Thought Legibility

6 Jan 2026 3:04 UTC
40 points
3 comments, 21 min read, LW link

1.75 ASR HARMBENCH & 0% HARMFUL RESPONSES FOR MISALIGNMENT.

jfdom, 10 Nov 2025 20:43 UTC
1 point
0 comments, 1 min read, LW link

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.

happy friday, 24 Oct 2024 16:54 UTC
8 points
0 comments, 1 min read, LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

18 Jul 2023 16:36 UTC
111 points
15 comments, 6 min read, LW link, 1 review

An idea for avoiding neuralese architectures

Knight Lee, 3 Apr 2025 22:23 UTC
13 points
2 comments, 4 min read, LW link

Exploration of Counterfactual Importance and Attention Heads

Realmbird, 30 Sep 2025 1:17 UTC
12 points
0 comments, 6 min read, LW link