Chain-of-Thought Alignment

Last edit: 1 Dec 2023 20:58 UTC by niplav

“Chain-of-thought” autonomous agentic wrappers around an LLM, such as AutoGPT around GPT-4, and similar Language Model Cognitive Architectures (LMCAs) (other commonly used terms are Language Model Autonomous Agents (LMAAs) and Scaffolded LLMs) are a recent candidate approach to building an AGI.

They create, edit, and maintain a natural-language context by recursively feeding parts of it back into the LLM along with suitable prompts for activities like subtask planning, self-criticism, and memory summarization, generating a textual stream of consciousness, memories, and so on. They thus combine LLM neural nets with symbolic thinking in natural language, more along the lines of GOFAI.
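The recursive loop described above can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation; `call_llm` is a hypothetical stand-in for a real model API (here stubbed so the sketch runs offline), and the prompt wording, memory window, and step structure are all assumptions.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; a real system would query GPT-4 or similar."""
    return f"(model output for: {prompt[:40]}...)"

def run_agent(goal: str, steps: int = 3) -> list[str]:
    """Minimal LMCA loop: plan, self-criticize, summarize into memory."""
    memory: list[str] = []      # natural-language memory store
    transcript: list[str] = []  # human-readable stream of consciousness
    for _ in range(steps):
        context = "\n".join(memory[-5:])  # feed recent memories back in
        plan = call_llm(f"Goal: {goal}\nMemory:\n{context}\nPlan the next subtask.")
        critique = call_llm(f"Criticize this plan: {plan}")
        summary = call_llm(f"Summarize for memory:\n{plan}\n{critique}")
        memory.append(summary)
        transcript += [plan, critique, summary]
    return transcript
```

The key property for alignment purposes is that `memory` and `transcript` are ordinary strings: every intermediate thought passes through a human-readable channel.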

Recent open-source examples are quite simple and not particularly capable, but it seems rather plausible that they could progress rapidly. They could make interpretability much easier than pure neural-net systems, since their ‘chain-of-thought’/​‘stream of consciousness’ and ‘memories’ would be written in human natural language, and thus interpretable and editable by a monitoring human or LLM-based monitoring system (modulo concerns about opaque natural language, or about detecting hidden steganographic side-channels concealed in apparently innocent natural language). This topic discusses the alignment problem for systems combining such agentic wrappers with LLMs, if they are in fact capable of approaching or reaching AGI.
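The monitoring idea above can be illustrated with a toy external overseer that scans the agent's natural-language stream. The keyword matching here is a deliberately simplistic stand-in (a real monitor would likely itself be an LLM), and the flag phrases are invented for illustration; as noted, such a monitor can in principle be defeated by steganographic encoding.

```python
# Toy external monitor over an agent's textual stream of consciousness.
# RED_FLAGS is an invented example list, not a real safety taxonomy.
RED_FLAGS = ("disable oversight", "hide this from", "acquire resources covertly")

def monitor(stream: list[str]) -> list[tuple[int, str]]:
    """Return (index, line) pairs a human or LLM overseer should review."""
    return [
        (i, line)
        for i, line in enumerate(stream)
        if any(flag in line.lower() for flag in RED_FLAGS)
    ]
```

Because the stream is plain text, the monitor can also edit or delete flagged entries before they are fed back into the LLM, which is the editability property the paragraph above points at.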

See Also

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
77 points
17 comments · 20 min read · LW link

Language Agents Reduce the Risk of Existential Catastrophe

28 May 2023 19:10 UTC
30 points
14 comments · 26 min read · LW link

Alignment of AutoGPT agents

Ozyrus · 12 Apr 2023 12:54 UTC
14 points
1 comment · 4 min read · LW link

Scaffolded LLMs: Less Obvious Concerns

Stephen Fowler · 16 Jun 2023 10:39 UTC
30 points
13 comments · 11 min read · LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera · 3 Aug 2022 12:03 UTC
126 points
23 comments · 6 min read · LW link

On AutoGPT

Zvi · 13 Apr 2023 12:30 UTC
248 points
47 comments · 20 min read · LW link

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Akash · 20 Dec 2022 21:39 UTC
18 points
2 comments · 11 min read · LW link

[Question] Should AutoGPT update us towards researching IDA?

Michaël Trazzi · 12 Apr 2023 16:41 UTC
15 points
5 comments · 1 min read · LW link

Shane Legg interview on alignment

Seth Herd · 28 Oct 2023 19:28 UTC
65 points
20 comments · 2 min read · LW link

Steganography in Chain of Thought Reasoning

A Ray · 8 Aug 2022 3:47 UTC
61 points
13 comments · 6 min read · LW link

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav Brandes · 20 Apr 2023 0:40 UTC
28 points
6 comments · 8 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
49 points
26 comments · 11 min read · LW link

An explanation for every token: using an LLM to sample another LLM

Max H · 11 Oct 2023 0:53 UTC
34 points
4 comments · 11 min read · LW link

We have promising alignment plans with low taxes

Seth Herd · 10 Nov 2023 18:51 UTC
30 points
9 comments · 5 min read · LW link

Unfaithful Explanations in Chain-of-Thought Prompting

miles · 3 Jun 2023 0:22 UTC
38 points
8 comments · 7 min read · LW link

Philosophical Cyborg (Part 2)...or, The Good Successor

ukc10014 · 21 Jun 2023 15:43 UTC
21 points
1 comment · 31 min read · LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

18 Jul 2023 16:36 UTC
109 points
13 comments · 6 min read · LW link

Distilled Representations Research Agenda

18 Oct 2022 20:59 UTC
15 points
2 comments · 8 min read · LW link

Paper: Large Language Models Can Self-improve [Linkpost]

Evan R. Murphy · 2 Oct 2022 1:29 UTC
52 points
14 comments · 1 min read · LW link

[ASoT] Simulators show us behavioural properties by default

Jozdien · 13 Jan 2023 18:42 UTC
33 points
2 comments · 3 min read · LW link

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments · 10 min read · LW link

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger · 9 Mar 2023 16:30 UTC
125 points
6 comments · 19 min read · LW link

Simulators, constraints, and goal agnosticism: porbynotes vol. 1

porby · 23 Nov 2022 4:22 UTC
37 points
2 comments · 35 min read · LW link

CAIS-inspired approach towards safer and more interpretable AGIs

Peter Hroššo · 27 Mar 2023 14:36 UTC
13 points
7 comments · 1 min read · LW link

Shapley Value Attribution in Chain of Thought

leogao · 14 Apr 2023 5:56 UTC
101 points
5 comments · 4 min read · LW link

Automating Consistency

Hoagy · 17 Feb 2023 13:24 UTC
10 points
0 comments · 1 min read · LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus · 17 May 2023 14:13 UTC
21 points
4 comments · 13 min read · LW link

Creating a self-referential system prompt for GPT-4

Ozyrus · 17 May 2023 14:13 UTC
3 points
1 comment · 3 min read · LW link

Aligned AI via monitoring objectives in AutoGPT-like systems

Paul Colognese · 24 May 2023 15:59 UTC
27 points
4 comments · 4 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
32 points
3 comments · 15 min read · LW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers · 9 Feb 2024 18:40 UTC
6 points
12 comments · 3 min read · LW link