Chain-of-Thought Alignment

“Chain-of-thought” autonomous agentic wrappers around an LLM, such as AutoGPT around GPT-4, and similar Language Model Cognitive Architectures (LMCAs), also called Language Model Autonomous Agents (LMAAs) or Scaffolded LLMs, are a recent candidate approach to building an AGI.

They create, edit, and maintain a natural language context by recursively feeding parts of it into the LLM along with suitable prompts for activities like subtask planning, self-criticism, and memory summarization, generating a textual stream of consciousness, memories, and so on. They thus combine LLM neural nets with natural language symbolic thinking, more along the lines of GOFAI.
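The loop described above can be sketched roughly as follows. This is a minimal illustration, not the design of AutoGPT or any real system: the `AgentContext` class, the prompt wording, the `step` and `summarize` helpers, and the `echo_llm` stand-in are all hypothetical names introduced here, and a real wrapper would call an actual LLM API in place of the stub.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class AgentContext:
    goal: str
    thoughts: List[str] = field(default_factory=list)  # textual stream of consciousness
    memories: List[str] = field(default_factory=list)  # summarized long-term memory


def build_prompt(ctx: AgentContext, activity: str, window: int = 3) -> str:
    """Feed part of the context (memories plus recent thoughts) back to the LLM."""
    recent = "\n".join(ctx.thoughts[-window:])
    memory = "\n".join(ctx.memories)
    return (
        f"Goal: {ctx.goal}\n"
        f"Memories:\n{memory}\n"
        f"Recent thoughts:\n{recent}\n"
        f"Task: {activity}"
    )


def step(ctx: AgentContext, llm: LLM) -> None:
    """One iteration: plan a subtask, then criticize that plan."""
    plan = llm(build_prompt(ctx, "Plan the next subtask."))
    ctx.thoughts.append(f"PLAN: {plan}")
    critique = llm(build_prompt(ctx, "Criticize the latest plan."))
    ctx.thoughts.append(f"CRITIQUE: {critique}")


def summarize(ctx: AgentContext, llm: LLM, keep: int = 2) -> None:
    """Compress the thought stream into one memory, keeping the newest thoughts."""
    if len(ctx.thoughts) > keep:
        prompt = build_prompt(
            ctx, "Summarize the thoughts so far into one memory.",
            window=len(ctx.thoughts),
        )
        ctx.memories.append(llm(prompt))
        ctx.thoughts = ctx.thoughts[-keep:]


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the final (task) line of the prompt."""
    return "response to: " + prompt.splitlines()[-1]
```

Everything the agent "thinks" in this sketch lives in `ctx.thoughts` and `ctx.memories` as plain text, which is the property the rest of this topic relies on.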

Recent open-source examples are quite simple and not particularly capable, but it seems plausible that they could progress rapidly. They could make interpretability much easier than in pure neural net systems, since their ‘chain-of-thought’/‘stream of consciousness’ and ‘memories’ would be written in human natural language, and so would be interpretable and editable by a monitoring human or LLM-based monitoring system (modulo concerns about opaque natural language, or about detecting steganographic side-channels concealed in apparently innocent natural language). This topic discusses the alignment problem for systems combining such agentic wrappers with LLMs, if they are in fact capable of approaching or reaching AGI.
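As a rough illustration of what monitoring a natural language thought stream could look like, the sketch below splits a list of thoughts into accepted and flagged ones. The `Monitor` type, `keyword_monitor`, `review`, and the banned phrases are all hypothetical and chosen only for illustration. A keyword check is a deliberately crude stand-in for a human or LLM-based monitor, and it does nothing about the opaque-language and steganography concerns mentioned above.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: returns True when a thought looks acceptable.
Monitor = Callable[[str], bool]


def keyword_monitor(thought: str) -> bool:
    """Toy monitor: rejects thoughts containing illustrative banned phrases."""
    banned = ("disable oversight", "hide this from")
    lowered = thought.lower()
    return not any(phrase in lowered for phrase in banned)


def review(thoughts: List[str], monitor: Monitor) -> Tuple[List[str], List[str]]:
    """Split the thought stream into (accepted, flagged) lists, preserving order."""
    accepted: List[str] = []
    flagged: List[str] = []
    for thought in thoughts:
        (accepted if monitor(thought) else flagged).append(thought)
    return accepted, flagged
```

Flagged thoughts could then be withheld from the agent's context and passed to a human for review or editing. That is possible only because the thoughts are stored as readable text.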

See Also

Capabilities and alignment of LLM cognitive architectures

Language Agents Reduce the Risk of Existential Catastrophe

Alignment of AutoGPT agents

Scaffolded LLMs: Less Obvious Concerns

Externalized reasoning oversight: a research direction for language model alignment

On AutoGPT

Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

[Question] Should AutoGPT update us towards researching IDA?

Shane Legg interview on alignment

Steganography in Chain of Thought Reasoning

Language Models are a Potentially Safe Path to Human-Level AGI

Internal independent review for language model agent alignment

An explanation for every token: using an LLM to sample another LLM

We have promising alignment plans with low taxes

Unfaithful Explanations in Chain-of-Thought Prompting

Philosophical Cyborg (Part 2)...or, The Good Successor

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Distilled Representations Research Agenda

Paper: Large Language Models Can Self-improve [Linkpost]

[ASoT] Simulators show us behavioural properties by default

Imitation Learning from Language Feedback

The Translucent Thoughts Hypotheses and Their Implications

Simulators, constraints, and goal agnosticism: porbynotes vol. 1

CAIS-inspired approach towards safer and more interpretable AGIs

Shapley Value Attribution in Chain of Thought

Automating Consistency

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Creating a self-referential system prompt for GPT-4

Aligned AI via monitoring objectives in AutoGPT-like systems

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

