I’ve written about this in Capabilities and alignment of LLM cognitive architectures. I didn’t go into quite as much depth on alignment, but my conclusions are essentially identical to yours: there are ways this could go wrong, but it has a lot of promise. I think this approach has large upsides from working primarily in natural language, and that all other approaches have the same downsides and more.
This has also been addressed in Fabien Roger’s The Translucent Thoughts Hypotheses and Their Implications.
I’m working on another post that goes into more depth on the alignment implications, but the conclusions remain the same so far.
Thanks for the response, it’s useful to hear that we came to the same conclusions. I quoted your post in the first paragraph.
Thanks for bringing Fabien’s post to my attention! I’ll reference it.
Looking forward to your upcoming post.
Oops, I hadn’t clicked those links, so I didn’t notice they were to my posts!
You’ve probably found this already, since it’s the one tag on your post: the chain of thought alignment tag links to some other related work.
There’s a new one up today that I haven’t finished processing.