I’ve written about this in Capabilities and alignment of LLM cognitive architectures. I didn’t go into quite as much depth on alignment, but my conclusions are essentially identical to yours: there are ways this could go wrong, but it has a lot of promise. I think this approach has large upsides from working primarily in natural language, and that all other approaches have the same downsides and more.
This has also been addressed in Fabien Roger’s The Translucent Thoughts Hypotheses and Their Implications.
I’m working on another post that goes into more depth on the alignment implications, but the conclusions remain the same so far.
Thanks for the response, it’s useful to hear that we came to the same conclusions. I quoted your post in the first paragraph.
Thanks for bringing Fabien’s post to my attention! I’ll reference it.
Looking forward to your upcoming post.
Oops, I hadn’t clicked those links, so I didn’t notice they were to my posts!
You’ve probably found this already, since it’s the one tag on your post: the chain of thought alignment tag links to some other related work.
There’s a new one up today that I haven’t finished processing.