This has a lot of overlap with my recent post "LLM AGI may reason about its goals and discover misalignments by default," and with the follow-up post I'm working on now, which further explores whether we should train LLMs on reasoning about their goals. As you discuss, prompting them to reason extensively about their goals during training has the effect of revealing potential future misalignments to them.
I’m curious what you think about that framing.