It sounds like you are saying “In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we’ll be able to choose what they want (at least imperfectly, via the prompt) and we’ll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won’t be able to successfully plot against us.”
Basically, but my more central claim is that in literal current LLM agents, the scary part of the system that we don’t understand (the LLM) doesn’t generalize in any scary way due to wanting, while we can still get the overall system to achieve specific long-term outcomes in practice. And it’s at least plausible that this property will be preserved in the future.
I edited my earlier comment to hopefully make this clearer.
Anyhow, I think this is mostly just a misunderstanding of Nate’s and my position. It doesn’t contradict anything we’ve said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn’t plot against us or otherwise screw us over.)
Even if we didn’t have the visible thoughts property in the actual deployed system, the fact that all of the retargeting behavior is based on explicit human engineering is still relevant and contradicts the core claim Nate makes in this post IMO.