I don’t understand your comment but it seems vaguely related to what I said in §5.1.1.
Yeah, if we make the (dubious) assumption that all AIs at all times will have basically the same ontologies, same powers, and same ways of thinking about things as their human supervisors, every step of the way, with continuous re-alignment, then IMO that would definitely eliminate sharp-left-turn-type problems, at least the way that I define and understand such problems right now.
Of course, there can still be other (non-sharp-left-turn) problems, like maybe the technical alignment approach doesn’t work for unrelated reasons (e.g. 1,2), or maybe we die from coordination problems (e.g.), etc.
Modern ML systems use gradient descent with tight feedback loops and minimal slack
I’m confused; I don’t know what you mean by this. Let’s be concrete. Would you describe GPT-o1 as “using gradient descent with tight feedback loops and minimal slack”? What about AlphaZero? What precisely would control the “feedback loop” and “slack” in those two cases?
Thank you for being patient with me; I tend to live in my own head a bit with these things :/ Let me know if this explanation is clearer, using the examples you gave:
Let me build on the discussion about optimization slack and sharp left turns by exploring a concrete example that illustrates the key dynamics at play.
Think about the difference between TD-learning and Monte Carlo methods in reinforcement learning. In TD-learning, we update our value estimates frequently based on small temporal differences between successive states. The “slack”—how far we let the system explore/optimize between validation checks—is quite tight. In contrast, Monte Carlo methods wait until the end of an episode to make updates, allowing much more slack in the intermediate steps.
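To make that contrast concrete, here is a minimal sketch (my own toy code, not from the discussion above): tabular TD(0) and Monte Carlo value updates applied to one recorded episode, where `trajectory` is assumed to be a list of (state, reward, next_state) tuples ending at a terminal state with value 0. In practice TD(0) applies its corrections online as the episode unfolds, which is exactly the "tight slack" point; the code just shows where each method's corrections happen.

```python
# Minimal sketch (my own toy code, not from the comment): tabular TD(0) vs.
# Monte Carlo value updates on one recorded episode.
GAMMA = 1.0   # discount factor
ALPHA = 0.1   # learning rate

def td0_update(V, trajectory):
    # TD(0): the value table is corrected after every single transition,
    # so the "slack" between acting and course-correcting is one step.
    for state, reward, next_state in trajectory:
        v_s = V.get(state, 0.0)
        target = reward + GAMMA * V.get(next_state, 0.0)
        V[state] = v_s + ALPHA * (target - v_s)
    return V

def monte_carlo_update(V, trajectory):
    # Monte Carlo: no correction until the episode ends; only then is the
    # full return G propagated back, so estimates sit uncorrected for the
    # entire episode length.
    G = 0.0
    for state, reward, _ in reversed(trajectory):
        G = reward + GAMMA * G
        v_s = V.get(state, 0.0)
        V[state] = v_s + ALPHA * (G - v_s)
    return V
```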
This difference provides insight into the sharp left turn problem. When we allow more slack between optimization steps (like in Monte Carlo methods), the system has more freedom to drift from its original utility function before course correction. The divergence compounds particularly when we have nested optimization processes—imagine a base model with significant slack that then has additional optimization layers built on top, each with their own slack. The total divergence potential multiplies.
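As a back-of-the-envelope illustration of that multiplication (my own toy framing with made-up numbers, not something from the original comment): if each nested layer can take some number of uncorrected steps before its own check, the worst-case number of steps a change can propagate before anything catches it is roughly the product of the per-layer slacks.

```python
# Toy illustration (my numbers and framing): treat each optimization layer's
# "slack" as the number of steps it runs between checks. Nesting layers
# multiplies the worst-case uncorrected horizon.
def worst_case_uncorrected_steps(slack_per_layer):
    total = 1
    for slack in slack_per_layer:
        total *= slack
    return total

print(worst_case_uncorrected_steps([1]))         # 1: every step is checked
print(worst_case_uncorrected_steps([1000]))      # a single loose outer loop
print(worst_case_uncorrected_steps([1000, 50]))  # 50,000: nested slack compounds
```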
This connects directly to your point about GPT-style models versus AlphaZero. While the gradient descent itself may have tight feedback loops, the higher-level optimization occurring through prompt engineering or fine-tuning introduces additional slack. It’s similar to how cultural evolution, with its long periods between meaningful corrections, allowed for the emergence of inner optimizers that could significantly diverge from the original selection pressures.
I’m still working to formalize precisely what mathematical structure best captures this notion of slack—whether it’s best understood through the lens of utility boundaries, free energy, or some other framework.