I agree we’re in better shape, all else equal, than evolution was, though not by enough that I think this is no longer a disaster. Even with all these advantages, it still seems like we don’t have control in a meaningful sense—i.e., we can’t precisely instill particular values, and we can’t tell what values we’ve instilled. Many of the points here don’t bear on this imo, e.g., it’s unclear to me that having tighter feedback loops around the ~same crude process makes the crude process any more precise. Likewise, adapting our methods, data, and hyperparameters in response to problems we encounter doesn’t seem like it will solve those problems, since the issues (e.g., proxy problems and unintended off-target effects) will persist. Imo, the bottom line is still that we’re blindly growing a superintelligence we don’t remotely understand, and I don’t see how these techniques shift the situation into one where we are in control of our future.
Much of my hope is that, by the time we reach a level of superintelligence where we need to instill reflectively endorsed values to optimize toward in a very hands-off way, rather than just constitutions, behaviors, or goals, we’ll have figured something else out. I’m not claiming the optimizer advantage alone is enough to be decisive in saving the world.
To the point about tighter feedback loops, I see the main benefit as coming in conjunction with adapting to new problems. Suppose we notice AIs taking some bad but non-world-ending action, like murdering people; we can then add to the training data a big dataset of situations in which AIs shouldn’t murder people. If we were instead breeding animals, we would have to wait dozens of generations for mutations that reduce the murder rate to appear and reach fixation. Since those mutations affect behavior via brain architecture, they would carry a higher chance of deleterious side effects. And if we were also selecting for intelligence, they would be competing against mutations that increase intelligence, producing a higher alignment tax. All of this means that, in the breeding case, we’d have fewer chances to detect whether our proxies hold up. (Capabilities researchers have many of these advantages too, but the AGI would be able to automate capabilities training anyway.)
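To make the contrast concrete, here’s a minimal sketch of the kind of loop I have in mind. Everything in it (the monitor logs, the finetune function, the field names) is a hypothetical illustration, not a real pipeline or API:

```python
# Minimal sketch of the "notice a problem, add targeted data, retrain" loop.
# All names here (monitor_logs, finetune, the field names) are hypothetical.

def collect_flagged_episodes(monitor_logs):
    """Pull out the episodes where a monitor flagged a disallowed action."""
    return [ep for ep in monitor_logs if ep["flagged"]]

def build_counterexamples(flagged_episodes):
    """Turn each flagged episode into a training example whose target is
    the desired (non-harmful) behavior in that situation."""
    return [
        {"prompt": ep["situation"], "target": ep["corrected_response"]}
        for ep in flagged_episodes
    ]

def training_iteration(model, base_dataset, monitor_logs, finetune):
    """One feedback cycle. The analogous cycle in selective breeding takes
    many generations and can't target a single behavior this directly."""
    counterexamples = build_counterexamples(collect_flagged_episodes(monitor_logs))
    return finetune(model, base_dataset + counterexamples)
```

The point of the sketch is just that each cycle is a deliberate, targeted edit to the training distribution, rather than waiting for the right mutations to show up and spread.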
If we expect problems to get worse at some rate until an accumulation of unsolved alignment issues culminates in disempowerment, it seems to me there is a large band of rates at which we could stay ahead of them with AI training but evolution couldn’t.
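To put that in toy terms (the symbols here are mine, not anything from the thread): suppose unsolved problems appear at rate $r$ per year, and each notice-retrain-redeploy cycle takes $\tau$ years and fixes roughly one problem. We stay ahead exactly when $1/\tau \ge r$. Since a training cycle is orders of magnitude shorter than a breeding cycle, any rate in the band

$$\frac{1}{\tau_{\text{evo}}} < r \le \frac{1}{\tau_{\text{AI}}}$$

is one that AI training can keep up with but evolution can’t, and the band is wide precisely because $\tau_{\text{AI}} \ll \tau_{\text{evo}}$.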