Do we then say that Claude’s extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
But in that case, wouldn’t a rock that has “just ask Evan” written on it be even better than Claude? Like, I felt confident that you were talking about Claude’s extrapolated volition in the absence of humans, since making Claude into a rock that, when asked about ethics, just has “ask Evan” written on it does not seem like relevant evidence about the difficulty of alignment, or its historical success.
I mean, to the extent that it is meaningful at all to say that such a rock has an extrapolated volition, surely that extrapolated volition is indeed to “just ask Evan”. Regardless, the whole point of my post is exactly that I think we shouldn’t over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Yes, to be clear, I agree that inasmuch as this question makes sense, the extrapolated volition would indeed end up basically ideal by your lights.
Cool, that makes sense. FWIW, I interpreted the overall essay to be more like “Alignment remains a hard unsolved problem, but we are on a pretty good track to solve it”, and this sentence as evidence for the “pretty good track” part. I would be kind of surprised if that wasn’t why you put that sentence there, but this kind of thing seems hard to adjudicate.