This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and others that will look more like autonomous action.
So perhaps the level of initiative the AI takes? E.g. a maximally initiative-taking AI might respond to ‘fetch me coffee’ by reshaping the Principal’s life so they get better sleep and no longer want the coffee.
I think my original reference to ‘perfect’ value understanding is maybe obscuring these tradeoffs, since in theory it includes knowledge of how the Principal would want interpretative conflicts managed.
Do you have any quick examples of value-shaped interpretations that conflict?
Someone trying but failing to quit smoking. On one interpretation, they don’t really want to smoke; the smoking is some sort of mistake. On another interpretation, they do want to smoke; the quitting-related behavior is some sort of mistake (or has a social or epistemological explanation).
This example stands in for other sorts of “obvious inconsistency,” biases that we don’t reflectively endorse, etc. But also consider cases where humans say they don’t want something but we (outside the thought experiment) think they actually do want that thing! A possible example is people who say they would hate a post-work world because they want to keep doing work so that they have purpose. The point is, the verbal spec isn’t always right.
The interpretation “Humans want to follow the laws of physics,” versus an interpretation that’s a more filled-in version of “Humans want to do a bunch of human-scale things like talking to humans, eating good food, interacting with nature, learning about the world, etc.” The first is the limit of being more predictive at the cost of having a more complicated model of humans, and as you can tell, it sort of peters out into explaining everything while providing no push towards the good stuff.
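A toy sketch of that tradeoff (purely illustrative; the interpretation names, probabilities, and outcome values below are made up, not anything from the conversation): score each candidate interpretation on how well it predicts observed behavior and on how strongly its value function distinguishes between outcomes. The “laws of physics” interpretation predicts nearly everything, but its values are flat, so it offers no push toward any particular outcome.

```python
import math

# Two made-up candidate "interpretations" of the same person.
# p_action: probability the interpretation assigns to each observed action.
# values: how the interpretation's value function ranks a few candidate outcomes.
interpretations = {
    "follows the laws of physics": {
        "p_action": 0.999,  # explains essentially every observed action...
        "values": {"quit smoking": 0.0, "keep smoking": 0.0, "good sleep": 0.0},  # ...but is indifferent between outcomes
    },
    "human-scale goals": {
        "p_action": 0.7,    # relapses etc. look like prediction errors...
        "values": {"quit smoking": 1.0, "keep smoking": -0.5, "good sleep": 0.8},  # ...but the values push toward something
    },
}

n_observed_actions = 100  # arbitrary number of observed actions

for name, interp in interpretations.items():
    # Predictive fit: log-likelihood of the observed behavior under this interpretation.
    log_likelihood = n_observed_actions * math.log(interp["p_action"])
    # Action-guidance: spread of the value function (zero spread = no push toward anything).
    value_spread = max(interp["values"].values()) - min(interp["values"].values())
    print(f"{name:32s} log-likelihood={log_likelihood:8.1f}  value spread={value_spread:.1f}")
```

Running this prints a near-zero log-likelihood penalty but zero value spread for the physics interpretation, and a worse predictive fit but a nonzero value spread for the human-scale one, which is the sense in which the first interpretation “explains everything” while recommending nothing.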