Hmm, I don’t entirely see your point. That’s a good proxy, but it’s still a proxy. The arguments in this post indicate that evolution could get desires for surviving grandchildren into us, and I think it has done that to some degree.
Edit for clarity: And by “surviving grandchildren” I mean that only up to what humans could detect in the ancestral environment. E.g. children adopted as babies, or genetically engineered ones, would still count, I presume.
Perhaps to some degree… but basically no. Introspecting, I perhaps feel nicer picturing a couple of grandchildren than I do picturing zero, but I don’t feel nicer picturing 100 grandchildren. 100 feels worse, for lots of reasons. One of them is my values related to calm and comfort. My desire to read books does far more to motivate my behaviour, on a day-to-day basis, than any thought of children or grandchildren.
I do have a bunch of child-related proxies in my motivations: particularly around watching children be confused and then overcome that confusion, or the cute way they misunderstand how things work. There are also some parenting proxies in there, around being loved and respected by people who I’m responsible for.
So I have all of these motivations. But they are a collection of non-robust proxies. I claim that if you replaced all these motivations with a counter of grandchildren, this version of me would have higher fitness (especially in my current environment, but probably also in the ancestral environment).
I also have motivations around comfort, challenges, competition, curiosity and status, and these almost always swamp the ones that are more directly related to children. I would have more grandchildren if my ultimate motivations weren’t dominated by these other things.
So, I claim that knowledge of IGF isn’t the important difference that stopped evolution from making robustly inner aligned humans. It’s kinda close, but not quite it. If it were the important difference, we would have evolved to directly plan for having lots of grandchildren (because it’s a far better approximation of IGF than the collection of proxies I talked about above, and wasn’t unknown to our ancestors).
You’ve said:
What this means is: there is no compact way for evolution to point to the “this human specimen values genetic fitness”-knob on the genome and turn it up.
But this is not the case for modern AIs. They have some understanding of what kindness is, and there should in principle be a way to locate that inside the LLM and turn up the knob.
You’re saying that the understanding of a concept means that there is a “knob” on the genome that allows evolution/training to “turn up” that concept as a motivation. I’m claiming that this is false (as demonstrated by the fact that evolution didn’t turn up the grandchild knob, and turn down all the far weaker proxies).
So the natural next thing to think about is “what did stop evolution from turning up the grandchild count knob?”. Obvious guesses are:
Compute limitations: It wasn’t worth it for ancestral humans to spend thinking time working out e.g. that cultivating good friendships within their tribe will increase their E(grandchildren).
Intelligence limitations: Perhaps intelligence levels just weren’t up to that kind of long term planning.
Evolution path dependence: Often evolution does suboptimal things because it’s restricted to adapting existing machinery; perhaps that applied here.
Genome parameter space: The parameter space of evolution (genome) is related in a very complicated way to the goals of a human. Perhaps this mapping makes some kinds of goal adjustments impossible or very unlikely to be found by exploring local genome-space. Not everything has a turnable knob.
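To make the “not everything has a turnable knob” point concrete, here is a toy sketch of my own (nothing here is from the post, and the two encodings are hypothetical stand-ins, not models of real genomes): the same target can be trivially reachable or practically unreachable for local search, depending on how the search space is parameterised.

```python
# Toy illustration: local search over a "genome" under two parameterisations.
# Under the direct encoding every single-gene mutation nudges the phenotype a
# little, so greedy search walks to the target. Under the entangled encoding a
# single-gene change scrambles the phenotype, so there is no local knob to turn.
import hashlib
import random

random.seed(0)
TARGET = 200  # the "goal setting" we would like the search to reach

def fitness(phenotype):
    return -abs(phenotype - TARGET)

def decode_direct(genome):
    # Direct knob: phenotype is just the sum of the genes.
    return sum(genome)

def decode_entangled(genome):
    # Entangled mapping (a stand-in for "genome -> developed brain -> goals"):
    # hash the genome, so any single-gene change yields an unrelated phenotype.
    digest = hashlib.sha256(bytes(genome)).digest()
    return int.from_bytes(digest[:4], "big") % 1_000_003

def hill_climb(decode, steps=2000, n_genes=64):
    genome = [random.randint(0, 7) for _ in range(n_genes)]
    best = fitness(decode(genome))
    for _ in range(steps):
        i = random.randrange(n_genes)
        old = genome[i]
        genome[i] = random.randint(0, 7)   # mutate one gene
        new = fitness(decode(genome))
        if new >= best:
            best = new                     # keep neutral/beneficial mutations
        else:
            genome[i] = old                # revert harmful ones
    return best

print("direct knob:  ", hill_climb(decode_direct))     # reaches (or nearly reaches) 0
print("entangled map:", hill_climb(decode_entangled))  # typically still hundreds short
```

The fitness function is identical in both runs; only the genome-to-phenotype mapping differs, which is the sense in which some goal adjustments may have no locally turnable knob.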
So if we translate these issues back to the problem of aligning ML systems:
Compute limitations: These usually also apply to ML. Agents tend to have limited compute for planning, and lots of more immediately useful ways to use planning time.
Intelligence limitations: Definitely applies currently. Maybe less so in future.
Training path dependence: Not super clear, but it wouldn’t surprise me if training often works by adapting previously useful machinery.
Parameter space: Far fewer levels of indirection between parameters and “understood concepts” than between the genome and human thoughts and planning. Presumably “understanding and applying a concept” is something that happens in activation space, so there is one level of indirection (weights → activations); a toy sketch of what that might look like is below, after the next paragraph. With CoT, maybe there’s two. There’s more indirection if more thinking and learning happens online, so if we expect more online thinking and learning in future AI, this issue gets worse.
But separately from the level of indirection, there’s very likely some kinds of changes that are much “easier” for the search to find than others, because of how the search space is parameterised.
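To gesture at what the “one level of indirection (weights → activations)” picture could look like, here is a minimal sketch, assuming the kind of activation-steering setup people sometimes use: a “concept knob” is taken to be a direction in activation space (estimated from contrastive activations), and turning it up means adding a scaled copy of that direction at inference time. Everything here (the tiny linear layer, the random stand-in activations, the names `concept_direction` and `alpha`) is a hypothetical placeholder, not a claim about how real models represent kindness or how the post’s authors would operationalise the knob.

```python
# Minimal sketch of a "concept knob" living in activation space, one level of
# indirection away from the weights. All tensors are random placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 16
block = nn.Linear(d_model, d_model)  # stand-in for one transformer sub-layer

# Pretend these are residual-stream activations collected from contrastive
# prompts (e.g. completions that do / don't express the concept).
acts_with_concept = torch.randn(32, d_model) + 1.0
acts_without_concept = torch.randn(32, d_model)

# The "knob": a single direction in activation space (difference of means).
concept_direction = acts_with_concept.mean(0) - acts_without_concept.mean(0)
concept_direction = concept_direction / concept_direction.norm()

alpha = 3.0  # how far the knob is turned up

def turn_up_knob(module, inputs, output):
    # We edit activations, not the weights themselves: one level of indirection.
    return output + alpha * concept_direction

handle = block.register_forward_hook(turn_up_knob)

x = torch.randn(4, d_model)
steered = block(x)      # activations nudged along the concept direction
handle.remove()
plain = block(x)

print((steered - plain).mean(0))  # equals alpha * concept_direction
```

Whether training (rather than a hand-written hook) actually finds and turns such a knob is exactly the parameterisation question above.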
So I expect these effects would add up to RLHF creating agents that have a set of non-robust proxies around your intended target, analogous (though perhaps slightly less so) to how I have a set of non-robust proxies for IGF that motivate me.
Okay, the “genetic knob” is maybe the right language. What I meant is that for evolution to be able to inner-align humans to IGF, you’d need:
1. Humans occasionally wanting IGF, and using that to inform plans
2. Humans carrying out those plans
3. This being a successful strategy (e.g. humans planning for IGF didn’t just overthink stuff and die)
4. This being “accessible to the genome” in some way, e.g. genetic knobs you could turn up and down that increase people’s propensity for (1), (2) and (3).
I’m saying (1) was not present, so (1), (2) and (3) together were clearly not present.
It’s possible a proxy like seeing surviving grandkids was present, but in that case (2) and (3) were not present.
In that case, my theory is consistent with the evidence, but doesn’t necessarily explain it better than other theories. That’s fine.
Wrt your “what actually caused it”:
Stuff like compute or intelligence limitations are subcomponents of (2) and (3).
Path dependence: this is what the whole post is about, in some sense. Or: the brain is conservative. Similarly, I think ML models are conservative, in the sense that if you do “light” SFT/RL, models will find the explanation of the samples you gave that fits with the base-model prior, and boost the underlying circuitry (a toy sketch of one mechanism that pushes in this conservative direction is below).
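For the RL half of that claim, there is at least one standard mechanism that makes training conservative by construction: many RLHF-style objectives include a KL penalty to the frozen base model, which explicitly pushes the tuned policy to stay near the base-model prior. A minimal sketch of that objective with placeholder numbers (nothing below is from the post, and it says nothing about the SFT case or about circuitry):

```python
# Sketch of a KL-regularised RLHF-style objective. The larger beta is (or the
# lighter the training), the more the optimum stays near whatever behaviour the
# base model already assigns probability to. All values are placeholders.
import torch

torch.manual_seed(0)

# Per-token log-probs of one sampled completion under the tuned policy and the
# frozen base (reference) model.
logp_policy = torch.randn(10) - 2.0
logp_base = torch.randn(10) - 2.0

reward = torch.tensor(1.3)  # scalar reward for the completion (placeholder)
beta = 0.1                  # KL penalty coefficient

# Objective per completion: reward minus beta times an approximate KL to the
# base model (estimated as the summed log-ratio along the sampled tokens).
approx_kl = (logp_policy - logp_base).sum()
objective = reward - beta * approx_kl

print(float(objective))
```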
Does this make sense?