Stephen McAleese comments on Stephen McAleese’s Shortform

Stephen McAleese 8 Dec 2024 10:42 UTC
10 points
Here’s an argument for why current alignment methods like RLHF are already much better than what evolution can do.
Evolution has to encode information about the human brain’s reward function using only about 1 GB of genetic information, which means it might be relying on a lot of simple heuristics that don’t generalize well, such as “sweet foods are good”.
In contrast, RLHF reward models are built from LLMs with around 25B^[1] parameters, which is ~100 GB of information. The capacity of these reward models to encode complex human values may therefore already be much larger than that of the human genome (by ~2 orders of magnitude), and this advantage will probably increase in the future as models get larger.
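The arithmetic behind that comparison can be sketched in a few lines. This is only a back-of-the-envelope check: the 1 GB and 25B figures come from the text above, while the 4 bytes per parameter (32-bit floats) is my assumption for how 25B parameters becomes ~100 GB, and the function names are made up for illustration.

```python
import math

GB = 1e9  # decimal gigabyte, in bytes


def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Raw storage needed for a model's weights, in bytes."""
    return n_params * bytes_per_param


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Base-10 log of the ratio between two quantities."""
    return math.log10(larger / smaller)


# Figure taken from the text: ~1 GB of genetic information.
genome_bytes = 1 * GB

# 25B parameters is from the text; 4 bytes per parameter (fp32) is an
# assumption made here to reproduce the ~100 GB figure.
reward_model_bytes = weight_bytes(25e9, bytes_per_param=4)

gap = orders_of_magnitude(reward_model_bytes, genome_bytes)

print(f"reward model: {reward_model_bytes / GB:.0f} GB")
print(f"genome:       {genome_bytes / GB:.0f} GB")
print(f"gap:          {gap:.1f} orders of magnitude")
```

If the weights were stored in 16-bit precision instead, the same calculation would give ~50 GB and a gap of about 1.7 orders of magnitude, so the rough conclusion does not depend much on that assumption. Note that this compares raw storage only, not how efficiently either system uses it.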
1. ^
  Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback