Here’s an argument for why current alignment methods like RLHF are already much better than what evolution can do.
Evolution has to encode information about the human brain’s reward function using just ~1 GB of genetic information, which means it might be relying on a lot of simple heuristics that don’t generalize well, like “sweet foods are good”.
In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is ~100 GB of information at 4 bytes per parameter. The capacity of these reward models to encode complex human values may therefore already be much larger than that of the human genome (by ~2 orders of magnitude), and this advantage will probably increase in the future as models get larger.
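The arithmetic behind the ~2 orders of magnitude figure can be checked with a rough sketch. The genome length (~3.1 billion base pairs at 2 bits each) and the 32-bit parameter width are my assumptions, not figures from the cited paper, and all the names below are just illustrative.

```python
import math

# Assumptions (not from the cited paper):
BASE_PAIRS = 3.1e9    # approximate length of the haploid human genome
BITS_PER_BASE = 2     # four possible bases -> log2(4) = 2 bits each
BYTES_PER_PARAM = 4   # 32-bit floats; 16-bit weights would halve the model size

PARAMS = 25e9         # reward model parameter count quoted above


def genome_size_gb(base_pairs: float = BASE_PAIRS) -> float:
    """Raw storage needed for the genome, in gigabytes (1 GB = 1e9 bytes)."""
    return base_pairs * BITS_PER_BASE / 8 / 1e9


def model_size_gb(params: float = PARAMS, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Raw storage needed for the model weights, in gigabytes."""
    return params * bytes_per_param / 1e9


genome_gb = genome_size_gb()
model_gb = model_size_gb()
ratio = model_gb / genome_gb
orders_of_magnitude = math.log10(ratio)

print(f"genome: {genome_gb:.3f} GB, reward model: {model_gb:.0f} GB")
print(f"ratio: {ratio:.0f}x (~{orders_of_magnitude:.1f} orders of magnitude)")
```

Note that this only compares raw storage, which is at best an upper bound on how much information about values either one can hold. With 16-bit weights the ratio drops by half but stays close to 2 orders of magnitude.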
[1] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback