I think simplicity/generality priors effectively have zero effect on whether a model is pushed towards or away from human values, and are IMO kind of orthogonal to the alignment-relevant questions.
I’d be curious how you would describe the core problem of alignment.
I’d split it into inner alignment, which is how we manage to instill any goal/value at all, ideally one that is at least somewhat stable, and outer alignment, which is selecting a goal that is resistant to Goodharting.
Let’s focus on inner alignment. By instill you presumably mean train. What values get trained is ultimately a learning problem, which in many cases (as long as the distribution over trained parameters can be approximated by a Boltzmann distribution) comes down to a simplicity-accuracy tradeoff.
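To make that tradeoff concrete, here's a minimal sketch, assuming the distribution over trained parameters $\theta$ takes a Boltzmann form with a complexity-penalising prior; the loss $L$, complexity term $C$, and temperature $T$ are just illustrative notation I'm introducing here:

$$p(\theta \mid D) \;\propto\; \underbrace{\exp\!\big(-L(\theta, D)/T\big)}_{\text{accuracy}} \;\cdot\; \underbrace{\exp\!\big(-C(\theta)\big)}_{\text{simplicity prior}}$$

$$\log p(\theta \mid D) \;=\; -\tfrac{1}{T}\,L(\theta, D) \;-\; C(\theta) \;+\; \text{const}$$

Fitting the data (driving $L$ down) and staying simple (keeping $C$ small) pull on the same log-posterior, which is the simplicity-accuracy tradeoff I mean.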