That might work in a tiny world model with only two possible hypotheses. In a high-dimensional world model with exponentially many hypotheses, the weight on happy humans would be exponentially small.
Wouldn’t there also be exponentially many variants of the “happy humans” hypothesis? What we really care about is the total probability assigned to all hypotheses whose fulfillment leads to human happiness. Once you’ve trained on videos of happy humans, I think there’s plausibly enough probability mass on happy-humans hypotheses that the AI will actually cause a fair amount of happiness.
There would, so long as the extra dimensions are irrelevant to happiness. If the extra dimensions are relevant, then the total hypothesis space grows much faster than the happy subspace. Even having lots of irrelevant dimensions can be risky, because it makes the training data sparser in the space being modelled, which makes superexponentially many more alternative hypotheses viable.
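To make the counting concrete, here is a toy sketch in Python. Everything in it is an assumption added for illustration and not part of the original argument: a uniform prior over hypotheses, discrete dimensions with `values_per_dim` settings each, a "happy" region that pins each relevant dimension to `happy_values_per_dim` of those settings, and the hypothetical function names `happy_fraction` and `consistent_labelings`. The second function is one possible reading of the "superexponentially" remark: it counts the binary labelings of a `dims`-dimensional binary space that agree with a fixed set of distinct training points.

```python
from fractions import Fraction


def happy_fraction(relevant_dims, irrelevant_dims,
                   values_per_dim=2, happy_values_per_dim=1):
    """Share of a uniform prior that lands on 'happy' hypotheses (toy model)."""
    # Every dimension, relevant or not, multiplies the total space.
    total = values_per_dim ** (relevant_dims + irrelevant_dims)
    # Irrelevant dimensions multiply the happy region by the same factor;
    # relevant dimensions only contribute their happy settings.
    happy = (happy_values_per_dim ** relevant_dims) * (values_per_dim ** irrelevant_dims)
    return Fraction(happy, total)


def consistent_labelings(dims, n_training_points):
    """Binary labelings of a dims-dimensional binary space that fit the data.

    Assumes the training points are distinct and n_training_points <= 2**dims.
    """
    unconstrained_points = 2 ** dims - n_training_points
    return 2 ** unconstrained_points


if __name__ == "__main__":
    for relevant, irrelevant in [(1, 0), (1, 10), (10, 0), (10, 10)]:
        print(relevant, irrelevant, happy_fraction(relevant, irrelevant))
    for dims in (3, 4, 5):
        print(dims, consistent_labelings(dims, n_training_points=4))
```

Under these assumptions, adding irrelevant dimensions leaves the happy share unchanged, while each relevant dimension shrinks it by a constant factor. With the amount of training data held fixed, the number of labelings that still fit the data grows doubly exponentially in the number of dimensions.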