Recently I’ve seen a bunch of high-status people using “inner alignment” in the more general sense, so I’m starting to think it might be too late to stick to the narrow definition. E.g. this post.
Any AI that does a task well is in some sense optimizing for good performance on that task.
I disagree with this. To me there are two distinct approaches: one is to memorize which actions did well in similar training situations, and the other is to predict the consequences of each action and somehow rank those consequences.
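To make the distinction concrete, here's a minimal toy sketch in Python (my own illustration, not from the post; the thermostat environment and all function names are hypothetical). The first function acts from a memorized mapping of situation to action that did well in training; the second predicts each action's consequence with a world model and ranks the consequences.

```python
from typing import Callable, Dict, List, Tuple

# Approach 1: act from a memorized policy -- a lookup of which action did well
# in the most similar training situation.
def act_from_memory(situation: str, memorized_policy: Dict[str, str], default: str) -> str:
    return memorized_policy.get(situation, default)

# Approach 2: predict the consequence of each candidate action with a model,
# then pick the action whose predicted consequence ranks highest.
def act_by_planning(
    situation: str,
    actions: List[str],
    predict_consequence: Callable[[str, str], str],  # world model (assumed given)
    score_consequence: Callable[[str], float],       # ranking over consequences (assumed given)
) -> str:
    scored: List[Tuple[float, str]] = [
        (score_consequence(predict_consequence(situation, a)), a) for a in actions
    ]
    return max(scored)[1]

if __name__ == "__main__":
    # Toy one-step "thermostat" world.
    memorized = {"cold": "heat_on", "hot": "heat_off"}
    print(act_from_memory("cold", memorized, default="heat_off"))  # -> "heat_on"

    transitions = {("cold", "heat_on"): "comfortable", ("cold", "heat_off"): "cold",
                   ("hot", "heat_on"): "hot", ("hot", "heat_off"): "comfortable"}
    scores = {"comfortable": 1.0, "cold": 0.0, "hot": 0.0}
    print(act_by_planning("cold", ["heat_on", "heat_off"],
                          lambda s, a: transitions[(s, a)],
                          lambda c: scores[c]))  # -> "heat_on"
```

Both return the same action here, but only the second one is "optimizing for good performance on the task" in the consequentialist sense; the first is just replaying what worked before.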
for a sufficiently complex training objective, even a very powerful agent-y AI will have a “fuzzy” goal that isn’t an exact specification of what it should do (for example, humans don’t have clearly defined objectives that they consistently pursue).
I disagree with this, but I can’t put my reasons into clear words yet; I’ll think more about it. It doesn’t seem true for model-based RL, unless we explicitly build in uncertainty over goals. I think it’s only true for humans because of value-loading-from-culture reasons.
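To pin down what I mean by “explicitly build in uncertainty over goals”, here's a hypothetical sketch (again my own illustration, not from the post): a model-based planner whose objective is an expectation over several candidate reward functions rather than a single exact one. Absent something like this, a model-based RL agent with a fixed reward function has an exact goal specification, not a fuzzy one.

```python
from typing import Callable, List, Tuple

def plan_with_goal_uncertainty(
    situation: str,
    actions: List[str],
    predict_consequence: Callable[[str, str], str],                 # world model (assumed given)
    candidate_rewards: List[Tuple[float, Callable[[str], float]]],  # (probability, reward fn) pairs
) -> str:
    # Rank each action by the expected reward of its predicted consequence,
    # taken over the agent's current distribution over possible goals.
    def expected_reward(consequence: str) -> float:
        return sum(p * r(consequence) for p, r in candidate_rewards)
    return max(actions, key=lambda a: expected_reward(predict_consequence(situation, a)))

if __name__ == "__main__":
    # Two candidate goals: 60% "reach A", 40% "reach B".
    transitions = {("start", "left"): "A", ("start", "right"): "B"}
    goals = [(0.6, lambda c: 1.0 if c == "A" else 0.0),
             (0.4, lambda c: 1.0 if c == "B" else 0.0)]
    print(plan_with_goal_uncertainty("start", ["left", "right"],
                                     lambda s, a: transitions[(s, a)], goals))  # -> "left"
```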