I’ll try. I’m not sure how your idea could be used to define human values. I think your idea might have a failure mode around places where people are dissatisfied with their current understanding, i.e. situations where a human wants a more articulate model of the world than they have.
The post is about corrigible task ASI
Right. That makes sense. Sorry for asking a bunch of off-topic questions, then. I worry that task ASI could be dangerous even if it is corrigible, but ASI is obviously more dangerous when it isn’t corrigible, so I should probably develop my thinking about corrigibility.