[Question] Is it correct to frame alignment as “programming a good philosophy of meaning”?
My model is that:
Alignment = the AI uses the correct model for interpreting its goals
Successfully designing this interpretation model leads to something akin to CEV
Errors under this correct model (e.g. mesa-optimisation) are unlikely, because high intelligence + a correct model of interpretation = correct extrapolation of goals
The choice of interpretation model seems almost unrelated to intelligence, since choosing one seems equivalent to choosing a philosophy of meaning.
a) Is my model accurate?
b) Any recommendations for reading that explores alignment from a similar angle (e.g. which philosophy of meaning is most likely to emerge in LLMs)? So far, Alex Flint’s posts come up the most. Which research agendas sound closest to this framing?
Thanks!