Status: I’m new to AI alignment. I went through the AGI fundamentals curriculum and am currently reading MIRI’s embedded agents series, and I feel confused. I’m not familiar with decision theory.
So far, each case discussed in the series (and AI failure modes in general) has felt solid on its own terms, as long as I follow the reasoning, including some proofs that I only half understand. But I don’t yet see the connection between the different cases, or any low-complexity structure behind them.
For example: there are many kinds of bridges and many ways bridges can collapse (failure modes), but they all boil down to the principles of structural mechanics, and the structural integrity of a bridge can be determined by a few physical measurements that are agnostic to the details of its design.
Can we boil down the decision-theoretic failure modes mentioned in the embedded agents series (and AI failure modes in general) to a more universal picture of predictive principles, as in the case of bridges and structural mechanics?
If such a state of understanding exists, what should I learn first to reach it? Specific reading recommendations are most appreciated!
If not (or this isn’t a good question to ask), why is that so?
Clearly I’m ignorant, but am I ignorant in a specific/common way, with a specific/common path out of it?
Thank you all.
edit: formatting & slight wording, ask for specific recommendations