AI Safety Subprojects

These are AI safety subprojects designed to elucidate "model splintering" and "learning the preferences of irrational agents".

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

AI learns betrayal and how to avoid it

Force neural nets to use models, then detect these

Preferences from (real and hypothetical) psychology papers

Finding the multiple ground truths of CoinRun and image classification