it seems to me that we want to verify some sort of temperature convergence: no ai should get way ahead of everyone else at self-improving. everyone should get the chance to self-improve more or less together, with the positive externalities of each person's self-improvement amplified, and the negative ones absorbed nearby and undone as best the universe permits. and it seems to me that for humanity's children to be able to prevent anyone from self-improving far faster than everyone else at the cost of others' lives, they need a significant degree of interpretability, so that we can verify things about their brains, in particular that they can improve morality at least as well as we do. if we can establish a basic, fundamental statement that their convergence toward morality is acceptable, and that they will help us end the critical risk period by gently slowing everything down to a manageable pace (including the damage to our world, and the very many unwanted deaths the world currently experiences), then we can relax about getting the hell out into the lightcone. but to do that, we need to gently keep ourselves from turning into an outward singularity before we've grown up enough to do it all in sync, or what have you. and to prevent that, it again seems to me that interpretability is needed, so that we can apply the slight refinement of the refinements to formal verification that miri is presumably almost done with, since they've been working on it for so long.
I really need a supervisor or advisor or what have you, personally, but if I were going to suggest directions to folks: I want to do or see experiments with small, fully interpretable mcts-aided learned planning agents in simulated social environments with other ais and no outside training data whatsoever, and then see how far that can be turned up. despite the cultural overhang, I think a strongly safe RL-from-scratch algorithm could be verifiably safe no matter what environment it's spawned in, and a major step on the way would be being able to interpret what the RL is doing as it gains capability. it seems to me that such an agent needs to be good at making friends "for real" and at building large coprotection networks throughout all nearby variables of any kind, and to me this looks suspiciously like some sort of information objective. MIMI seems like an interesting step on that subpath, though of course it only works on small problems at the moment because of training-data availability. there's been a bunch of interesting research about agents playing in groups as well; I think some of it is from deepmind, and I remember seeing it on the simons institute's youtube channel. (links later.) a rough sketch of the kind of experiment I mean follows.
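to make that concrete, here's a minimal runnable sketch of what I have in mind. everything in it is an illustrative stand-in of my own choosing, not anyone's existing codebase: a tiny deterministic two-agent line-world, a scripted partner in place of a second learned agent, a vanilla UCT planner in place of a learned one, and an exact k-step empowerment bonus as a placeholder for the information objective (with deterministic dynamics, k-step empowerment reduces to log2 of the number of distinct states reachable by k-step action sequences, so it can be computed exactly here). this is not MIMI and not the deepmind group-play work; it's just the smallest version of "mcts-aided planning agent in a social environment with an information-ish objective" that I can write down.

```python
# minimal, hand-inspectable sketch: vanilla UCT planning for one agent in a
# tiny deterministic two-agent line-world, with an exact k-step empowerment
# bonus standing in for "some sort of information objective". every name and
# constant here is an illustrative choice, not an existing benchmark.
import math
import random

SIZE = 9             # positions 0..8 on a line
ACTIONS = (-1, 0, 1)
HORIZON = 8          # planning / rollout depth
K = 3                # empowerment lookahead

def step(state, move):
    """deterministic dynamics: our agent moves by `move`; the partner is
    scripted (a stand-in for a second learned agent) and drifts toward us."""
    a, b = state
    a = min(SIZE - 1, max(0, a + move))
    b = min(SIZE - 1, max(0, b + (1 if b < a else -1 if b > a else 0)))
    return (a, b)

def empowerment(state, k=K):
    """with deterministic dynamics, k-step empowerment is exactly log2 of
    the number of distinct states reachable by k-step action sequences."""
    frontier = {state}
    for _ in range(k):
        frontier = {step(s, m) for s in frontier for m in ACTIONS}
    return math.log2(len(frontier))

def reward(state):
    """proximity to the partner ("making friends") plus the information bonus."""
    a, b = state
    return -abs(a - b) + 0.5 * empowerment(state)

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # sum of sampled returns through this node

def uct_select(node, c=1.4):
    """standard UCB1 over fully expanded children."""
    return max(node.children.values(),
               key=lambda ch: ch.value / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(state, depth):
    """random playout used to value a freshly expanded node."""
    total = 0.0
    for _ in range(depth):
        state = step(state, random.choice(ACTIONS))
        total += reward(state)
    return total

def mcts_iteration(node, depth):
    """one select/expand/rollout/backprop pass; returns the sampled return."""
    if depth == 0:
        return 0.0
    untried = [m for m in ACTIONS if m not in node.children]
    if untried:                       # expansion
        child = node.children[untried[0]] = Node(step(node.state, untried[0]))
        ret = reward(child.state) + rollout(child.state, depth - 1)
    else:                             # selection
        child = uct_select(node)
        ret = reward(child.state) + mcts_iteration(child, depth - 1)
    child.visits += 1                 # backprop
    child.value += ret
    return ret

def plan(state, iters=200):
    root = Node(state)
    for _ in range(iters):
        root.visits += 1
        mcts_iteration(root, HORIZON)
    # act on the most-visited root action
    return max(root.children, key=lambda a: root.children[a].visits)

if __name__ == "__main__":
    random.seed(0)
    state = (0, SIZE - 1)             # agents start at opposite ends
    for t in range(10):
        move = plan(state)
        state = step(state, move)
        print(f"t={t} move={move:+d} state={state} "
              f"empowerment={empowerment(state):.2f}")
```

(the point of keeping it this tiny is the interpretability claim above: every tree statistic, rollout, and reward term can be read off directly, so "what is the planner doing as it gains capability" is a question you answer by inspecting the tree rather than probing a network. promoting the scripted partner to a second learner is roughly where it starts to resemble the group-play research mentioned above.)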