This seems related to a thought I had while reading An overview of 11 proposals for building safe advanced AI: how much harder is it to find an environment that promotes aligned AGI than one that promotes AGI at all?
It seems that a lot of the proposals for AGI under the current ML paradigm either use oversight to get a second chance, or add an extra term to the loss function to promote alignment. How well either type of method works seems to depend on the base rate of aligned AGI relative to any AGI that can emerge from a particular model and training environment. I’m thinking of it as roughly

$$\frac{P(\text{aligned AGI} \mid M, E_{\text{base}})}{P(\text{AGI} \mid M, E_{\text{base}})}$$

where $M$ is some model and $E_{\text{base}}$ is the training environment without the safeguards $S$ that detect deceptive or otherwise catastrophic behavior.
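To make the dependence on this base rate concrete, here is a toy sketch, under my own assumptions rather than anything from the post: treat the safeguards $S$ as a filter that catches a misaligned AGI before deployment with some probability $d$, giving the training process its "second chance". The function and its parameters are hypothetical, purely for illustration.

```python
def aligned_rate_with_safeguards(p_aligned: float, p_agi: float, d: float) -> float:
    """Probability that a *deployed* AGI is aligned, in a toy model where
    safeguards act as a pre-deployment filter.

    p_aligned: P(aligned AGI | M, E_base)
    p_agi:     P(any AGI | M, E_base)
    d:         probability the safeguards S catch a misaligned AGI
    """
    p_misaligned = p_agi - p_aligned
    # A misaligned AGI slips past the safeguards with probability (1 - d).
    return p_aligned / (p_aligned + (1 - d) * p_misaligned)

# With a base rate of 1 aligned AGI per 10 AGIs, even 90%-reliable
# safeguards only raise the deployed-aligned rate to about 0.53:
print(aligned_rate_with_safeguards(p_aligned=0.01, p_agi=0.10, d=0.9))  # ~0.526
```

The point of the toy numbers: when the base rate is low, even quite reliable oversight leaves a substantial chance that the AGI which makes it through is misaligned, which is why the ratio above looks like the quantity that matters.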
This post seems to concern the question:

How much does the environment, compared to the model, influence the emergence of AGI?
What I’m trying to get at is that I think a related and important question is:

How much does the alignment of an emerging AGI depend on its environment, compared to the model?