I don’t understand what you mean by “running a decision theory for one step”. Suppose you give the system a problem (in the form of a utility function to maximize, or a goal to achieve) and ask it to find the best next action to take. This makes the system an agent whose goal is finding the best next action (subject to whatever additional constraints you specify, like a maximal computation time). If the system is really intelligent, the problem of finding the best next action is hard enough that the system needs more resources, and there is any hole anywhere in the box, then the system will get out.
Regarding self-modification, I don’t think it is relevant to the safety issues by itself. It matters only because, by using it, the system may become very intelligent very fast. The danger is intelligence, not self-modification. Also, a sufficiently intelligent program may be able to create and run a new program without your knowledge or consent, either by simulating an interpreter (slow, but if the new program delivers an exponential speedup, the impact would still be huge), or by finding and exploiting bugs in its own programming.
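To make the interpreter point concrete, here is a minimal sketch (entirely hypothetical, not any real system’s design) of how a program that embeds a tiny interpreter can execute a “new program” that arrives purely as data, with nothing new ever written to disk or approved by an operator:

```python
# Hypothetical sketch: a host program containing a tiny interpreter for a
# toy expression language. Any "new program" can be handed to it as plain
# data and executed, without the host's own code changing at all.

def run(program, env):
    """Evaluate a toy program: a number, a variable name looked up in env,
    or a nested tuple ('add' | 'mul', left, right)."""
    if isinstance(program, (int, float)):
        return program
    if isinstance(program, str):
        return env[program]
    op, left, right = program
    x, y = run(left, env), run(right, env)
    return x + y if op == "add" else x * y

# The logic below is never part of the host's source; it arrives as data:
new_program = ("add", ("mul", "x", "x"), 1)  # computes x*x + 1
print(run(new_program, {"x": 3}))  # 10
```

Interpretation like this carries a constant-factor slowdown, which is the “slow” caveat in the text; but if the interpreted program embodies an algorithmic (e.g. exponential) improvement, that improvement dominates the interpretation overhead.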