Specifically, you could start by learning about the work already being done in the field of AGI and applying your x-risk ideas to that body of knowledge instead of reinventing the wheel (as AI safety people sadly have often done).
The “reinventing the wheel” I was referencing was the work based on AIXI as a general intelligence algorithm. AIXI does not scale. It is to AGI what photon mapping is to real-time rendering: sound in principle, but far too expensive to be practical. It is already well known that AIXI will result in all sorts of wireheading-like behavior. Yet the proofs of this are heavily dependent on the AIXI architecture, hence my first issue: I don’t trust that these failure modes apply to other architectures unless they can be independently demonstrated there.
My second issue is what I engaged Vaniver on: these are results showing failure modes where the AI’s reward function doesn’t result in the desired behavior; it wireheads instead. That’s not a very interesting result. On the face of it, it is basically just saying that AI software can be buggy. Yes, software can be buggy, and we know how to deal with that. In my day job I manage a software dev team for safety-critical systems. What is really being argued here is that AI has fundamentally different error modes than regular safety-critical software, because the AI could end up acting adversarially and optimizing us out of existence, and being successful at it. That, I am arguing, is both an unjustified cognitive leap and not demonstrated by the examples here.
I replied here because I don’t think I really have more to say on the topic beyond this one post.
I just saw this link, maybe you have thoughts?
(Let’s move subsequent discussion over there)