It’s really nice to hear that the paper seems clear! Thanks for the comment.
I’ve been working on this since March, but at a very slow pace, and I took a few hiatuses. Most days when I worked on it, it was for less than an hour. After coming up with the initial framework to tie things together, the hardest part was trying and failing to think of interesting ways in which most of the Achilles heels presented could be used as novel containment measures. I discuss this a bit in the discussion section.
For 2-3, I can give some thoughts, but these aren’t necessarily thought through much more than those of many other people one could ask.
I would agree with this. For an agent to even have a notion of being turned off, it would need some sort of model that accounts for this but which isn’t learned via experience in a typical episodic learning setting (clearly, because you can’t learn after you’re dead). This would require a world model more sophisticated than anything the model-based RL techniques I know of would be capable of by default.
I would also agree. The most straightforward way for these problems to emerge is if a predictor has access to the agent’s source code, though they can sometimes also occur if the predictor has access to some other means of prediction which cannot be confounded by the choice of what source code the agent runs. I write a little about this in this post: https://www.lesswrong.com/posts/xoQRz8tBvsznMXTkt/dissolving-confusion-around-functional-decision-theory