Fair enough. In that case I'd at least admit that, to my limited knowledge, it's very possible for a dangerous situation to occur if the security ever failed or got hacked.
This implies that an underrated issue for AI control efforts is figuring out how to prevent the AI from easily seizing the reward button, which usually isn't considered specifically.
> Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn't have such long-running disagreements) (i.e. I'm taking John Wentworth's side here)
This is easily my biggest crux, as I'm generally on the opposite side from John Wentworth here: in a lot of domains, verifying a solution is (relatively) easy compared to producing one.
Indeed, I think this is a big reason why research taste is possible at all: there's a vast gap between verifying that a solution is correct and actually finding a correct solution.
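To make that gap concrete, here's a minimal sketch of my own (purely illustrative, not anything from the discussion) using subset-sum, a classic NP problem: checking a proposed answer takes roughly linear time, while the naive search examines exponentially many subsets.

```python
from collections import Counter
from itertools import combinations

def verify(nums, target, candidate):
    """Cheap check: does the candidate sum to target using only
    elements actually available in nums? Roughly O(n) time."""
    available = Counter(nums)
    used = Counter(candidate)
    return (sum(candidate) == target
            and all(available[x] >= k for x, k in used.items()))

def solve(nums, target):
    """Brute-force search: tries up to 2^n subsets in the worst case."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
answer = solve(nums, 9)                 # expensive: exponential search
print(answer, verify(nums, 9, answer))  # cheap check: [4, 5] True
```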
That said, the conversation has been very productive, and while I'm tapping out for now, thank you for the discussion; we found a real crux between our world models.