I claim there’s some overlap with brain-like AGI safety; No coincidence, since he’s explicitly inspired by how the brain works. :)
Interpretability becomes much easier
I would state that in a more pessimistic way, by saying “Interpretability seems extraordinarily hard if not impossible in this approach, but it would be even worse in other approaches”. See discussion here & here for example.
Most of the “intelligence” in the system (the world model) is aimed at increasing predictive accuracy, and the agent is motivated by relatively simple hard-coded drives; whether its intelligent behaviors are safe or dangerous will not be predictable in advance.
I think that’s unduly pessimistic. I think it’s a hard problem, but that it’s at least premature to say that it’s impossible. For example, we get to pick the “relatively simple hard-coded drives”, we get to pick the training data / environment, we get to invent other tricks, etc.
Whether an AI deployment leads to catastrophic outcomes will mostly be a function not of the agent’s properties, but of the safety affordances implemented by the people deploying it
My opinion is that by far the most important determinant, and most important intervention point, is whether the agent is trying to bring about catastrophic outcomes (which, again, I see as a hard but not knowably-doomed thing to intervene on). See here.
I claim there’s some overlap with brain-like AGI safety; No coincidence, since he’s explicitly inspired by how the brain works. :)
I would state that in a more pessimistic way, by saying “Interpretability seems extraordinarily hard if not impossible in this approach, but it would be even worse in other approaches”. See discussion here & here for example.
I think that’s unduly pessimistic. I think it’s a hard problem, but that it’s at least premature to say that it’s impossible. For example, we get to pick the “relatively simple hard-coded drives”, we get to pick the training data / environment, we get to invent other tricks, etc.
My opinion is that by far the most important determinant, and most important intervention point, is whether the agent is trying to bring about catastrophic outcomes (which, again, I see as a hard but not knowably-doomed thing to intervene on). See here.