not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback
Are you mostly looking for where there is useful empirical feedback? That sounds like a shot in the dark.
Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs
A concern I have: I cannot conceptually distinguish these continued empirical investigations into methods for building maybe-aligned AGI from how medieval researchers tried to build perpetual motion machines. It took sound theory (the laws of thermodynamics) to settle once and for all that perpetual motion machines are impossible.
I agree with Charbel-Raphaël that the push for mechanistic interpretability is in effect promoting the notion that it must somehow be possible to control potentially very dangerous AIs so that they stay safe in deployment. It is much easier to spread the perception of safety than to actually make such systems safe.
This is promoted even though there is no sound theoretical basis for claiming that scaling mechanistic interpretability could form the basis of such a control method, nor that any control method could keep “AGI” safe.
Rather, mechint is fundamentally limited in the extent to which it could be used to safely control AGI. See these posts:
The limited upside of interpretability by Peter S. Park
Why mechanistic interpretability does not and cannot contribute to long-term AGI safety by me
Besides these theoretical limits, there are plenty of practical arguments (as listed in Charbel-Raphaël’s post) for why scaling up the use of mechint would be net harmful.
So there is no rigorous basis for claiming that the use of mechint would “open up possibilities” for long-term safety. And there are plenty of possibilities for corporate marketers to chime in on mechint’s hypothetical big breakthroughs.
In practice, we may once again accidentally help AI labs safety-wash their AI products.
It does seem like a large proportion of disagreements in this space can be explained by how hard people think alignment will be. It seems like your view is actually more pessimistic about the difficulty of alignment than Eliezer’s, because he at least thinks it’s possible for mechinterp to help in principle.
I think that being confident in this level of pessimism is wildly miscalibrated, and such a big disagreement that it’s probably not worth discussing much further. Though I reply indirectly to your point here.
I personally think pessimistic vs. optimistic misframes it, because it frames a question about the world in terms of personal predispositions.
I would like to see reasoning.
Your reasoning in the comment thread you linked to is:
“history is full of cases where people dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems”
That’s a broad reference-class analogy to use. I think it holds little to no weight as to whether there would be sufficient progress on the specific problem of “AGI” staying safe over the long-term.
I wrote up why that specific problem would not be solvable.