The last section felt like it lost contact most severely. It says
What are the main objections to AI for AI safety?
It notably does not say "What are the main ways AI for AI safety might fail?" or "What are the main uncertainties?" or "What are the main bottlenecks to the success of AI for AI safety?". It's worded in terms of "objections", and implicitly, it seems, we're talking about objections which people make in the current discourse. And looking at the classification in that section ("evaluation failures, differential sabotage, dangerous rogue options"), it indeed sounds more like a classification of objections in the current discourse, as opposed to a classification of object-level failure modes drawn from a less-social-reality-loaded distribution of failures.
I do also think the frame in the earlier part of the essay is pretty dubious in some places, but that feels more like object-level ontological troubles and less like it’s anchoring too much on social reality. I ended up writing a mini-essay on that which I’ll drop in a separate reply.
I agree it's generally better to frame in terms of object-level failure modes rather than "objections" (though sometimes one is intentionally responding to objections that other people raise, but that you don't buy). And I think there is indeed a mindset difference here. That said: your comment here is about word choice. Are there substantive considerations you think that section is missing, or substantive mistakes you think it's making?