Nice post! I think these are good criticisms that don’t justify the title. Points 1 through 4 are all (specific, plausible) examples of ways we may interpret the activation space incorrectly. This is worth keeping in mind, and I agree that just looking at the activation space of a single layer isn’t enough, but it still seems like a very good place to start.
A layer’s activation is a relatively simple space, constructed by the model, that contains all the information that the model needs to make its prediction. This makes it a great place to look if you’re trying to understand how the model is thinking.
I really disagree with this piece and others like it. I think there’s a selectively applied fatalism about frontier labs that is entirely unwarranted. Some examples of this selective fatalism:
> Each lab’s emphasis on alignment varies, but none are on track to solve the hard problems, or to prevent these machines from growing irretrievably incompatible with human life.
The entire argument for avoiding frontier labs falls apart if you admit even a 20% likelihood that frontier labs will create aligned superintelligence. That 20% implies that a motivated person joining could push the odds upward, which would be an incomprehensibly beneficial and heroic thing for that person to do.
> I don’t expect the marginal extra researcher to substantially improve these odds, even if they manage to resist the oppressive weight of subtle and unsubtle incentives.
Why not? And, why would they have to substantially improve these odds? Pushing the odds from 20% to 20.01% would be an incredible accomplishment for one person.
> The claim: Working within a lab can position a safety-conscious individual to influence the course of that lab’s decisions.
> My assessment: I admit I have a hard time steelmanning this case. It seems straightforwardly true that no individual entering the field right now will be meaningfully positioned to slow the development of superhuman AI from inside a lab.
A group is composed of people. The specific beliefs of the people in that group will be important for deciding what that group does.
If you shake off the fatalism and look at things clearly, you should realize: joining a frontier lab is an incredible opportunity to make things go better. If anyone has the skills to go for it, I highly recommend they gather their courage and do so.