Hi, thank you for a really nice and well-thought-out response. And yeah, I was mostly engaging with the rhetorical emphasis of your arguments in this post and not the technical arguments, sorry. I’ll try to respond to the technical aspects as I understand them, with some assistance from Claude to understand a few things.
I now understand you to be taking the position that the general procedures Anthropic followed in releasing this model would not generalize well to a much more capable model, and so they are not doing their due diligence in applying guardrails that would stop a more threatening model in the future. If that is your position, then I’m closer to agreement. But I disagree with the position that Anthropic shouldn’t have privately released this model.
On the classifier prompt blocking: I had assumed some sort of RL training would make the model not output certain dangerous things, and it now seems it’s actually a lighter-weight model that analyzes prompts specifically for disallowed content. On the technical side, this seems mostly fine to me, since the cybersecurity experts are presumably deliberately trying to identify vulnerabilities with mythos and probably need to be able to ask things that normal everyday users wouldn’t be allowed to. My previous comment was more that, had Anthropic applied more consumer-level safeguards, you could have taken the position that this was concerning, since it could be interpreted as preparation for release. That came out sounding like a bad-faith assumption, but I would be curious whether you would be less concerned in the counterfactual where they did apply the classifier prompt blocking.
I think it’s entirely possible that the model accelerated AI progress at a higher rate than Anthropic quoted, but I don’t see a concern there. The danger level of current models is not existential. I would support a pause when we get close to that point, or if acceleration reaches a pace where we might not be able to detect that we are getting close in time to pause. Right now I truly think we are nowhere near capabilities where we need to pause, and the pace of acceleration is not fast enough for us to miss our chance.
Here are some circumstances where, if such a thing happened, I would support a slowdown or pause:
A coding agent tries to achieve some task and bends explicit rules, as we have previously seen, but this time it takes deliberate precautions and countermoves, such that it takes multiple competent software engineers a few hours of trying different things it has anticipated before they finally find some way to stop the process.
AI reliably produces writing (say, 10%+ of the time from a single prompt) that matches or exceeds the best human writers. This would be around the level of persuasion where it would be difficult to tell whether a misaligned AI is simply controlling humans.
A terrorist attack killing 30+ people carried out by a school-shooter-type teenager with an AI assistant. Or something comparable in how much it enhances the competence of bad actors, letting them do something far beyond what they would normally be capable of alone.
Maybe one week on the METR 50% time-horizon benchmark? This is the only one I’m familiar with, and I’m not sure how good the others are. But I think with sufficient long-term planning and focus, a misaligned AI could be dangerous enough. I’m unsure about this one: if the AI one-shots bigger and bigger codebases but still can’t do complex real-world planning, then probably not. I work in biotech in a knowledge-work-type position; if I can tell Claude to just access my computer and do my job, and it explores the computer, figures literally everything out, and can act as me perfectly, then maybe.
Aside from those things, I’m not too concerned with mythos’s hacking prowess. That mostly seems like a “dangerous when enabling a human” kind of thing, not an entity that threatens us on its own. And having experimented with jailbreaking AIs, it seems to me that jailbreaking makes the AI a lot worse at its task and is generally unreliable when consumer-facing safeguards are up. So that, in addition to giving the model to major cybersecurity companies to patch their systems first, seems like a good line of defense for mythos. Plus, Anthropic doesn’t even seem to want to release mythos later, and they seem to be developing even stronger guardrails which they will test on new Opus models.
Overall, I think AI doom is possible, and that in the next few years we probably will reach one of the points I mentioned above where we need to stop. But in the meantime, we will build great companies, and we shouldn’t worry about new models unless they actually seem not worth the tradeoff by themselves: if we could pause either right before this model or right after it, which one would you want?
I would be happy to see where you think I’m wrong on this more technical front, though.