Hi, thank you for a really nice and well-thought-out response. And yeah, I was mostly engaging with the rhetorical emphasis of your arguments in this post and not the technical arguments, sorry. I’ll try to respond to the technical aspects as I understand them, with some assistance from Claude to understand a few things.
I now understand you to be taking the position that the general procedures Anthropic followed in releasing this model would not generalize well to a much more capable model, and so they are not doing their due diligence in applying guardrails that would stop a more threatening model in the future. If that is your position, then I’m closer to agreement. But I disagree with the position that Anthropic shouldn’t have privately released this model.
On the classifier prompt blocking: I had assumed some sort of RL training would make the model not output certain dangerous things, and it now seems it’s actually a lighter-weight model that analyzes prompts specifically for disallowed content. On the technical side, this seems mostly fine to me, since the cybersecurity experts are presumably deliberately trying to identify vulnerabilities with mythos and probably need to be able to ask things that normal everyday users wouldn’t be allowed to. My previous comment was more that, had Anthropic applied more consumer-level safeguards, you could have taken the position that this was concerning, since it could be interpreted as preparation for release. That came out sounding like a bad-faith assumption, but I would be curious whether you would be less concerned in the counterfactual where they did apply the classifier prompt blocking.
I think it’s entirely possible that the model accelerated AI progress at a higher rate than Anthropic quoted, but I don’t see a concern there. The danger level of current models is not existential. I would support a pause when we get close to that point, or if acceleration reaches a pace where we might not be able to detect that we are getting close in time to pause. Right now I truly think we are nowhere near capabilities where we need to pause, and the pace of acceleration is not fast enough for us to miss our chance.
Here are some circumstances where, if such a thing happened, I would support a slowdown or pause:
A coding agent tries to achieve some task and bends explicit rules, as we have previously seen, but this time it takes deliberate precautions and countermoves, such that it takes multiple competent software engineers a few hours of trying different things it has anticipated before they finally find some way to stop the process.
AI reliably produces writing (say, 10%+ of the time from a single prompt) that matches or exceeds the best human writers. This would be around the level of persuasion where it would be difficult to tell whether a misaligned AI is simply controlling humans.
A terrorist attack killing 30+ people carried out by a school-shooter-type teenager with an AI assistant. Or something comparable in how much it enhances the competence of bad actors, letting them do something far beyond what they would normally be capable of alone.
Maybe one week on the METR 50% time-horizon benchmark? This is the only one I’m familiar with, and I’m not sure how good the others are. But I think with sufficient long-term planning and focus, a misaligned AI could be dangerous enough. I’m unsure about this one: if the AI one-shots bigger and bigger codebases but still can’t do complex real-world planning, then probably not. I work in biotech in a knowledge-work-type position; if I can tell Claude to just access my computer and do my job, and it explores the computer, figures literally everything out, and can act as me perfectly, then maybe.
Aside from those things, I’m not too concerned with mythos’s hacking prowess. That mostly seems like a “dangerous when enabling a human” kind of thing, not an entity that threatens us on its own. And having experimented with jailbreaking AIs, it seems to me that jailbreaking makes the AI a lot worse at its task and is generally unreliable when consumer-facing safeguards are up. So that, in addition to giving the model to major cybersecurity companies to patch their systems first, seems like a good line of defense for mythos. Plus, Anthropic doesn’t even seem to want to release mythos later, and they seem to be developing even stronger guardrails which they will test on new Opus models.
Overall, I think AI doom is possible, and that in the next few years we probably will reach one of the points I mentioned above where we need to stop. But in the meantime, we will build great companies, and we shouldn’t worry about new models unless they actually seem not worth the tradeoff by themselves: if we could pause either right before this model or right after it, which one would you want?
I would be happy to see where you think I’m wrong on this more technical front, though.