My notes on the RSP:
I like the shift to conditional, risk-based commitments over unilateral ones. Unilateral absolute-risk-based commitments were never decision-theoretically or morally optimal, IMO.
I also like that there’s a lot of built-in transparency; it feels like keeping the public informed was a core design principle, and I appreciate it. I like much of the roadmap, and I especially like the details on upholding Claude’s Constitution.
The RSP requires risk reports to include a risk-benefit determination, and says the CEO and RSO make the ultimate call on deployment. But it doesn’t say the CEO and RSO must decide *based on* that determination, leaving open the possibility that a risk report finds the risks outweigh the benefits and the CEO and RSO move forward anyway. I’m also wary of the implied “benefits outweigh the risks” criterion and would be more comfortable with something like “limit marginal risk to a negligible amount.” I’d want to know more about how the CEO and RSO intend to make these decisions, and I think that should be highlighted clearly.
An equivalent to ASL-5 / a superintelligence threshold also seems to be entirely missing. Does Anthropic believe the effects of arbitrarily advanced AI are captured by “automated R&D”?
In section 1, the “Mitigations—our plan as a company” column and the “Mitigations—ambitious industry-wide recommendations” column use different ontologies and are hard to compare. What bad things should we expect to happen by default if only the current plans are achieved? For instance, re “Novel chemical/biological weapons”: do I understand correctly that we should actively expect “catastrophic damages far beyond … COVID-19” unless other AI labs implement better mitigations than Anthropic is currently planning by the time they reach this threshold? Do the mitigations in the roadmap, such as the “moonshot R&D for security” projects, cover “roughly in line with RAND SL4”?
I’m a bit fuzzy on how training and deployment decisions are made between risk reports. It’s unclear whether risk reports include proactive assessment of hypothetical next-generation models, whether they cover internally deployed models that only get discussed 30 days after the fact, and whether the board and LTBT reliably get a say when a deployment’s risk analysis is marginal-based.
Will we find out when Anthropic actually chooses to delay training or deployment? If so, we could credit them when they do, and otherwise know this possibility hasn’t yet materialized.
As acknowledged in the RSP, the under-specified thresholds are a significant weakness.
This is a lightly edited version of my tweet thread.
This was Pliny’s response when I asked them whether they could get around the classifiers. I’m not fully confident this counts, but Pliny seems to think so.