Thanks for publishing this! @Bogdan Ionut Cirstea, @Ronak_Mehta, and I have been pushing for this (e.g., building an organization around it, scaling up the funding to reduce integration delays). Overall, it's easy to get demoralized about this kind of work due to a lack of funding, but I'm not giving up, and I'm trying to be strategic about how we approach things.
I want to leave a detailed comment later, but just quickly:
Several months ago, I shared an initial draft proposal for a startup I had been working towards (still am, though under a different name). At the time, I did not make it public due to dual-use concerns. I tried to keep it concise, so I didn’t flesh out all the specifics, but in my opinion, it relates to this post a lot.
I have many more fleshed-out plans that I've shared privately in some Slack channels, but have otherwise kept mostly private. If you'd like access to some additional details or want to talk about it, please let me know! My thoughts on the topic have evolved, and we've been working on some things in the background that we haven't shared publicly yet.
I've been mentoring a SPAR project aimed at better understanding how we can leverage current AI agents to automate interpretability (or at least prepare the tooling so that this is possible as soon as models can do it). In fact, this project is pretty much exactly what you described in the post: it involves trying out SAE variants and automatically running SAEBench. We'll hopefully share our insights and experiments soon.
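To give a concrete picture of the loop we have in mind, here's a minimal Python sketch. The helpers (`train_sae_variant`, `run_saebench`, `propose_next_config`) are placeholders for our own tooling rather than SAEBench's actual API, and the agent behind `propose_next_config` stands in for whatever frontier model drives the search:

```python
# Minimal sketch of an agent-driven SAE sweep; all helpers are placeholders,
# not SAEBench's real API.

def train_sae_variant(config: dict) -> str:
    """Train one SAE variant (e.g. a different architecture or sparsity) and return a checkpoint path."""
    raise NotImplementedError("plug in the project's SAE training code here")

def run_saebench(checkpoint: str) -> dict:
    """Run the SAEBench evaluation suite on a checkpoint and return its metric scores."""
    raise NotImplementedError("plug in the project's SAEBench wrapper here")

def propose_next_config(history: list[dict]) -> dict | None:
    """Ask the AI agent for the next config to try, given all results so far; None means stop."""
    raise NotImplementedError("plug in the agent call here")

def automated_sweep(initial_config: dict, budget: int = 20) -> list[dict]:
    """Train a variant, benchmark it automatically, and let the agent pick the next one."""
    history: list[dict] = []
    config: dict | None = initial_config
    for _ in range(budget):
        if config is None:  # the agent decided the sweep is finished
            break
        checkpoint = train_sae_variant(config)
        scores = run_saebench(checkpoint)
        history.append({"config": config, "scores": scores})
        config = propose_next_config(history)
    return history
```

The interesting part is almost entirely inside `propose_next_config`: the more of the train/evaluate/iterate loop the agent can drive on its own, the closer this gets to the automated interpretability pipeline described in the post.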
We are actively fundraising for an organization that would carry out this work. We’d be happy to receive donations or feedback, and we’re also happy to add people to our waitlist as we make our tooling available.
Do you think building a new organization around this work would be more effective than implementing these ideas at a major lab?
Anthropic is already trying out some of this, and the other labs will surely do some things too. But as with every research agenda, the fact that the labs may be doing something useful for safety shouldn't deter those of us on the outside.
I hear this question a lot, but I don't really hear people question whether we should have had mech interp or evals orgs outside of the labs, and yet we have multiple of those. Maybe it means we should do a bit less of this outside the labs, but I wouldn't say the optimal number of outside orgs working on the same things as the AGI labs is zero.
Overall, I do like the idea of having an org that can work on automating alignment research while not having a frontier-model end-to-end RL team down the hall.
In practice, this separate org can work directly with all of the AI safety orgs and independent researchers, while the AI labs will likely not be as hands-on when it comes to those kinds of collaborations or to automating outside agendas. At the very least, I would rather not bet on them being so.
That makes sense. However, I do see this as meaningfully different from evals or mech interp, in that to make progress you really want access to frontier models and lots of compute/tooling. So for individuals who want to prioritize this approach, it might make sense to try to join a safety team at a lab first.
We can get compute outside of the labs. If grantmakers, governments, service providers willing to donate compute, and so on make a group effort and take action, we could get several million dollars' worth of additional compute spent directly on automated safety research. An org that works towards this will be in a position to absorb the money currently sitting in those war chests.
This is an ambitious project that makes it incredibly easy to absorb enormous amounts of funding directly for safety research.
There are already enough people in AI safety who want to go work at the big labs; others will try that path by default, so I'm personally less inclined to. Anthropic has a team working on this, and they will keep working on it (I hope it works and that they share the safety outputs!).
What we need is agentic people who can make things happen on the outside.
I think we have access to frontier models early enough, and our current bottleneck to getting this off the ground is not the next frontier model (though that obviously helps) but literally setting up all of the infrastructure/scaffolding needed to make use of current models; that alone could take over two years. We can use current models to make progress on automating research, but it's even better if we set everything up to leverage the next models that drop in six months, so we get a bigger jump in automated safety research than the raw model alone provides (maybe even bigger than what the labs get from their own scaffolds).
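To illustrate what I mean by reusable, model-agnostic scaffolding (the class and method names below are purely illustrative, not any lab's actual API): if the research pipeline only depends on a thin agent interface, swapping in the next frontier model when it drops is a configuration change, and all of the infrastructure work carries over.

```python
# Toy illustration of model-agnostic scaffolding; class and method names are placeholders.

from typing import Protocol

class ResearchAgent(Protocol):
    def run_task(self, task_description: str, context: str) -> str:
        """Carry out one research step (write code, run an experiment, summarize results)."""
        ...

class CurrentModelAgent:
    """Wraps today's frontier model plus the tool-use / sandboxing layer around it."""
    def run_task(self, task_description: str, context: str) -> str:
        raise NotImplementedError("call the current provider API here")

class NextModelAgent:
    """Drop-in replacement once the next frontier model ships; the scaffolding is unchanged."""
    def run_task(self, task_description: str, context: str) -> str:
        raise NotImplementedError("call the newer provider API here")

def run_safety_pipeline(agent: ResearchAgent, tasks: list[str]) -> list[str]:
    """All the scaffolding (task decomposition, memory, review) lives here and is reused
    across model generations; only `agent` changes when a better model arrives."""
    results: list[str] = []
    context = ""
    for task in tasks:
        output = agent.run_task(task, context)
        context += "\n" + output  # naive running memory; a real scaffold would do better
        results.append(output)
    return results
```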
I believe a conscious group effort to leverage AI agents for safety research could allow us to make current models as good as (or better than) the next generation of models. All outside orgs could then have access to automated safety researchers that are potentially even better than the labs' own, due to the difference in scaffolding (even if the labs have a generally better raw model).
Interested in chatting about this! Will send you a pm :D.