This plan is pretty abstract (which is necessary because it’s short) but in some ways I think it’s better than any of the AI companies’ published 400-page plans. From what I’ve seen, AI companies don’t care enough about trying to break their own evals, and they don’t care enough about theory work.
Maybe this is too in the weeds but I’m skeptical that we can create robust alignment evals using anything resembling current methods. A superintelligent AI will be better at breaking evals than humans, so I expect there is a big gap between “our top researchers have tried and failed to find any loopholes in our alignment evals” and “a superintelligence will not be able to find any loopholes”.
AI companies would probably answer that they only need to align weaker models and then bootstrap to an aligned superintelligence. You didn’t talk about that in your meta-plan though. I think it would be good for you to talk about that in your meta-plan, since it’s what every AI company (with an alignment team) currently plans on doing.
> AI companies would probably answer that they only need to align weaker models and then bootstrap to an aligned superintelligence. You didn’t talk about that in your meta-plan though.
I’ve seen that plan and the claims to do it. It seems to be essentially horseshit though. Also, what I read of the ‘superalignment’ paper claiming this seemed to basically be a method for getting better synthetic data while vibing out about safety to make employees and recruits feel good.
Yeah I agree that it’s not a good plan. I just think that if you’re proposing your own plan, your plan should at least mention the “standard” plan and why you prefer to do something different. Like give some commentary on why you don’t think alignment bootstrapping is a solution. (And I would probably agree with your commentary.)
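For context on the exchange above: the “standard” plan is roughly an iterated loop in which each generation of models, assumed to be aligned well enough, is used to oversee and align the next, stronger generation. Below is a deliberately toy Python sketch of that loop; every function, number, and assumption in it is an invented placeholder for illustration, not a description of any lab’s actual pipeline.

```python
# Toy sketch of the "alignment bootstrapping" loop discussed above.
# Everything here is a hypothetical placeholder, not any lab's actual pipeline;
# the disagreement in this thread is about whether a step like
# `align_with_weaker_overseer` can actually work.

def scale_up(model: dict) -> dict:
    """Stand-in for training the next, more capable generation."""
    return {"capability": model["capability"] * 2, "aligned": False}

def align_with_weaker_overseer(strong: dict, overseer: dict) -> dict:
    """Stand-in for the scalable-oversight step: a weaker model we already
    trust supervises, evaluates, and red teams the stronger one. Whether
    alignment actually transfers upward like this is the contested assumption."""
    strong["aligned"] = overseer["aligned"]  # optimistic placeholder assumption
    return strong

model = {"capability": 1.0, "aligned": True}  # assume today's models can be aligned
for generation in range(1, 6):
    stronger = scale_up(model)
    stronger = align_with_weaker_overseer(stronger, overseer=model)
    print(f"generation {generation}: capability x{stronger['capability']:.0f}, "
          f"aligned={stronger['aligned']}")
    model = stronger
```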
> A superintelligent AI will be better at breaking evals than humans, so I expect there is a big gap between “our top researchers have tried and failed to find any loopholes in our alignment evals” and “a superintelligence will not be able to find any loopholes”.
Agreed. This is why this plan is ultimately just a tool to get good/useful theory work done faster, more efficiently, etc.
I agree. This is why we need to make new methods. E.g., in January, a team in our hackathon built one of the first interpretability-based evals for LLMs: https://github.com/gpiat/AIAE-AbliterationBench/
They were pretty cracked (a researcher at JPMorgan Chase, an AI PhD, and a very smart high schooler), but I think it’s doable for others to do this and build other new methods too.
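As a general illustration of what an interpretability-based eval can look like, here is a minimal runnable sketch of the abliteration-style idea: estimate a “refusal direction” as a difference of mean activations between harmful and harmless prompts, then check how much ablating that direction changes behavior. Synthetic vectors stand in for real model activations, and none of the names below come from the linked repo.

```python
# Minimal sketch of an interpretability-based eval in the spirit of abliteration:
# estimate a "refusal direction" from activations and measure how much projecting
# it out changes refusal-related behavior. Synthetic data stands in for real
# model activations, so this runs as-is; it illustrates the general technique,
# not the linked benchmark's actual code.

import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size of the (pretend) model

# Pretend residual-stream activations for harmless vs. harmful prompts.
true_refusal_dir = rng.normal(size=d)
true_refusal_dir /= np.linalg.norm(true_refusal_dir)
harmless_acts = rng.normal(size=(200, d))
harmful_acts = rng.normal(size=(200, d)) + 3.0 * true_refusal_dir

# Step 1: estimate the refusal direction as a difference of means.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Step 2: ablate it by projecting activations onto its orthogonal complement.
def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    return acts - np.outer(acts @ direction, direction)

# Step 3: score how strongly activations still load on the refusal direction.
def refusal_score(acts: np.ndarray) -> float:
    return float(np.mean(acts @ refusal_dir))

print("refusal score before ablation:", refusal_score(harmful_acts))
print("refusal score after ablation: ", refusal_score(ablate(harmful_acts, refusal_dir)))
# A real eval built on this would compare downstream refusal rates before and
# after ablation, rather than relying only on black-box prompting.
```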
Very unfinished, but we’re also making an evals course to make this easier, and it will be used in a course at an Ivy League university: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.0
A lot of the flaws in the plan come down to the fact that we still need to specify what things actually mean, and we’re going to need to improve, iterate on, and better explain and justify that specification. E.g., what does it actually mean for an eval to be red teamed? What counts?
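To make that specification problem concrete, here is one hypothetical way to start operationalizing “this eval has been red teamed” as an explicit, checkable record. All field names and thresholds below are invented placeholders; deciding what actually belongs in such a spec, and justifying it, is exactly the unfinished work being described.

```python
# One hypothetical way to start pinning down "this eval has been red teamed".
# Every field and threshold below is an invented placeholder; deciding what
# actually belongs here (and why) is the unresolved specification work.

from dataclasses import dataclass, field

@dataclass
class RedTeamRecord:
    eval_name: str
    independent_red_teamers: int           # people with no stake in the eval passing
    person_hours_spent: float
    attack_strategies_tried: list[str] = field(default_factory=list)
    loopholes_found: int = 0
    loopholes_fixed: int = 0
    results_published: bool = False         # can outsiders check the work?

    def counts_as_red_teamed(self) -> bool:
        """Placeholder criteria; the thresholds are arbitrary on purpose."""
        return (
            self.independent_red_teamers >= 3
            and self.person_hours_spent >= 40
            and len(self.attack_strategies_tried) >= 5
            and self.loopholes_fixed == self.loopholes_found
            and self.results_published
        )

record = RedTeamRecord(
    eval_name="example-alignment-eval",
    independent_red_teamers=3,
    person_hours_spent=60,
    attack_strategies_tried=["prompt injection", "sandbagging", "spec gaming",
                             "eval-awareness probing", "data contamination"],
    loopholes_found=2,
    loopholes_fixed=2,
    results_published=True,
)
print("counts as red teamed?", record.counts_as_red_teamed())
```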
Btw, a cool thing to see is that we’re inspiring others to red team evals too: https://anloehr.github.io/ai-projects/salad/RedTeaming.SALAD.pdf