This plan is pretty abstract (which is necessary because it’s short) but in some ways I think it’s better than any of the AI companies’ published 400-page plans. From what I’ve seen, AI companies don’t care enough about trying to break their own evals, and they don’t care enough about theory work.
Maybe this is too in the weeds but I’m skeptical that we can create robust alignment evals using anything resembling current methods. A superintelligent AI will be better at breaking evals than humans, so I expect there is a big gap between “our top researchers have tried and failed to find any loopholes in our alignment evals” and “a superintelligence will not be able to find any loopholes”.
AI companies would probably answer that they only need to align weaker models and then bootstrap to an aligned superintelligence. You didn’t talk about that in your meta-plan though. I think it would be good for you to talk about that in your meta-plan, since it’s what every AI company (with an alignment team) currently plans on doing.
> AI companies would probably answer that they only need to align weaker models and then bootstrap to an aligned superintelligence. You didn’t talk about that in your meta-plan though.
I’ve seen that plan and the claims to do it. It seems to be essentially horseshit though. Also, what I read of the ‘superalignment’ paper claiming this seemed to basically be a method for getting better synthetic data while vibing out about safety to make employees and recruits feel good.
Yeah I agree that it’s not a good plan. I just think that if you’re proposing your own plan, your plan should at least mention the “standard” plan and why you prefer to do something different. Like give some commentary on why you don’t think alignment bootstrapping is a solution. (And I would probably agree with your commentary.)
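For context on the exchange above: the “standard” plan is roughly an iterated loop in which each generation of models, assumed to be aligned well enough, is used to oversee and align the next, stronger generation. Below is a deliberately toy Python sketch of that loop; every function, number, and assumption in it is an invented placeholder for illustration, not a description of any lab’s actual pipeline.

```python
# Toy sketch of the "alignment bootstrapping" loop discussed above.
# Everything here is a hypothetical placeholder, not any lab's actual pipeline;
# the disagreement in this thread is about whether a step like
# `align_with_weaker_overseer` can actually work.

def scale_up(model: dict) -> dict:
    """Stand-in for training the next, more capable generation."""
    return {"capability": model["capability"] * 2, "aligned": False}

def align_with_weaker_overseer(strong: dict, overseer: dict) -> dict:
    """Stand-in for the scalable-oversight step: a weaker model we already
    trust supervises, evaluates, and red teams the stronger one. Whether
    alignment actually transfers upward like this is the contested assumption."""
    strong["aligned"] = overseer["aligned"]  # optimistic placeholder assumption
    return strong

model = {"capability": 1.0, "aligned": True}  # assume today's models can be aligned
for generation in range(1, 6):
    stronger = scale_up(model)
    stronger = align_with_weaker_overseer(stronger, overseer=model)
    print(f"generation {generation}: capability x{stronger['capability']:.0f}, "
          f"aligned={stronger['aligned']}")
    model = stronger
```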
> A superintelligent AI will be better at breaking evals than humans, so I expect there is a big gap between “our top researchers have tried and failed to find any loopholes in our alignment evals” and “a superintelligence will not be able to find any loopholes”.
Agreed. This is why this plan is ultimately just a tool to get good/useful theory work done faster, more efficiently, etc.
I agree. This is why we need to make new methods. E.g., in January, a team in our hackathon built one of the first interpretability-based evals for LLMs: https://github.com/gpiat/AIAE-AbliterationBench/
They were pretty cracked (a researcher at JPMorgan Chase, an AI PhD, and a very smart high schooler), but I think it’s doable for others to do this and build other new methods too.
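As a general illustration of what an interpretability-based eval can look like, here is a minimal runnable sketch of the abliteration-style idea: estimate a “refusal direction” as a difference of mean activations between harmful and harmless prompts, then check how much ablating that direction changes behavior. Synthetic vectors stand in for real model activations, and none of the names below come from the linked repo.

```python
# Minimal sketch of an interpretability-based eval in the spirit of abliteration:
# estimate a "refusal direction" from activations and measure how much projecting
# it out changes refusal-related behavior. Synthetic data stands in for real
# model activations, so this runs as-is; it illustrates the general technique,
# not the linked benchmark's actual code.

import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size of the (pretend) model

# Pretend residual-stream activations for harmless vs. harmful prompts.
true_refusal_dir = rng.normal(size=d)
true_refusal_dir /= np.linalg.norm(true_refusal_dir)
harmless_acts = rng.normal(size=(200, d))
harmful_acts = rng.normal(size=(200, d)) + 3.0 * true_refusal_dir

# Step 1: estimate the refusal direction as a difference of means.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Step 2: ablate it by projecting activations onto its orthogonal complement.
def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    return acts - np.outer(acts @ direction, direction)

# Step 3: score how strongly activations still load on the refusal direction.
def refusal_score(acts: np.ndarray) -> float:
    return float(np.mean(acts @ refusal_dir))

print("refusal score before ablation:", refusal_score(harmful_acts))
print("refusal score after ablation: ", refusal_score(ablate(harmful_acts, refusal_dir)))
# A real eval built on this would compare downstream refusal rates before and
# after ablation, rather than relying only on black-box prompting.
```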
Very unfinished, but we’re also making an evals course to make this easier, and it will be used in a course at an Ivy League university: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.0
A lot of the flaws in the plan come down to the fact that we still need to specify what things actually mean, and we’re going to need to improve, iterate on, and better explain and justify that specification. E.g., what does it actually mean for an eval to be red teamed? What counts?
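To make that specification problem concrete, here is one hypothetical way to start operationalizing “this eval has been red teamed” as an explicit, checkable record. All field names and thresholds below are invented placeholders; deciding what actually belongs in such a spec, and justifying it, is exactly the unfinished work being described.

```python
# One hypothetical way to start pinning down "this eval has been red teamed".
# Every field and threshold below is an invented placeholder; deciding what
# actually belongs here (and why) is the unresolved specification work.

from dataclasses import dataclass, field

@dataclass
class RedTeamRecord:
    eval_name: str
    independent_red_teamers: int           # people with no stake in the eval passing
    person_hours_spent: float
    attack_strategies_tried: list[str] = field(default_factory=list)
    loopholes_found: int = 0
    loopholes_fixed: int = 0
    results_published: bool = False         # can outsiders check the work?

    def counts_as_red_teamed(self) -> bool:
        """Placeholder criteria; the thresholds are arbitrary on purpose."""
        return (
            self.independent_red_teamers >= 3
            and self.person_hours_spent >= 40
            and len(self.attack_strategies_tried) >= 5
            and self.loopholes_fixed == self.loopholes_found
            and self.results_published
        )

record = RedTeamRecord(
    eval_name="example-alignment-eval",
    independent_red_teamers=3,
    person_hours_spent=60,
    attack_strategies_tried=["prompt injection", "sandbagging", "spec gaming",
                             "eval-awareness probing", "data contamination"],
    loopholes_found=2,
    loopholes_fixed=2,
    results_published=True,
)
print("counts as red teamed?", record.counts_as_red_teamed())
```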
Btw, a cool thing to see is that we’re inspiring others to red team evals too: https://anloehr.github.io/ai-projects/salad/RedTeaming.SALAD.pdf