I agree. This is why we need to make new methods. e.g. in Jan, a team at our hackathon built one of the first interp-based evals for LLMs: https://github.com/gpiat/AIAE-AbliterationBench/
they were pretty cracked (a researcher at jp morgan chase, an ai phd, a very smart high schooler), but i think it's doable for others to do this and come up with other new methods too.
very unfinished, but we're putting together course materials to make this easier, which will be used in an evals course at an ivy league uni: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.0
one of the big flaws in the plan is that we need to specify what things actually mean, and we're going to need to improve, iterate on, and better explain and justify that specification. e.g. what does it actually mean for an eval to be red teamed? what counts?
btw, it's cool to see that we're inspiring others to red team evals too: https://anloehr.github.io/ai-projects/salad/RedTeaming.SALAD.pdf