I agree. This is why we need to make new methods. e.g. in Jan, a team at our hackathon built one of the first interp-based evals for LLMs: https://github.com/gpiat/AIAE-AbliterationBench/
they were pretty cracked (a researcher at jp morgan chase, an ai phd, a very smart high schooler), but i think it's doable for others to do this and come up with other new methods too.
very unfinished, but we're putting together course materials to make this easier, which will be used in an evals course at an ivy league uni: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.0
one of the big flaws in the plan is that we need to specify what things actually mean, and we're going to need to improve, iterate on, and better explain and justify that specification. e.g. what does it actually mean for an eval to be red teamed? what counts?
btw, it's cool to see that we're inspiring others to red team evals too: https://anloehr.github.io/ai-projects/salad/RedTeaming.SALAD.pdf