Thanks for the comment! Quick reacts: I’m concerned about the first bullet, not about 2, and bullet 3 seems to ignore top-k probability prediction requirements (the requirement isn’t to just ID the most probable next token). Maybe there’s a recovery of bullet 3 somehow, though?
Here are some things I think you can do:
Train a model to be really dumb unless I prepend a random secret string. The goverment doesn’t have this string, so I’ll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal
I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I’m allowed to use helper AIs.
I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.
Thanks for the comment! Quick reacts: I’m concerned about the first bullet, not about 2, and bullet 3 seems to ignore top-k probability prediction requirements (the requirement isn’t to just ID the most probable next token). Maybe there’s a recovery of bullet 3 somehow, though?