It’s too bad the BALROG benchmark isn’t being updated with the newest models. NetHack is really hard, gives a floating-point score, and is text-based, so if a model is vision-impaired (like the Claudes) there’s less confounding from “the model just can’t see where it is”.