I reached out to one of the authors for a call and spent two hours poring over the code for the LLM judge line by line to look for obvious bugs in the implementation. I failed to find any errors, though there still might be some in the sections I could not get around to reviewing.