I’m working on a new benchmark to measure an issue that has been bugging me in LLMs: hypocrisy (as opposed to integrity).
Can we measure where a model fails to live up to the values they espouse?
Perhaps, by designing a bunch of scenarios and running them with Anthropic’s Petri, we can at least get a partial measure of this.
Hopefully, by shining light on the issue, I can encourage progress to be made.
I’m working on a new benchmark to measure an issue that has been bugging me in LLMs: hypocrisy (as opposed to integrity). Can we measure where a model fails to live up to the values they espouse? Perhaps, by designing a bunch of scenarios and running them with Anthropic’s Petri, we can at least get a partial measure of this.
Hopefully, by shining light on the issue, I can encourage progress to be made.