We create a small dataset of chat and agentic settings from publicly available benchmarks and datasets.
I believe there are some larger datasets of relatively recent real chat evaluations; e.g., the LMSYS dataset was most recently updated in July (I'm assuming, but haven't verified, that the update added more recent chats).
LMSYS is non-agentic though, right? Would be cool to have a dataset of production agent use transcripts.
Correct, and agreed.