It’s a good unhobbling eval, because it’s a task that should be easy for current frontier LLMs at the System 1 level; they fail only because some basic memory/adaptation faculties that humans have are outright missing from AIs right now. No longer failing would mark a milestone: such features would no longer be obviously absent (assuming the improvements come from general management of very long context, and not something overly eval-specific).