I’m skeptical and/or confused about the Video-MME results:
You show Gemini 2.5 Pro’s horizon length as ~5,000 minutes, or ~80 hours. However, the longest videos in the benchmark are 1 hour long (the "long" category ranges from 30 min to 1 hr). Presumably you’re backing out the 50% horizon length under some assumptions, and because Gemini 2.5 Pro’s performance is 85%, you end up with an 80-160x multiplier on the longest task length! If that is what you are doing, it feels wrong/dubious to me.
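To make the "backing out" concern concrete, here is a minimal sketch of the kind of extrapolation I’m guessing is happening, assuming a logistic success curve in log task length; the slope values are my own illustrative guesses, not numbers from the post. The point is that the implied 50% horizon is extremely sensitive to the assumed slope when you extrapolate from a single high success rate on a 60-minute task:

```python
import math

def implied_h50(task_len_min: float, success_rate: float, slope: float) -> float:
    """Solve success_rate = sigmoid(slope * ln(h50 / task_len_min)) for h50."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return task_len_min * math.exp(logit / slope)

# 85% success on a 60-minute task, under a few assumed slopes:
for slope in (0.4, 0.6, 1.0):
    h50 = implied_h50(60, 0.85, slope)
    print(f"slope={slope}: implied 50% horizon ~{h50:.0f} min ({h50 / 60:.0f}x the 60-min task)")
```

Depending on the assumed slope, the same 85% data point implies anywhere from a ~6x to a ~75x extrapolation past the longest video in the benchmark, which is why backing out ~5,000 minutes from this data seems dubious to me.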
Based on how long these horizon lengths are, I’m guessing you assumed that answering a question about a 1-hour video takes a human 1 hour. That seems very wrong to me. I’d bet humans can typically answer these questions much faster by scrubbing through the video for the part where the question is likely answered and watching just that section. At a minimum, you can sometimes answer the question by skimming the transcript, and it should be possible to watch at 2x/3x speed. I’d guess the 1-hour video tasks take more like 5-10 minutes for a well-practiced human, and I wouldn’t be surprised by much shorter.
For this benchmark, (M)LLM performance seemingly doesn’t vary much with video duration, which undermines the idea that horizon length (at least horizon length based on video length) is a good measure on this dataset!
There was a unit conversion mistake; it should have been 80 minutes. Now fixed.
Besides that, I agree with everything here; these issues will all be fixed in the final blog post. I already looked at one of the 30 min-1 hr questions, and it appeared to be doable in ~3 minutes with the ability to ctrl-F transcripts; it would take longer without transcripts, though I don’t know how much longer.
In the next version I will probably use the no-captions AI numbers and time myself without captions to get a rough video speed multiplier, then possibly do better statistics that separate out domains with strong human-time-dependent difficulty from domains without it (like this and SWE-Lancer). A rough sketch of that rescaling is below.
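Here is a toy sketch of the rescaling I have in mind; the timed samples are hypothetical placeholders, not measurements, and the idea is simply to convert a video-length-based horizon into human-time units using a measured speed multiplier:

```python
# Hypothetical self-timed samples: (video length in minutes, my time-to-answer in minutes)
timed_samples = [
    (45, 6.0),
    (60, 8.0),
]

# Ratio of total video length to total human answering time.
speed_multiplier = sum(v for v, _ in timed_samples) / sum(t for _, t in timed_samples)

video_length_horizon_min = 80  # corrected figure from above
human_time_horizon_min = video_length_horizon_min / speed_multiplier
print(f"multiplier ~{speed_multiplier:.1f}x -> horizon ~{human_time_horizon_min:.0f} human-minutes")
```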
No captions feels very unnatural, because both LLMs and humans could first apply relatively dumb speech-to-text tools.