but bad at acting coherently. Most work requires agency like OSWorld, which may be why AIs can’t do the average real-world 1-hour task yet.
I’d have guessed that poor performance on OSWorld is mostly due to poor vision and mouse manipulation skills, rather than insufficient ability to act coherently.
I’d guess that typical self-contained 1-hour tasks (as in, a human professional could do them in 1 hour with no prior context except context about the general field) also often require vision or non-text computer interaction, and if they don’t, I bet the AIs actually do pretty well.