This is just a feeling, but it seems like human-style looking closer is different from using a tool. When I want to count the letters in a word, I don’t pull out a computer and run a Python program; I just look at the letters. What LLMs are doing seems different, since they both can’t see the letters and can’t really ‘take another look’ (attention runs in parallel). Although reasoning sometimes works like taking another look.
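For concreteness, here’s a toy version of the tool-style route meant above; the function name and example word are just illustrative:

```python
# The "tool" way to count letters, the thing a human wouldn't bother with:
def count_letter(word: str, letter: str) -> int:
    """Count how many times `letter` appears in `word`."""
    return sum(1 for c in word if c == letter)

print(count_letter("strawberry", "r"))  # 3
```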
It’s not entirely clear to me. An LLM’s most immediate and direct action is outputting tokens, and you can have tool calls triggered by single tokens.
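A minimal sketch of what that could look like in a decoding loop, with made-up token ids (nothing here is from a real tokenizer or API):

```python
# Hypothetical reserved ids: one token in the output stream dispatches
# a tool instead of being decoded as text.
TOOL_TOKENS = {
    50001: lambda word: str(word.count("r")),  # made-up "count r's" tool
}

def handle_token(token_id: int, context: str) -> str:
    tool = TOOL_TOKENS.get(token_id)
    if tool is not None:
        return tool(context)        # the single token *is* the tool call
    return f"<tok:{token_id}>"      # ordinary token: decoded as text

print(handle_token(50001, "strawberry"))  # -> "3"
print(handle_token(12345, "strawberry"))  # -> "<tok:12345>"
```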
I think you can train LLMs to use tools in a way where the tools are best thought of as a human moving their arm or focusing their eyes somewhere.
I don’t know if it can reach the same level of integration as human attention, but again, I think that’s not really what we need here.