Gemini 3.0 Pro is mostly excellent for code review, but sometimes misses REALLY obvious bugs. For example, it missed that a getter function doesn’t return anything, despite accurately reporting a typo in that same function.
This is odd considering how good it is at catching edge cases, version incompatibility errors based on previous conversations, and so on.
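To make the getter example concrete, here is a hypothetical Python sketch of that bug pattern. The post doesn't say which language or function was actually involved, so the `Account` class and all names below are invented for illustration.

```python
class Account:
    """Hypothetical class, invented purely to illustrate the bug pattern."""

    def __init__(self, balance):
        self._balance = balance

    def get_balance_buggy(self):
        # Two bugs in one getter:
        # 1. Typo: "_balence" instead of "_balance" (AttributeError when called).
        # 2. Missing return: the expression's value is simply discarded.
        self._balence

    def get_balance_typo_fixed(self):
        # The typo is fixed, but there is still no return statement,
        # so callers silently receive None.
        self._balance

    def get_balance(self):
        # Both problems fixed.
        return self._balance
```

A review that reports only the typo leads to the middle version, which runs without crashing but always returns `None`.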
I wonder if this is an artifact from the training data.
There are probably more edge-case bugs in published code (or even intermediate commits) than there are obvious bugs.
I wonder if it has to do with how the model allocates attention. If I dump in a whole 500-line module and say “inspect for bugs,” perhaps it has to spread attention over the entire file, so each area gets a relatively cursory inspection and it stops looking once it has found the first bug in a region. Or maybe it finds bugs that affect multiple regions and focuses on checking the implications of those rather than looking for new ones?
Complete speculation, of course, given the black-box nature of these models.
Maybe? But I can’t imagine that typos are that well represented, and it’s good at catching those. I run my code through the LLM before I even try running it, often because I haven’t written test cases yet and because it can catch errors in bulk, rather than going through the run → crash on first error → fix and rerun → crash on next error cycle.
So my code tends to contain bugs like typos, which in the pre-LLM era would typically have been caught before anyone asked Stack Overflow for help diagnosing logic errors and so on, and which therefore wouldn’t show up much in the training data.
Oh, I think there will be plenty of representation of typos in training data!
I meant bug reports that were due to typos in the code, compared to just typos in general.