I think what’s going on is that large language models are trained to “sound smart” in live conversations with users, so they prefer to highlight possible problems rather than confirm that the code looks fine, just as people do when they want to sound smart.
This matches my experience, but I’d be interested in seeing proper evals of this specific point!