I have similar concerns regarding the ligand sets used to test AlphaFold3. I’ve had a cursory look at them and it seemed to me there were a lot of phosphate-containing molecules, a fair few sugars, and also some biochemical co-factors. I haven’t done a detailed analysis, so this comes with caveats. But if true, there are two points here. Firstly, there will be a lot of excellent crystallographic training material available on these essentially biochemical entities, so AlphaFold3 is more likely to get these particular ones right. Secondly, these are not drug-like molecules, and docking programs are generally parameterized to dock drug-like molecules correctly, so they are likely to have a lower success rate on these structures than on drug-like molecules.
I think a more in-depth analysis of the performance of AF3 on the validation data is required, as the OP suggests. The problem here is that biochemical space, which is very well represented by experimental 3D structures, is much smaller than potential drug-like chemical space, which is comparatively poorly represented. So inevitably AF3 will often be operating outside its domain of applicability for any new drug series. There are ways of getting round this data restriction, including creating physics-compliant hybrid models (and thereby avoiding clashing atoms). I’d be very surprised if such approaches are not currently being pursued.
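To make the "clashing atoms" point concrete, here is a minimal sketch of the kind of steric check a physics-compliant post-filter might apply to a predicted pose. It is only an illustration: the function name `find_clashes`, the atom representation (element symbol plus coordinates in Å) and the 0.4 Å tolerance are my own assumptions, not anything taken from AF3 or from a particular docking program. The radii are the standard Bondi van der Waals values.

```python
from math import dist

# Bondi van der Waals radii in angstroms for a few common elements.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "P": 1.80, "S": 1.80}


def find_clashes(ligand_atoms, protein_atoms, tolerance=0.4):
    """Return (ligand_index, protein_index, distance) for clashing atom pairs.

    Each atom is a tuple (element_symbol, (x, y, z)). A pair is flagged when
    the atoms are closer than the sum of their vdW radii minus `tolerance`.
    The tolerance is an assumed value, chosen so that ordinary hydrogen-bond
    heavy-atom distances (around 2.8 angstroms) are not reported as clashes.
    """
    clashes = []
    for i, (elem_l, xyz_l) in enumerate(ligand_atoms):
        for j, (elem_p, xyz_p) in enumerate(protein_atoms):
            d = dist(xyz_l, xyz_p)
            limit = VDW_RADII[elem_l] + VDW_RADII[elem_p] - tolerance
            if d < limit:
                clashes.append((i, j, d))
    return clashes


# Hypothetical toy pose: one ligand carbon, two protein atoms.
ligand = [("C", (0.0, 0.0, 0.0))]
protein = [("O", (1.5, 0.0, 0.0)), ("N", (4.0, 0.0, 0.0))]
clashes = find_clashes(ligand, protein)
```

In this toy pose the carbon sits 1.5 Å from the oxygen, well inside the contact limit, so that pair is flagged, while the nitrogen at 4.0 Å is not. A real hybrid model would presumably go further and relax the pose with a force field instead of just rejecting it.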
There is an additional important point that needs to be made. AlphaFold3 is using predominantly “positive” data. By this I mean the training data encapsulates considerable knowledge of favourable atom-atom or group-group interactions, from which relative propensities can be deduced. But “negative” data, in other words repulsive electrostatic or van der Waals interactions, are only encoded by absence, because these are naturally not often found in stable biochemical systems. There are no relative propensities available for these interactions. So AF3 can be expected to perform less well when applied to real-world drug design problems, where such interactions have to be taken into account and balanced against each other and against favourable interactions. Again, this issue can be mitigated by creating hybrid physics-compliant models.
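The "encoded by absence" problem can be shown with a toy calculation. The sketch below uses the log-odds form common in knowledge-based scoring functions (observed contact count over the count expected from a random reference state). This is my own illustration of the statistical point, not a description of how AF3 is trained, and the contact types and counts are entirely made up.

```python
from math import log


def log_propensities(observed, expected):
    """Log-odds propensity per contact type: ln(observed / expected).

    Returns None where a contact type was never observed. Zero observations
    only say the contact is rare; they give no finite estimate of how
    unfavourable it is relative to other unobserved contacts.
    """
    result = {}
    for pair, exp_count in expected.items():
        obs = observed.get(pair, 0)
        result[pair] = log(obs / exp_count) if obs > 0 else None
    return result


# Hypothetical counts from an imagined set of crystal structures.
observed = {"donor-acceptor": 200, "cation-anion": 50}
expected = {"donor-acceptor": 100, "cation-anion": 50, "anion-anion": 40}

propensities = log_propensities(observed, expected)
```

Here the favourable donor-acceptor contact gets a positive score, cation-anion comes out neutral, and the repulsive anion-anion contact gets no score at all. That missing number is exactly what a physics-based term would have to supply.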
It is also worth noting that ligand docking is not generally considered a high-accuracy technique and these days is often used as a first-pass screen of large molecular databases. The hits from docking are then further assessed using a more accurate physics-based method such as Free Energy Perturbation.