Can the extent of this ‘control’ be precisely and unambiguously measured?
No
How do we know if it’s working then?
We won’t, but we can get a general sense of whether it might be doing something at all using a bunch of proxies, like how robust and secure the system is to human attackers with much more time than the model has, and trying to train the model to attack the defenses in a controlled setting.