Not sure if this was deliberate on your part, but:
As a simple example, consider a graph G that calculates whether 1/x0 < 1/x1, where x is an input array of length 2 (x=[x0,x1]). … We create a hypothesis that claims that the graph calculates x0>x1. … This is technically a correct hypothesis as this is indeed what the graph computes.
This is only correct as long as x0 and x1 are both > 0 or both < 0 (for x0 = -1, x1 = 1 the reciprocals satisfy -1 < 1, yet x0 > x1 is false), which illustrates your point that
So while all the intensionally different implementations behave identically on the test set, they may behave very differently on out-of-distribution samples.
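Here is a quick sketch of the discrepancy, assuming the graph really computes 1/x0 < 1/x1 on nonzero inputs. The function names and sample points are mine, not from the post:

```python
def graph(x0, x1):
    # What the graph computes, assuming it is 1/x0 < 1/x1 (inputs must be nonzero).
    return 1 / x0 < 1 / x1


def hypothesis(x0, x1):
    # The hypothesis quoted from the post: the graph computes x0 > x1.
    return x0 > x1


# Hypothetical "test set": both inputs share a sign.
same_sign = [(1.0, 2.0), (3.0, 0.5), (-1.0, -2.0), (-4.0, -0.5)]
# Hypothetical out-of-distribution samples: the signs differ.
mixed_sign = [(-1.0, 1.0), (2.0, -3.0)]

agree_same = all(graph(a, b) == hypothesis(a, b) for a, b in same_sign)
agree_mixed = [graph(a, b) == hypothesis(a, b) for a, b in mixed_sign]

print(agree_same)   # True: the two agree on every same-sign pair
print(agree_mixed)  # [False, False]: they disagree once the signs differ
```

So the two implementations are indistinguishable if the test set only contains same-sign pairs, and they come apart as soon as the signs differ.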