I’ve done some baselining on MirrorCode, which is similar. I reckon I am (or at least was) a decent software engineer. Turns out it’s really hard to re-implement a byte-for-byte copy of a program which isn’t trivial or really well specified. And even the well-specified ones have quirks. My approach was usually to set up Hypothesis as a test fuzzer in order to find as many edge cases as possible. This helps a lot. Still, I don’t think I ever got a full 100% on the final score, and a large portion of the time spent on creating MirrorCode went into a vetting process to filter out any programs that are “unfair”.
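To make the fuzzing idea concrete, here is a minimal sketch of a differential test: feed the same random inputs to the original and to the copy, and stop at the first disagreement. It uses the standard library's `random` rather than Hypothesis (which would add smarter generation and shrinking), and `reference`, `reimplementation`, and `find_mismatch` are hypothetical names made up for illustration, not anything from MirrorCode.

```python
import random

ALPHABET = "ab' "  # tiny alphabet so quirky inputs show up quickly


def reference(text: str) -> str:
    # Stand-in for the original program. str.title() restarts capitalisation
    # after any non-letter, so "it's" becomes "It'S".
    return text.title()


def reimplementation(text: str) -> str:
    # The "clean" rewrite: capitalise the first character of each
    # space-separated word and lowercase the rest.
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))


def find_mismatch(trials: int = 2000, seed: int = 0):
    # Return the first random input on which the two disagree, or None.
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 6)
        text = "".join(rng.choice(ALPHABET) for _ in range(length))
        if reference(text) != reimplementation(text):
            return text
    return None


if __name__ == "__main__":
    bad = find_mismatch()
    print(repr(bad), repr(reference(bad)), repr(reimplementation(bad)))
```

The two functions agree on ordinary words and only diverge once an apostrophe appears, which is exactly the kind of case that is easy to miss by hand.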
Most programs are buggy. A faithful copy of such a program requires understanding the intent behind the program, working out exactly what is causing a given bug, and then reproducing it. A couple of the less polished programs I baselined caused me a lot of frustration when it turned out I had to break my beautiful abstractions and flows to make a monstrosity that spits out the same values as the program being tested. Another fun source of mismatches is differences between systems. Floating point behaviour is one. The i18n settings of the underlying system are another. All in all, loads of opportunity for fun.
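As a small illustration of why floating point bites in byte-for-byte comparisons, here is a sketch of one mechanism: the same sum accumulated in two different ways. This is an assumed example, not taken from any MirrorCode program; `total_left_to_right` is a hypothetical helper.

```python
import math

values = [0.1] * 10


def total_left_to_right(xs):
    # Plain accumulation: rounding error builds up at every step.
    total = 0.0
    for x in xs:
        total += x
    return total


naive = total_left_to_right(values)
exact = math.fsum(values)  # tracks the lost low-order bits

# The values are "equal" for any numeric purpose...
print(math.isclose(naive, exact))  # True
# ...but the printed bytes differ, and a byte-for-byte check has no tolerance.
print(repr(naive))  # 0.9999999999999999
print(repr(exact))  # 1.0
```

If the original prints full-precision floats, the copy has to accumulate in the same order as well as compute the same quantity.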