Does the discriminator get access to the symbolic representation of f, or just its behavior?
If the discriminator only gets access to the behavior of f, I might have a solution. Define g(y,z) = 1 if y = y * z, 0 otherwise. So g(y,z) is 1 if z is 1 or y is 0, which is two different mechanisms.
Pick some way to encode two numbers into one, so we have a one-to-one mapping E between pairs (y, z) and numbers x. Define f1(x) = g(E^-1(x)), i.e. decode x back into a pair and apply g to it.
Now pick a cryptographic hash H that might as well be sha256, except it should be some fresh algorithm not known to the discriminator and not guessable by it. Define f(x) = f1(H(x)).
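A minimal sketch of the construction in Python. The particular pairing E (splitting x into bit halves) and the use of sha256 as a stand-in for the fresh, unguessable hash are illustrative assumptions, not part of the proposal itself:

```python
import hashlib

def g(y: int, z: int) -> int:
    # 1 iff y == y * z, which holds when z == 1 or y == 0: two mechanisms.
    return 1 if y == y * z else 0

def E(y: int, z: int) -> int:
    # One illustrative bijection between pairs and numbers: pack z into the
    # low 128 bits (assumes z < 2**128, fine for a sketch).
    return (y << 128) | z

def E_inv(x: int) -> tuple[int, int]:
    return x >> 128, x & ((1 << 128) - 1)

def f1(x: int) -> int:
    return g(*E_inv(x))

def H(x: int) -> int:
    # Stand-in: the text calls for a fresh hash unknown to the discriminator;
    # sha256 is used here purely for illustration.
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")

def f(x: int) -> int:
    return f1(H(x))
```

Behaviorally, f looks like a black box that outputs 1 on a sparse, hash-scrambled set of inputs; the two mechanisms (z = 1 versus y = 0) are only visible after decoding.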
Does that do what you want? If not, how does it fail?
f in the original problem stands for the AI's estimate of "are the humans getting what they want?", and the two mechanisms are the humans really getting what they want versus something insane and unwanted that manages to look that way. It seems unfair for the humans to encrypt their minds to make this difficult. The AI will have access to brain scans, so the discriminator should have access to the symbolic representation of f; therefore, if we assume an unguessable H, we are solving the wrong problem. Agreed?
Another problem is that, with the f I stated, it will be hard for an AI to find an x such that f(x) = 1. That is easily fixable: redefine g so it clears all but the low 5 bits of y and z before doing the multiplication. Now the AI can find such an x by trial and error within 1000-ish trials.
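A sketch of the low-5-bit variant with the trial-and-error search. As before, sha256 and the bit-halves pairing are stand-ins I chose for illustration, not part of the original statement:

```python
import hashlib

def g_low5(y: int, z: int) -> int:
    # Keep only the low 5 bits before testing y == y * z, so a random pair
    # satisfies the relation much more often (z & 31 == 1 or y & 31 == 0).
    y &= 31
    z &= 31
    return 1 if y == y * z else 0

def f_low5(x: int) -> int:
    # sha256 stands in for the fresh hash not known to the discriminator.
    h = int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
    y, z = h >> 128, h & ((1 << 128) - 1)
    return g_low5(y, z)

def find_hit(limit: int = 100_000) -> int:
    # Trial and error: try candidate inputs until f(x) = 1.
    for x in range(limit):
        if f_low5(x) == 1:
            return x
    raise RuntimeError("no hit found within limit")
```

With only the low 5 bits of y and z in play, a random input succeeds with probability roughly 1/16, so the search terminates quickly.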
Eby hasn’t updated his blog https://thinkingthingsdone.com/ in 10 years. He didn’t even put a post there describing his wonderful next job or project, or his wonderful retirement. He is posting to Twitter, so he’s not dead.
If his advice worked for him, he wouldn’t be in that situation.