Second insight:
If you can find Luigi and Waluigi in the behavior vector space, then you have a helpful direction to nudge the AI towards. You nudge it in the direction of Luigi - Waluigi.
You need to do this for all (x,y) pairs of Luigis and Waluigis. How do you enumerate all the good things in the world with their evil twins, and then somehow compare the internal embedding shift against all of these directions? Is that even feasible? You probably would just get stuck.
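For one family of pairs, the nudge itself is easy to write down; the enumeration is the hard part. Here is a minimal sketch, assuming you already have paired embedding vectors for the good and evil versions of a behavior. The function names (`steering_direction`, `nudge`, `shift_along`) and the synthetic data are made up for illustration and do not come from any particular model or library.

```python
import numpy as np

def steering_direction(luigi, waluigi):
    """Unit vector along the mean of (luigi - waluigi) over paired rows."""
    diff = (luigi - waluigi).mean(axis=0)
    return diff / np.linalg.norm(diff)

def nudge(hidden, direction, alpha=1.0):
    """Move a hidden state by alpha along the steering direction."""
    return hidden + alpha * direction

def shift_along(before, after, direction):
    """How far an embedding shift moved along a given direction."""
    return float(np.dot(after - before, direction))

# Synthetic stand-in data (assumption): each pair differs only along axis 0.
rng = np.random.default_rng(0)
dim = 8
true_axis = np.zeros(dim)
true_axis[0] = 1.0
base = rng.normal(size=(16, dim))
luigi = base + true_axis
waluigi = base - true_axis

direction = steering_direction(luigi, waluigi)
hidden = rng.normal(size=dim)
nudged = nudge(hidden, direction, alpha=0.5)
shift = shift_along(hidden, nudged, direction)
```

This only covers one direction. Comparing a shift against every (Luigi, Waluigi) pair would mean repeating `shift_along` once per direction, which is exactly where the enumeration problem bites.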
Disclaimer: I have not read Wentworth’s post or the linked one, but I know (a little) about finite-sample and asymptotic bounds.
I think the key point of the statement is “any finite-entropy function f(X)”. This makes sure that the “infinity” in the sampling goes away. That being said, it should be possible to extend the proof to non-independent samples; Cosma Shalizi has done a ton of work on this.