Actually nevermind, I think I see a counterexample to the exact formulation in the OP. I haven’t fully formalized it, so there may be holes, but here it is:
Imagine a “toxic piece of shared information”: some type of information such that it inextricably ties together some shared bits of information $S$ with unique bits of information $U$, making it impossible to define a variable that contains only the shared bits from it. If all variables in a set $X$ are “toxic” in this way, then any redund $R$, even one that straight-up copies one of the $X_j$, would necessarily have $I(R; X_{-i} \mid X_i) \ge U$ for all $X_i$ except $X_j$.
Suppose we fix a specific $\epsilon_2$, i.e., state that only variables with fewer unique bits than that qualify as redunds.
Now consider a set of variables $A^n$, each variable $A^n_i$ of which is a tuple of $n$ independent clones of toxic-information-containing variables, $\{X^1_i, \dots, X^n_i\}$. Any variable which is a redund over only one of the clones is a valid redund over $A^n$, and so there are $n$ redunds like this, all mutually independent.
What would a maximal redund $\Omega_n$ look like? Well, to properly screen all other redunds off, it would need to include most of the information from each individual independent redund, and so its $I(\Omega_n; A^n_{-i} \mid A^n_i)$ would scale with $n$ (probably roughly equaling $nU$?). We can then set $n$ arbitrarily high, so that this quantity shoots well past any fixed $\epsilon_2$. And if we discard some $D$ bits from each redund to try and manage this, this would increasingly work against our ability to screen off redunds from $A^n_{-i}$, and we can again scale $n$ to make any $\Omega_n$ that tries to keep its $I(\Omega_n; A^n_{-i} \mid A^n_i)$ under control fail arbitrarily hard at the screen-off task for all-but-a-few redunds.
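(A sketch of why the scaling should be linear, under the assumption, in my notation rather than anything from the OP, that $\Omega_n$ is just the tuple $(R^1, \dots, R^n)$ of per-clone redunds, where $R^j$ is a redund built from clone $j$ alone: the clones are mutually independent, so conditional mutual information is additive across them,

$$I(\Omega_n; A^n_{-i} \mid A^n_i) = \sum_{j=1}^n I(R^j; X^j_{-i} \mid X^j_i) \ge nU$$

for any $i$ other than the coordinate the $R^j$s copy (say each $R^j = X^j_1$, in which case this holds for all $i \neq 1$), since toxicity makes each summand at least $U$ there.)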
Concrete example: I think a binary symmetric channel should work here? I.e., set $X = \{X_1, X_2\}$, where $X_2 = \mathrm{XOR}(X_1, Z)$, $X_1$ is a fair coin, and $Z$ is a biased coin with bias $p$. We get $H(X_1) = H(X_2) = 1$, $I(X_1; X_2) = 1 + p \log_2 p + (1-p) \log_2 (1-p) = K$, and $H(X_1 \mid X_2) = H(X_2 \mid X_1) = 1 - K$. Then $R = X_1$ has $I(X_1; X_2 \mid R) = 0$, so it contains all $K$ shared bits, and $I(R; X_1 \mid X_2) = 1 - K$, so it qualifies as a redund if we set $\epsilon_2 \ge 1 - K$. Then, per the construction above, we design a set whose variables are tuples of those basic variables' independent clones, and watch the maximal redund explode.
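Here's a minimal numeric sanity check of those quantities (my own script, with $p = 0.1$ as an arbitrary choice of bias; it just enumerates the four-state joint distribution):

```python
import numpy as np

p = 0.1  # bias of Z; arbitrary choice for illustration

# Enumerate the joint distribution of (X1, X2): X1 fair, X2 = X1 XOR Z.
joint = np.zeros((2, 2))
for x1 in (0, 1):
    for z in (0, 1):
        joint[x1, x1 ^ z] += 0.5 * (p if z == 1 else 1 - p)

def H(dist):
    """Shannon entropy in bits of an array of probabilities."""
    d = np.asarray(dist, dtype=float).ravel()
    d = d[d > 0]
    return -np.sum(d * np.log2(d))

H_X1 = H(joint.sum(axis=1))                # = 1
H_X2 = H(joint.sum(axis=0))                # = 1
K = H_X1 + H_X2 - H(joint)                 # I(X1; X2)
K_formula = 1 + p * np.log2(p) + (1 - p) * np.log2(1 - p)

print(K, K_formula)   # both ~0.531 for p = 0.1
print(H_X1 - K)       # H(X1 | X2) = 1 - K, ~0.469

# For R = X1: I(X1; X2 | R) = 0 (we're conditioning on X1 itself), and
# I(R; X1 | X2) = H(X1 | X2) = 1 - K, so R qualifies iff epsilon_2 >= 1 - K.
```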
The subtle caveat here is that I think we need to prove that, for any variable $\Omega$, decreasing the amount of unique information (i.e., driving $I(\Omega; X_1 \mid X_2)$ down) trades off against shared information (i.e., $I(X_1; X_2 \mid \Omega)$ ends up increasing), which leads to $I(X_1; R \mid \Omega)$ growing. Specifically, we need to prove that this happens at some nontrivial, non-asymptoting rate that leads $\Omega_n$'s $\epsilon_1$ in the cloned construction to scale with $n$, well past any small constant multiple of $\epsilon_2$.
The counter here would be a continuum of $\Omega$s which can drive $I(\Omega; X_1 \mid X_2)$ arbitrarily low while the corresponding increase in $I(X_1; X_2 \mid \Omega)$ flatlines. The full counter would prove that this is possible in the general case, not only in the BSC setting.
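In the BSC setting the tradeoff is easy to eyeball numerically. A sketch, using one particular (and by no means exhaustive) family of $\Omega$s of my own choosing, $\Omega = \mathrm{XOR}(X_1, W)$ with $W$ a coin of bias $q$: as $q$ goes from $0$ to $0.5$, the unique information $I(\Omega; X_1 \mid X_2)$ falls from $1 - K$ to $0$ while the missed shared information $I(X_1; X_2 \mid \Omega)$ climbs from $0$ to $K$.

```python
import numpy as np

def H(dist):
    d = np.asarray(dist, dtype=float).ravel()
    d = d[d > 0]
    return -np.sum(d * np.log2(d))

def cond_mi(P, a, b, c):
    """I(A; B | C) for a 3-axis joint P with A, B, C on axes a, b, c."""
    return (H(P.sum(axis=b)) + H(P.sum(axis=a))
            - H(P) - H(P.sum(axis=(a, b))))

p = 0.1  # bias of Z, as in the example above

# Joint over (X1, X2, Omega): X2 = X1 XOR Z, Omega = X1 XOR W, W ~ coin(q).
def joint(q):
    P = np.zeros((2, 2, 2))
    for x1 in (0, 1):
        for z in (0, 1):
            for w in (0, 1):
                pz = p if z == 1 else 1 - p
                pw = q if w == 1 else 1 - q
                P[x1, x1 ^ z, x1 ^ w] += 0.5 * pz * pw
    return P

for q in (0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    P = joint(q)
    unique = cond_mi(P, 2, 0, 1)  # I(Omega; X1 | X2): Omega's unique info
    lost = cond_mi(P, 0, 1, 2)    # I(X1; X2 | Omega): shared info Omega misses
    print(f"q={q:.2f}  unique={unique:.3f}  lost={lost:.3f}")
```

For this one family the two quantities trade off smoothly with no flatlining, which is consistent with the conjecture but of course proves nothing about the general case; a proof would have to rule out every family of $\Omega$s.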
That it’s impossible seems intuitively obvious (consider what possibilities it’d unlock if that worked! the universe isn’t that kind), but it may be annoying to prove, and the result wouldn’t be very exciting, so it’s a low priority for me.[1] Still, throwing this out here in case you find this reasoning convincing.
(My guess is that “maximal redunds are possible for any sets of random variables” is too strong, and we need to explicitly limit it to sets of variables that are approximately structured as sparse hierarchical graphs or something. This is probably okay; see my previous thoughts about the universe’s physics selecting for that.)
Here’s o3’s crack at it if you want a starting point. I’ve found it’s sometimes helpful, but I’m currently fatigued from trying to paranoidally spot the place where it’s trying to slip crap past me, so I’ve not verified it. Looks right at a glance, if you ignore some weird parts.
Some more spitballing regarding how to fix this:
One thing here is that, in the relative sense, $\Omega_n$ isn’t any more “guilty” of caring about specific variables than any one $X^j_i$-copying $R$, given that it captures much more shared information about the $A^n_i$s than those $R$s do. So, to suit, maybe we should allow $\epsilon_3$ to scale with the amount of shared information in the max-redund? Which would be, uhh, $SH(\Omega) = H(\Omega) - H(\Omega \mid X) - \max_i I(\Omega; X_{-i} \mid X_i)$ or something.
Generalizing, maybe it should be “permitted” for redunds to contain more unique information the more shared information they have? Which would make the tuning parameter for “how pure of a redund $R$ is” not $\max_i I(R; X_{-i} \mid X_i)$ (aka your $\epsilon_2$), but $\frac{SH(R)}{H(R)}$, i.e., the fraction of the redund’s total entropy[1] that is shared information. That makes intuitive sense to me.
Or its $X$-relevant entropy? Subtract $H(R \mid X)$ from the denominator too, then.
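For concreteness, here's what those purity measures come out to for the BSC example's $R = X_1$ (my own script; $SH$ and both ratios are just the formulas above, transcribed):

```python
import numpy as np

def H(dist):
    d = np.asarray(dist, dtype=float).ravel()
    d = d[d > 0]
    return -np.sum(d * np.log2(d))

def cond_mi(P, a, b, c):
    """I(A; B | C) for a 3-axis joint P with A, B, C on axes a, b, c."""
    return (H(P.sum(axis=b)) + H(P.sum(axis=a))
            - H(P) - H(P.sum(axis=(a, b))))

p = 0.1  # bias of Z, as before

# Joint over (X1, X2, R) with R = X1, i.e. R literally copies X1.
P = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for z in (0, 1):
        P[x1, x1 ^ z, x1] += 0.5 * (p if z == 1 else 1 - p)

H_R = H(P.sum(axis=(0, 1)))             # H(R) = 1
H_R_given_X = H(P) - H(P.sum(axis=2))   # H(R | X) = 0, since R is a copy

max_unique = max(cond_mi(P, 2, 1, 0),   # I(R; X2 | X1) = 0
                 cond_mi(P, 2, 0, 1))   # I(R; X1 | X2) = 1 - K

SH = H_R - H_R_given_X - max_unique     # = K, the shared bits
print(SH / H_R)                         # purity vs. total entropy
print(SH / (H_R - H_R_given_X))         # purity vs. X-relevant entropy
```

Both ratios come out to $K$ here, since $H(R \mid X) = 0$ for a literal copy; the two denominators would only diverge for an $R$ carrying entropy that isn't about $X$ at all.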