The Power of Noise
Recently Luke Muelhauser posed the question, “Can noise have power?”, which basically asks whether randomization can ever be useful, or whether for every randomized algorithm there is a better deterministic algorithm. This question was posed in response to a debate between Eliezer Yudkowsky and Scott Aaronson, in which Eliezer contends that randomness (or, as he calls it, noise) can’t ever be helpful, and Scott takes the opposing view. My goal in this essay is to present my own views on this question, as I feel many of them have not yet been brought up in discussion.
I’ll spare the reader some suspense and say that I basically agree with Scott. I also don’t think – as some others have suggested – that this debate can be chalked up to a dispute about the meaning of words. I really do think that Scott is getting at something important in the points he makes, which may be underappreciated by those without a background in a field such as learning theory or game theory.
Before I start, I’d like to point out that this is really a debate about Bayesianism in disguise. Suppose that you’re a Bayesian, and you have a posterior distribution over the world, and a utility function, and you are contemplating two actions A and B, with expected utilities U(A) and U(B). Then randomly picking between A and B will have expected utility , and so in particular at least one of A and B must have higher expected utility than randomizing between them. One can extend this argument to show that, for a Bayesian, the best strategy is always deterministic. Scott in fact acknowledges this point, although in slightly different language:
“Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.” -Scott Aaronson
I think this point is pretty iron-clad and I certainly don’t wish to argue against it. Instead, I’d like to present several examples of scenarios where I will argue that randomness clearly is useful and necessary, and use this to argue that, at least in these scenarios, one should abandon a fully Bayesian stance. At the meta level, this essay is therefore an argument in favor of maintaining multiple paradigms (in the Kuhnian sense) with which to solve problems.
I will make four separate arguments, paralleling the four main ways in which one might argue for the adoption or dismissal of a paradigm:
-
Randomization is an appropriate tool in many concrete decision-making problems (game theory and Nash equilibria, indivisible goods, randomized controlled trials).
-
Worst case analyses (which typically lead to randomization) are often important from an engineering design perspective (modularity of software).
-
Random algorithms have important theoretical properties not shared by deterministic algorithms (P vs. BPP).
-
Thinking in terms of randomized constructions has solved many problems that would have been difficult or impossible without this perspective (probabilistic method, sampling algorithms).
1. Concrete usefulness
Two people are playing rock-paper-scissors. After a while one of them starts to lose on average. What should they do? Clearly randomizing will make them stop losing so they should probably do that. I want to contrast this with one of Eliezer’s points, which, roughly, is, “clearly you should randomize if you’re up against some adversarial superintelligence, but in this case we should label it as ‘adversarial superintelligence’ rather than worst case”. Many have taken this as evidence that Scott and Eliezer are really just arguing about the definition of words. I find it difficult to support this conclusion in light of the above rock-paper-scissors example: you don’t need to be playing an adversarial superintelligence, you just need to be playing anyone who is better than you at rock-paper-scissors, which is a reasonably common occurrence.
A related situation shows up in many trade- and bargaining-related situations. For instances, suppose that I have a good that I value at $800, you value the same good at $1,000, but you only have $500 available to spend. We can still make a trade by agreeing that I will give you the good with probability 50%, in exchange for you paying me $450, leading us both to profit by $50 in expectation. This illustrates how randomness can be used to improve market liquidity. You might argue that we could have made the same trade without randomization by finding an event that we both agree has 50% subjective probability mass. But even if you are willing to do that, it is easy to see why another agent may not wish to—they may not wish to divulge information about their beliefs, they may not trust you to accurately divulge your own beliefs, they may not even represent their beliefs as probabilities in the first place, etc. Since agents with these properties are fairly natural, it seems necessary to randomize unless one is willing to give up on profitable trading opportunities.
As yet another example of concrete usefulness, let us consider randomized controlled trials—in other words, using randomness in an experimental design to ensure that we have a representative subset of a population in order to infer a particular causal mechanism. When it is possible to do so, randomization is an incredibly useful aspect of experimental design, freeing us to make strong inferences about the data that would be much more difficult or impossible to make with the same level of confidence without using randomization. For example, say that we want to know whether a job training program improves participants’ earnings. If we separate our sample into a treatment and control group using randomization, and we find a real difference in earnings between the two, there is a strong case that the job training program is the reason for the difference in earnings. If we use any other method to separate the two groups, we run the risk that our method for separating treatment from control was correlated with some other factor that can explain a difference in earnings.
If we follow the “randomness never helps” mantra here, then presumably we should carefully model all of the possible confounding effects in order to make inferences about the main effect—but often the purpose of experimentation is to better understand phenomena about which we currently know little, so that it is unreasonable to think that we can actually accurately model every possible confound in this way. In this situation, “randomness never helps” requires us to pretend to know the answer even in situations where we clearly do not, which seems to me to be problematic. This is despite the fact that there doesn’t seem to be any “superintelligence”—or even intelligent adversary—involved in such situations.
2. Design properties
As noted in the introduction, if we know the probability distribution of the environment, then the optimal strategy is always deterministic (for now, I ignore the issue of whether finding this optimal strategy is feasible, since we are talking about whether randomness helps “in principle”). On the other hand, if we do not know the distribution of the environment, there can be good reasons to randomize, since this can help to “smooth out” potential unexpected structure (examples include randomized quicksort as well as certain load-balancing algorithms). In other words, if we want to argue that randomness shouldn’t help, the most obvious alternative to using randomization would be to reason about the probability distribution of the environment, which seems to match what Eliezer has in mind, as far as I can tell. For instance, Eliezer says:
I certainly don’t say “it’s not hard work”, and the environmental probability distribution should not look like the probability distribution you have over your random numbers—it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming “worst case”. Optimizing for the worst case in environments that aren’t actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise.
Note that reasoning about the probability distribution of the environment requires assuming a particular distribution in the first place. I want to argue that this is often bad from a software design perspective. First, software is supposed to be modular. By this, I mean that software should be broken into components that:
1. Have a clearly specified and verifiable input-output contract.
2. Are re-usable in a wide variety of situations (for the most part, if the input-output contract can only be satisfied in one instance ever then it isn’t a very useful contract).
Pretty much everyone who works on software agrees with this. However, requiring that the inputs to a piece of software follow some probability distribution is the opposite of being modular. Why is this? First, it is impossible to verify from a single input whether the input contract is satisfied, since the contract is now over the distribution of inputs, rather than a single input. Second, requiring a certain distribution over inputs limits the software to the cases where the inputs actually follow that distribution (or a sufficiently similar distribution), thus greatly limiting the scope of applicability. Even worse, such a distributional requirement propagates backward through the entire system: if A calls B, which calls C, and C requires its inputs to follow a certain probability distribution, then B’s contract will probably also have to call for a certain probability distribution over inputs (unless all inputs to C from B are generated without reference to B’s inputs). On the other hand, software operating under worst-case analyses will (by definition) work regardless of input, and will not generate assumptions that propagate back to the rest of the system. Therefore it is important to be willing to adopt a worst-case perspective (or at the very least, something weaker than assuming a full distribution over inputs), at least in this instance. If one wants to argue that the randomness in this scenario is only useful practically but that something deterministic would “in principle” be better, then they would have to also argue that modularity of software is useful practically but something non-modular would “in principle” be better.
3. Theoretical Properties
The P vs. BPP question is the question of whether, whenever there is a polynomial-time randomized algorithm that behaves correctly on every individual input with probability 99.9999999999%, there is also a polynomial-time deterministic algorithm that behaves correctly on every input. If randomness was never useful, we would expect the answer to this question to be straightforwardly yes—after all, whatever the random algorithm was doing, we should be able to improve upon it, and since a deterministic algorithm is either right or wrong, the only way to improve upon 99.9999999999% is to go all the way up to 100%. In reality, however, this is a long-standing open problem in complexity theory that is generally believed to be true, but which is far from proven.
This point was already raised by Scott in his original thread, but did not seem to gain much traction. Several people were suspicious of the fact that the randomized algorithm only had to work almost all of the time on each input, while the deterministic algorithm had to work all of the time on each input. I have several responses to this. The first, somewhat flippant, response, is that we could of course only require that the deterministic algorithm work almost all of the time for each input, as well; but a deterministic algorithm can only get the answer right or wrong, so “almost always” and “always” amount to the same thing for a determistic algorithm.
Secondly, let us consider what it should mean for an algorithm to work “for all practical purposes”. If I have a randomized algorithm, and its probability of failure is so small as to be negligible compared to, say, cosmic rays hitting the computer and corrupting the output, then this seems to meet the bar of “for all practical purposes” (thus motivating the definition of BPP). On the other hand, what should “all practical purposes” mean for a deterministic algorithm? Some might argue that if, on average across inputs to the algorithm, the output is very likely to be correct, then that should satisfy the requirement—but this raises the question, what average should we use? An addition algorithm that thinks that 1+1=3 but is otherwise correct should surely not meet the “all practical purposes” bar, so the uniform distribution over inputs certainly shouldn’t be used. So, perhaps I should be figuring out what I expect my input distribution to look like, and measure according to that? But now, you see, we have added baggage that wasn’t in the original problem definition. All I said is that I had some abstract algorithmic problem that I wanted to solve for (some suitable definition of) all practical purposes. You’ve now told me that in order to do this, I need to figure out what the input distribution to the algorithm is going to be. But this is pretty unreasonable! Can you imagine an undergraduate computer science textbook full of algorithms that would work, but only if its inputs followed such-and-such a distribution? (I’m not claiming that such a guarantee is never useful, only disputing the notion that it should be raised up as the gold standard of algorithms on the same level as P and BPP.) I don’t think anyone would learn much from such a textbook, and computer science would be in a pretty woeful state if this was all we could offer. I therefore contend that the only really reasonable notion of deterministically solving a problem for all practical purposes is that it should always work correctly on all possible inputs, thus leading to the complexity class P. This is all a long-winded way of saying that the P vs. BPP question really is the right way to formally ask whether randomness can help from an algorithmic standpoint.
I’d also like to specially examine what happens if we use a Solomonoff prior (assigning probability to programs of length k). If there is any program of length L that solves the problem correctly, then any deterministic algorithm of length M that is incorrect with probability much less than with respect to the Solomonoff prior must always work (since we can take the program that runs the two algorithms against each other until it finds a discrepancy, then uses that as the input). Therefore, “almost always” with respect to Solomonoff is more or less the same as “always”, so we haven’t really done anything different than redefine P in a very roundabout way. Crucially, this means that even if we are fully Bayesian, under a Solomonoff prior the question of whether every randomized polynomial-time algorithm has a deterministic counterpart still boils down to the unresolved and deep P vs. BPP question.
[Addendum: it looks like Peter de Blanc and Wei Dai have made this last point already, see here and here.)
4. Usefulness in problem-solving
There are important combinatorial properties, such as expansion properties of graphs, which are satisfied by random graphs but which we don’t have any deterministic constructions for at all (or in some cases, we do, but they came much later and were harder to come by). Many important problems in graph theory, Ramsey theory, etc. were solved by considering random combinatorial objects (this was one of the great triumphs of Paul Erdos) and thinking in purely deterministic terms seems very unlikely to have solved these problems. Taking a random/pseudorandom perspective has even led to some of the modern triumphs of mathematics such as the Green-Tao theorem (there are arbitrarily long arithmetic progressions) and is fundamental to theoretical cryptography, game theory, convex analysis, and combinatorial approximation algorithms. If we abandon the use of random constructions, then we also abandon some of the most important work in modern computer science, economics, and applied mathematics, which seems unacceptable to me.
On the algorithmic side, even if you are Bayesian, you want to somehow represent your posterior distribution, and one of the simplest ways to do so is by random samples. An alternative way is by a probability density function, but under extremely widely held complexity-theoretic assumptions there are many probability distributions that can be efficiently represented by samples but whose probability density function cannot be efficiently written down. More practically, the view of representing distributions by random samples has led to many important algorithms such as sequential Monte Carlo (SMC) and Markov Chain Monte Carlo (MCMC), which form some of the cornerstones of modern statistical inference (MCMC in particular has been called one of the 10 greatest algorithms of the 20th century). Again, abandoning this work on philosophical grounds seems unacceptable.
Of course, it is theoretically possible that all of the above gains will eventually be reproduced without using randomness. However, it seems important to note that many of these are among the most beautiful and elegant results in their respective fields. I would object to any characterization of them as “hacks” that come down to a poorly-understood, lucky, but useful result.
-
The obstacles to “derandomizing” such things appear to be quite deep, as extremely intelligent people have been working on them for decades without success. If one wants to say that a randomization-free, Bayesian approach is “correct even if not practical,” this becomes true for a very expansive definition of “practical,” and one must concede that being a Bayesian is often completely impractical even when a great deal of time and resources are available for reasoning.
-
Thinking in terms of randomness is extremely useful in arriving at these deep and important results, and one of the main purposes of a paradigm is to provide modes of thought that lead to such results. I would therefore argue that randomization should be a fundamental lens (among many) with which to approach problems.
Conclusion
I have given several examples of how randomness is useful in both commonplace and fundamental ways. A possible objection is that yes, randomness might be useful from a pragmatic perspective, but it is still the case that in theory we can always do better without randomness. I don’t think that such an objection answers the above arguments, except perhaps in some very weak sense. We saw how, in rock-paper-scissors, randomness is always going to be useful for one of the two players (unless both players are equally strong); we saw how using randomness is necessary to make certain profitable trades; and we saw how, even if we adopt a Solomonoff prior, the question of whether or not deterministic algorithms can compete with randomized algorithms is still open (and the likely resolution will rely upon constructing sufficiently pseudorandom numbers, rather than reasoning carefully about the distribution of inputs). Even if we discard all pragmatic concerns whatsoever, the above examples at the very least show that randomization is useful in a very deep and fundamental way. And importantly, at least from my perspective, we have also seen how randomization has been essential in resolving important problems across a variety of fields. This, if nothing else, is a compelling case that it should be adopted as a feature in one’s worldview, rather than discarded on philosophical grounds.
Acknowledgements
Thanks to Holden Karnofsky for reading an early version of this essay and suggesting the randomized controlled trials example. Thanks also to Sindy Li for suggesting the idea of using randomness to improve market liquidity.
- 24 Dec 2021 17:27 UTC; 7 points) 's comment on Worst-case thinking in AI alignment by (
- 14 Jan 2017 19:24 UTC; 4 points) 's comment on Building Cooperative Epistemology (Response to “EA has a Lying Problem”, among other things) by (EA Forum;
- 29 Oct 2015 4:08 UTC; 4 points) 's comment on The trouble with Bayes (draft) by (
- 2 Jun 2015 10:43 UTC; 0 points) 's comment on Approximating Solomonoff Induction by (
My economics PhD dissertation related to this. I proposed that if asymmetric information is a cause of parties failing to settle a lawsuit they could both improve their position using lotteries. Consider a simple example: I know that if we went to trial I would win at least $1 million, but trial would cost us both 200k. You don’t accept my settlement offer of $1 million because you think I’m lying about the strength of my case. So I propose the following, we flip a coin and if the coin comes up heads we settle for $1 million, whereas if it comes up tails we go to trial and if I fail to win at least $1 million I pay you a big penalty.
But then my dissertation concluded by saying that our failure to observe litigants using such lotteries is evidence against asymmetric information being a cause of parties failure to settle lawsuits.
My inner Yudkowsky says “or maybe it just hasn’t occurred to them”.
But I published my result in a prestigious journal in 1997 and told lots of high status people about it, and still no lotteries.
Did you ever ask anyone in a position to use a lottery why they wouldn’t? “People aren’t trying my idea” is evidence that it’s a bad idea, but weak evidence, preferably replaced by “People say they aren’t trying my idea because X” or “People aren’t trying my idea but can’t articulate why not” when possible.
I asked judge Richard Posner (one of my dissertation advisers) if he would be willing to use lotteries as a judge and he said no, it would get him impeached.
Interesting idea. Brazilian law explicitly admits lottery as a form of settling, but I’m not sure if that example with a penalty for not winning a lawsuit would be admissible.
The legal system is based on the legal fiction that the judge can infallibly make a decision. If the judge makes a decision in a way which is guaranteed to be fallible in a certain percentage of cases, he violates this assumption, even if the guaranteed fallibility from randomness is less than his normal fallibility when not using randomness.
Replace “adversarial superintelligence” with “adversarial game”, and I think you’ll get more agreement among the participants. There are plenty of cases where a “mixed strategy” is optimal. Note that this is not noise, and true randomness isn’t necessary—it only needs to be unpredictable by the opponent, not necessarily random.
Where you don’t have an opponent (or at least one that makes predictions), I’m with Eliezer: noise never helps at a fundamental level.
I do believe that randomness has a place in thinking about problems, and it’s easier (for humans) to reason about randomness than insanely-complex deterministic calculations. But that’s a problem with the reader of the map, not with the map nor the territory.
To my thinking, this is essentially equivalent to conjecturing that P = BPP, which is plausible but still might be false.
ETA: Didn’t read the post before replying to the parent (saw it in the sidebar). Now I see that a good quarter of the post is about P = BPP. Egg on my face!
There is another case where noise helps- threshold effects. If you have as signal below a threshold, a bit of noise can push the signal up into the detectable region.
Do you mean stochastic resonance? If so, good example!
(If not, it’s still a good example—it’s just my good example. ;-)
Also, dithering.
Several of my examples did not have opponents. To list three: the bargaining example, the randomized controlled trials example, and P vs. BPP.
Thanks for writing this! This debate was bugging me too; I don’t have the background to dive into it in detail as in this post but the fact that Eliezer was implicitly taking a pretty strong stance on P vs. BPP bugged me ever since I read the original LW post. This is also a great example of why I think LW needs more discussion of topics like computational complexity.
From a Bayesian perspective, a probability is a cognitive object representing the known evidence about a proposition, flattened into a number. It wouldn’t make sense to draw conclusions about e.g. the existence of certain graphs, just because we in particular are uncertain about the structure of some graph.
The “probabilistic method”, IMHO, is properly viewed as the “measure-theoretic method”, which is what mathematicians usually mean by “probabilistic” anyway. That is, constructions involving random objects can usually (always?) be thought of as putting a measure on the space of all objects, and then arguing about sets of measure 0 and 1 and etc. (I would be interested in seeing examples where this transformation is (a) not relatively straightforward or (b) impossible.) Although the math is the same up to a point, these are different conceptual tools. From Jaynes, Probability Theory:
Whether thinking in terms of randomness is a useful conceptual tool is a different question; personally, I try to separate the intuitions into Bayesian (for cognition) and measure theory (for everything else, e.g. randomized algorithms, quantum mechanics, etc.). It would be nice if these were one and the same, i.e. if Bayesian probability was just measure theoretic probability over sets of hypotheses, but I don’t know of a good choice for the hypothesis space. The Cox theorems work from basic desiderata of a probabilistic calculus, independent of any measure theory; that is the basis of Bayesian probability theory (see Jaynes, Chapter 2).
It seems that in the rock-scissors-paper example the opponent is quite literally an adversarial superintelligence. They are more intelligent than you (at this game), and since they are playing against you, they are adversarial. The RCT example also has a lot of actors with different conflicts of interests, especially money- and career-wise, and some can come pretty close to adversarial.
“adversarial superintelligence” sounds like something you don’t have to worry about facing pre-singularity. “someone who’s better than you at rock-paper-scissors” sounds rather more mundane. Using the former term makes the situation look irrelevant by sneaking in connotations.
What? There is very little software that doesn’t require inputs to follow some probability distribution. When provided with input that doesn’t match that (often very narrow) distribution programs will throw it away, give up, or have problems.
You seem to have put a lot more thought into your other points, could you expand upon this a little more?
Let me try to express it more clearly here:
I agree that it is both reasonable and common for programs to require that their inputs satisfy certain properties (or in other words, for the inputs to lie within a certain set). But this is different than requiring that the inputs be drawn from a certain probability distribution (in other words, requiring 5% of the inputs to be 0000, 7% to be 0001, 6% to be 0010, etc.). This latter requirement makes the program very non-modular because invoking a method in one area of the program alters the ways in which I am allowed to invoke the method in other areas of the program (because I have to make sure that the total fraction of inputs that are 0000 remains 5% across the program as a whole).
Does this make more sense or is it still unclear? Thanks for the feedback.
Well, I don’t know of a single piece of software which requires that its inputs come from specific probability distributions in addition to satisfying some set of properties.
In fact, wouldn’t that software object have to maintain some running counts to even be able to estimate from which distribution do its inputs come?
This is in some sense my point. If you want to be so hardcore Bayesian that you even look down on randomized algorithms in software, then presumably the alternative is to form a subjective probability distribution over the inputs to your algorithm (or perhaps there is some other obvious alternative that I’m missing). But I don’t know of many pieces of code that require their inputs to conform to such a subjective probability distribution; rather, the code should work for ALL inputs (i.e. do a worst-case rather than average-case analysis, which will in some cases call for the use of randomness). I take this to be evidence that the “never use randomness” view would call for software that most engineers would consider poorly-designed, and as such is an unproductive view from the perspective of designing good software.
Any software that uses randomness requires you to meet a probability distribution over its inputs, namely that the random input needs to be random. I assume that you’re not claiming that this breaks modularity, as you advocate the use of randomness in algorithms. Why?
Based on the discusssions with you and Lumifer, I updated the original text of that section substantially. Let me know if it’s clearer now what I mean.
EDIT: Also, thanks for the feedback so far!
The probability distribution part is better, though I still don’t see how software that uses randomness doesn’t fall under that (likewise: compression, image recognition, signal processing, and decision making algorithms).
If the software generates its own randomness (for instance, in JAVA you can do this by creating a new instance of a Random object and calling nextInt(), etc.) then I don’t see how this breaks modularity.
Compression algorithms like bzip don’t promise to make things smaller as part of their contract, they only promise to be lossless. To the extent that a user relies on them becoming smaller consistently, this can lead to trouble, for instance if I expect by inputs of size 10 kilobytes to compress down to 2 kilobytes, and then many of the inputs in a row stay at 10 kilobytes, I can get unexpected load that could create issues.
I don’t know of many image recognition algorithms that are integrated into a large system without a human in the loop, except perhaps Google Image Search which arguably has a human (the end-user) that filters the results at the end. This is precisely due to their non-modularity: they fail in unexpected and difficult-to-characterize ways, such that any attempt to enforce a contract or abstraction would be probably misguided at this point. The best they can currently promise is “we correctly implement the fast Fourier transform” and leave it up to the programmer to decide whether an FFT is what is merited in a given case.
ETA: Another way to put the bzip example which might fit more with your language, is that yes the guarantee of bzip is that it is lossless and that it will make your data smaller as long as the data fit the properties that bzip attempts to exploit, and if that is what we want the contract to be then I would agree that bzip is non-modular.
nitpick: Java default PRNG is a linear congruential generator. It’s not as bad as the infamous RANDU, but I won’t suggest to use it for anything that needs more than a hundred or so pseudo-random bits.
bzip is a great source of random bits.
:)
Do such people exist? I don’t know anyone who fits such a description.
Well, of course. But again, has anyone been arguing against that? The original dispute was talking about abstract optimality and whether particular solutions exist. No one suggested that in real-world practice at the current level of technology rand() Call Considered Harmful.
What about Eliezer?
Well, let’s put it this way. If you can implement a non-randomized solution that works within the specified constraints (time, etc.), it is preferable to a probabilistic one. The real question is what happens when you can’t. I doubt that Eliezer would refuse to implement a probabilistic solution on the grounds that it’s not pure enough and so no solution at all is better than a version tainted by its contact with the RNG.
An exception might be when you need to be absolutely sure what the software does or does not do and any randomness introduces an unacceptable risk. But such a case is very rare.
Sure, I agree with this. But he still seems to think that in principle it is always possible to improve over a randomized algorithm. But doing so requires having some knowledge of the distribution over the environment, and that would break modularity.
Whether or not Eliezer himself is basing this argument on Bayesian grounds, it certainly seems to be the case that many commenters are, e.g.:
Will_Pearson:
DonGeddis:
And some comments by Eliezer:
Eliezer_Yudkowsky:
Eliezer_Yudkowsky:
Another way of swapping around the question is to ask under what circumstances Jacob Steinhardt would refuse to use a PRNG rather than an RNG because the PRNG wasn’t random enough, and whether there’s any instance of such that doesn’t involve an intelligent adversary (or that ancient crude PRNG with bad distributional properties that everyone always cites when this topic comes up, i.e., has that happened more recently with an OK-appearing PRNG).
Obviously I don’t intend to take a stance on the math-qua-math question of P vs. BPP. But to the extent that someone has to assert that an algorithm’s good BPP-related properties only work for an RNG rather than a PRNG, and there’s no intelligent adversary of any kind involved in the system, I have to question whether this could reasonably happen in real life. Having written that sentence it doesn’t feel very clear to me. What I’m trying to point at generally is that unless I have an intelligent adversary I don’t want my understanding of a piece of code to depend on whether a particular zero bit is “deterministic” or “random”. I want my understanding to say that the code has just the same effect once the zero is generated, regardless of what factors generated the zero; I want to be able to screen off the “randomness” once I’ve looked at the output of that randomness, and just ask about the effectiveness of using a zero here or a one there. Furthermore I distrust any paradigm which doesn’t look like that, and reject it as something I could really-truly believe, until the business about “randomness” has been screened off and eliminated from the analysis. Unless I’m trying to evade a cryptographic adversary who really can predict me if I choose the wrong PRNG or write down my random bits someplace that someone else can see them, so that writing down the output of an RNG and then feeding it into the computation as a deterministic constant is genuinely worse because my adversary might sneak a look at the RNG’s output if I left it written down anywhere. Or I’m trying to randomize a study and prevent accidental correlations with other people’s study, so I use an RNG just in case somebody else used a similar PRNG.
But otherwise I don’t like my math treating the same bit differently depending on whether it’s “random” or “deterministic” because its actual effect on the algorithm is the same and ought to be screened off from its origins once it becomes a 1 or 0.
(And there’s also a deep Bayesian issue here regarding, e.g., our ability to actually look at the contents of an envelope in the two-envelope problem and update our prior about amounts of money in envelopes to arrive at the posterior, rather than finding it intuitive to think that we picked an envelope randomly and that the randomized version of this algorithm will initially pick the envelope containing the larger amount of money half the time, which I think is a very clear illustration of the Bad Confused Thoughts into which you’re liable to be led down a garden-path, if you operate in a paradigm that doesn’t find it intuitive to look at the actual value of the random bit and ask about what we think about that actual value apart from the “random” process that supposedly generated it. But this issue the margins are too small to contain.)
Is that helpful?
If your question is “Is there a known practical case not involving an intelligent adversarial environment where the use of a cryptographic PRNG or even a true RNG rather than a non-cryptographic PRNG is warranted?” Then the answer is no.
In fact, this is the reason why it is conjectured that P = BPP.
However, given the rest of your comment, it seems that you are referring to how we understand the theoretical properties of algorithms:
If we are talking about understanding the theoretical properties of many useful randomized algorithms, I’d say that we can’t “screen off” randomness.
Even if the algorithm is implemented using a PRNG with a constant seed, and is thus fully deterministic at the level of what actually runs on the machine, when we reason about its theoretical properties, whether it is a formal analysis or a pre-formal intuitive analysis, we need to abstract away the PRNG and assume that the algorithm has access to true randomness.
As it was already pointed out, if we were perfectly rational Bayesian agents, then, we would always be able to include the PRNG into our analysis:
For instance, given a machine learning problem, a Bayesian agent would model it as a subjective probability distribution, and it may conclude that for that particular distribution the optimal algorithm is an implementation of Random Forest algorithm with the Mersenne Twister PRNG initialized with seed 42.
For a slightly different subjective distribution, the optimal algorithm may be an implementation of a Neural Network trained with Error Backpropagation with weights initialized by a Linear Congruentlial PRNG with seed 12345.
For another slightly different subjective distribution, the optimal algorithm may be some entirely different deterministic algorithm.
In practice, we can’t reason this way. Therefore we assume true randomness in order to say meaningful things about many practically useful algorithms.
A more involved post about those Bad Confused Thoughts and the deep Bayesian issue underlying it would be really interesting, when and if you ever have time for it.
in principle are the key words.
As the old saying goes, in theory there is no difference between theory and practice but in practice there is.
You are using the word “modularity” in a sense weird to me. From my perspective, in the software context “modularity” refers to interactions of pieces of software, not inputs or distributions over the environment.
Based on the discusssions with you and trist, I updated the original text of that section substantially. Let me know if it’s clearer now what I mean.
Also, thanks for the feedback so far!
I think an example of what jsteinhardt is referring to would be quicksort. It can take an arbitrary list as an argument, but for many perversely ordered inputs it takes Omega(n^2). However, it does have an efficient average-case complexity of O(n log n). In other words, if the input is sampled from the uniform distribution over permutations the algorithm is guaranteed to finish in O(n log n) time.
Many of the other examples that were considered are similar, in that the algorithm doesn’t give an error when given an input outside of the expected distribution, but rather silently works less effectively
Quicksort as a piece of software does not require any particular distribution. It is perfectly happy to work with “perversely ordered inputs”.
A human running quicksort with certain expectations about its performance might require a particular distribution, but that’s not a characteristic of software.
Recall that the original claim was that expecting a particular distribution breaks modularity.
If a particular function was advertised as having O(n log(n)) running time, and it turns out that it actually runs in O(n^2) on a certain class of inputs, I would not feel it to be behaving in a very modular way. From an abstraction perspective, breaking time or resource assumptions on an input can be as bad or worse than rejecting it.
Granted, it’s pretty easy to get quicksort to behave on all but the most pathological data sets.
If you have been misled with respect to the behavior of an algorithm on a specific class of inputs, what does it have to do with modularirty??
Modularity refers to the independence of a piece of software from the particulars of other pieces of software. It has nothing to do with performance metrics.
When picking which algorithms to use for particular pieces of your task, performance metrics are a significant factor. If you are told that your algorithm for Section A of the task runs in O(n log n) time and you know that the Section B portion of the program, which requires a stream on inputs from Section A, will run in O(n (log n)^2), you can assume that Section B will be constantly supplied with inputs and the program can busy-wait for the brief portions where it does not have input. If it turns out that for your specific input distribution Section A is taking O(n^2), that’s a critical distinction with far-reaching effects on the design of the rest of your algorithm.
It’s somewhat harder to provide an example for non-networked, non-parallelized code, but I can do so on request.
I understand all that, but I’m reading it as “you should not make mistakes about the expected performance of your code and your application architecture should reflect your use cases.”
There are many ways to screw up a software system.
I think this may be a distinction without a difference; modularity can also be defined as human expectations about software, namely that the software will be relatively easy to hook into a larger system.
I don’t find this a useful definition, but YMMV, de gustibus, and all that...
So you’re differentiating between properties where the probability of [0 1 2 3] is 1-ɛ while >3 is ɛ and probability distributions where the probability of 0 is 0.01, 1 is 0.003, etc? Got it. The only algorithms that I can think of that require the latter are those that require uniformly random input. I don’t think those violate modularity though, as any are of the program that interfaces with that module must provide independently random input (which would be the straightforward way to meet that requirement with an arbitrary distribution).
There’s a difference between requiring and being optimized for though, and there are lots of algorithms that are optimized for particular inputs. Sort algorithms are a excellent example, if most of your lists are almost already sorted, there algorithms that are cheaper on average, but might take a long time with a number of rare orderings.
The randomized control trial is a great example where a superintelligence actually could do better by using a non-random strategy. Ideally, an AI could take its whole prior into account and do a value of information calculation. Even if it had no useful prior, that would just mean that any method of choosing is equally “random” under the the AI’s knowledge.
AI != perfectly rational agent
Ideally = perfectly rational agent
so why did you mention an ‘AI’?
Bayesian adaptive clinical trial designs place subjects in treatment groups based on a posterior distribution. (Clinical trials accrue patients gradually, so you don’t have to assign the patients using the prior: you assign new patients using the posterior conditioned on observations of the current patients.)
These adaptive trials are, as you conjecture, much more efficient than traditional randomized trials.
Example: I-SPY 2. Assigns patients to treatments based on their “biomarkers” (biological measurements made on the patients) and the posterior derived from previous patients.
When I heard one of the authors explain adaptive trials in a talk, he said they were based on multi-armed bandit theory, with a utility function that combines accuracy of results with welfare of the patients in the trial.
However, unlike in conventional multi-armed bandit theory, the trial design still makes random decisions! The trials are still sort of randomized: “adaptively randomized,” with patients having a higher chance of being assigned to certain groups than others, based on the current posterior distribution.
I don’t understand this comment at all.
JWW suggests that an AI could partition trial subjects into control and experimental groups such that expected number of events in both was equal, and presumably also such that cases involving assumptions were distributed equally, to minimize the impact of assumptions. For instance, an AI doing a study of responses to an artificial sweetener could do some calculations to estimate the impact of each gene on sugar metabolism, then partition subjects so as to balance their allele frequencies for those genes.
(A more extreme interpretation would be that the AI is partitioning subjects and performing the experiment not in a way designed to test a single hypothesis, but to maximize total information extracted from the experiment. This would be optimal, but a radical departure from how we do science. Actually, now that I think of it, I wrote a grant proposal suggesting this 7 years ago. My idea was that molecular biology must now be done by interposing a layer of abstraction via computational intelligence in between the scientist and the data, so that the scientist is framing hypotheses not about individual genes or proteins, but about causes, effects, or systems. It was not well-received.)
There’s another comment somewhere countering this idea by noting that this almost requires omniscience; the method one uses to balance out one bias may introduce another.
This doesn’t require omniscience, or AI: people do this now (based on info they have). If you have more info, we know how to use it (there is theory). Why are we talking about AI, this is a math problem.
Slightly harshly worded suggestion (not to you specifically): maybe more reading, less invocation of robo-Jesus in vain.
Eliezer claims that randomness is always bad; many other people claim that one way randomness is good is that it is unbiased. Partitioning subjects into experimental conditions must be unbiased. Using an algorithm and knowing that its biases are orthogonal to the phenomenon being investigated requires omniscience. Besides, if you knew in advance what was relevant, you wouldn’t need to do the experiment.
That is what the comment means. The use of the term “AI” is just to show that the claim is that no real-world agent can be smart enough to do unbiased partitioning in all cases, not just that we’re not smart enough to do it.
In practice, a possibly biased but intelligent partitioning is better when the sample size is small.
We know what property we want (that randomization will give you), good balance in relevant covariates between two groups. I can use a deterministic algorithm for this, and in fact people do, e.g. matching algorithms. Another thing people do is try all possible assignments (see: permutation tests for the null).
Discussion of AI and omniscience is a complete red herring, you don’t need that to show that you don’t need randomness for this. We aren’t randomizing for the sake of randomizing, we are doing it because we want some property that we can directly target deterministically.
I don’t think EY can possibly know enough math to make his claim go through, I think this is an “intellectual marketing” claim. People do this a lot, if we are talking about your claim, you won the game.
If you sort all the subjects on one criteria, it may be correlated in an unexpected way with another criteria you’re unaware of. Suppose you want to study whether licorice causes left-handedness in a population from Tonawanda, NY. So you get a list of addresses from Tonawanda New York, sort them by address, and go down the list throwing them alternately into control and experimental group. Then you mail the experimental group free licorice for a ten years. Voila, after 10 years there are more left-handers in the experimental group.
But even and odd addresses are on opposite sides of the street. And it so happens that in Tonawanda, NY, the screen doors on the front of every house are hinged on the west side, regardless of which way the house faces, because the west wind is so strong it would rip the door off its hinges otherwise. So people on the north side of the street, who are mostly in your experimental group, open the door with their left hand, getting a lot of exercise from this (the wind is very strong), while people on the south side open the screen door with their right hand.
It seems unlikely to me that many hidden correlations would survive alternating picks from a sorted list like this rigged example, but if the sample size is large enough, you’d still be better off randomizing than following any deterministic algorithm, because “every other item from a list sorted on X” has low Kolmogorov complexity and can be replicated by an unknown correlate of your observable variable by chance.
This is perhaps a useful place to illustrate the “randomness hath no power” argument: randomness is unbiased in expectation but we actually expect the absolute amount of biasedness for a randomly selected assignment to be nonzero. When biasing factors are known ahead of time, we do better by controlling for it directly (with, say, a paired assignment).
Exactly! This is a math problem! And it becomes a very complicated math problem very quickly as the prior information gets interesting.
There’s nothing magical about an AI; it can’t figure out anything a human couldn’t figure out in principle. The difference is the “superintelligence” bit: a superintelligent AI could efficiently use much more complicated prior information for experiment design.
I don’t understand the improvement you think is possible here. In a lot of cases, the math isn’t the problem, the theory is known. The difficulty is usually finding a large enough sample size,etc.
There is a lot of statistical literature on optimal experimental design, and it’s used all the time. Years ago at Intel, we spent a lot of time on optimal design of quality control measurements, and I have no doubt a lot of industrial scientists in other companies spend their time thinking about such things.
The problem is, information is a model dependent concept (derivatives of log-likelihood depend on the likelihood), so if your prior isn’t fairly strong, there isn’t a lot of improvement to be had. A lot of science is exploratory, trying to optimize the experimental design is premature.
Either way, this isn’t stuff you need an AI for at all, it’s stuff people talk about and think about now, today, using computer assisted human intellect.
Designing experiments to get more information than just evidence for a single hypothesis is old hat.
I’m going to commit a social faux pas and respond to my own comment, because multiple subthreads are all saying essentially the same thing: this is just math, the theory is known, humans can already do it (often with some help from computers to get through the math).
As I’ve read it, one of the major takeaways of lesswrong is that AI is not magical. If humans cannot possibly figure out the theory, neither can an AI. If humans cannot possibly do the math (possibly with some help from a computer), neither can an AI. Anything an AI can do, a human can also do in principle. They differ only in the degree: AIs will eventually be able to do much more complicated math, solve much more complicated problems, and self-improve much faster and more reliably.
So if you look at my original suggestion and think “that’s nothing special, a human can do that in theory” then you’re completely correct. Things humans can do IN THEORY are EXACTLY the things with which an AI can help.
I’ve read this 3 times, and still can’t tell what it’s supposed to mean. For starters, what is the “that” in “uses that as the input”? Grammatically, it should be “a discrepancy”, but a discrepancy is a pair of outputs that are different, and I don’t see how to use that as an input. I’m also dismayed by the phrase “almost always … is more or less the same as always”, which is a tautology, and so should not be a conclusion.
Something funny has happened to the font of this post, making it difficult for me to read.
Should be fixed now, I think.
It seems to have to do with the fact that the font is sometimes Times, and sometimes whatever the default font is. I don’t see an easy way of fixing this that doesn’t involve me changing lots of HTML tags by hand. If anyone knows how to fix it easily then please let me know.
Um, load your post into a text editor and do a global search-and-replace..?
Luke’s solution
I’ve had some success in the past cut-n-pasting text into Notepad (which automatically strips all formatting) and then back into the edit window.
Won’t that kill things like bullets, links, images, etc.?
Yeah, I guess so. Still might be worth it relative to changing lots of HTML tags by hand. By my (in all likelihood, none too accurate) count, you have four links, three latex expressions (only one of which needs to be latex), two numbered lists, one bullet list, and a bunch of section titles. Not too bad to do by hand...
Sure. But, in order to check the communication channel, which of those examples do you think I (as one of the people in the previous post arguing for seeing the debate as one over word-interpretation) would disagree with either your content or your presentation? How about Eliezer?
I think I should once again push the point that the language and communication difficulties are the primary drivers of this discussion, not the mathematical difficulties: Eliezer’s original post, to me*, is that there is a limited but large case of problems where randomness shouldn’t help, and it’s disappointing that academics in that field don’t know this, and that the terminology they use lends itself to this confusion.
*I should point out I misinterpreted him in one summarization, at least, by shortening a ’kill phrase X and replace it with phrase Y” as “use phrase X as shorthand for phrase Y”
Which field are we talking about? What people? The weighted majority algorithm (the topic of the post that started this all) is one of the cornerstones of statistical learning theory. I would guess that pretty much everyone who knows statistical learning theory well already knows that pure strategies are optimal given complete (probabilistic) knowledge of the environment.
Assuming the existence of closed-form solutions which is not a given.
If your environment is sufficiently complex, you may not be able discover the optimal pure strategy in reasonable time.
I mean yes, I did just write a quarter of my post on this topic :).
Apparently the easiest way to construct an expander graph is via randomness. Deterministic constructions are very difficult.
While an interesting idea, I believe most people just call this “gambling”.
I don’t understand this. It’s perfectly fine for a module (an object) to declare what kind of inputs it will accept. Modularity in software basically means “write to the declared interface and treat internals as a black box” and I don’t see why requiring a particular set of inputs is a problem.
Overall, though, I agree—randomness is highly useful in the real world where even if an optimal solution is known to exist it is often the case that you can’t spend the resources (notably, time) to figure it out.
I’m not sure what you’re driving at here. A gambling system where everybody has a net expected gain is still a good use of randomness.
So what’s the difference between a market where you can buy a chance to get some asset for cash, and a casino where you can buy a chance to get some asset for cash? You have non-coerced humans doing voluntary transactions in both of them.
Don’t forget that you still need a counterparty in the markets.
Attempted a clearer explanation in this comment. Any feedback you have would be great!