My name is pronounced “YOO-ar SKULL-se”.
I’m a DPhil Scholar at the Future of Humanity Institute in Oxford.
My name is pronounced “YOO-ar SKULL-se”.
I’m a DPhil Scholar at the Future of Humanity Institute in Oxford.
I should clarify that I’m not necessarily saying that there can’t be cases in which a system that is believed or intended to be an optimizer_1 might become or turn out to be an optimizer_2 – I have not really argued for or against this. What I want to do is enable clearer thinking about issue, so that one does not slide between these two concepts without noticing.
I don’t think that I’m assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of “optimizer” does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?
A linear program solver is a system that maximises or minimises a linear function subject to non-strict linear constraints.
Many SAT-solvers are implemented as optimizers. For example, they might try to find an assignment that satisfies as many clauses as possible, or they might try to minimise the size of the clauses using resolution.
The argument I quoted does not mention evolution. I’m not saying that the argument can’t be patched, I’m saying that the argument is inadequate as stated. I should note, however, that evolution selects organisms based on their ability to do optimisation_2, not their ability to do optimisation_1. It’s therefore not clear when and how you can simply “generalise from evolution”.
I know these things. Nothing you have said contradicts my point, as far as I can see. The point I am making here is one of conceptual clarification, which the intent of enabling more clear thinking and reasoning.
You seem to be talking about a system that outputs “plans that, if implemented, would achieve X” (roughly), and your point seems to be that such a system would be likely to be or behave like an optimizer_2. I find this claim quite plausible (and fully compatible with the point I’m making).
“It would require a selective blindness to make the superintelligence assume that it is disembodied, and that its computations will continue and produce effects in real world even if its body is destroyed.”
Unclear, if anything it seems like it might be easier to make a Cartesian AI than a non-Cartesian one. But that’s a side note.
I agree with everything you have said.
The fact that a superintelligent AI contains an optimization algorithm does not necessarily mean that this optimization algorithm is itself superintelligent (or that it has access to the world model of the overall system, etc). It might, it might not – it depends on the design of the system.
”the Cartesian boundary is an issue once we talk about self-improving AI.”
This presumably depends on a lot of specific facts about how the system is designed.
I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but its unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is “leaky” in a given case is something that should be carefully examined, not something that should be glossed over.
I also agree with what ofer said.
“even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it “breaks” the isomorphism and does something that might still keep the isomorphism in tact but also does other things you didn’t think it would do if the isomorphism were strict”
I agree. Part of the reason why it’s valuable to make the distinction is to enable more clear thinking about these sorts of issues.
As you may know, CDT has a lot of fans in academia. It might be interesting to consider what they have to say about Newcomb’s Problem (and other supposed counter-examples to CDT).
In “The Foundations of Causal Decision Theory”, James Joyce argues that Newcomb’s Problem is “unfair” on the grounds that it treats EDT and CDT agents differently. An EDT agent is given two good choices ($1,000,000 and $1,001,000) whereas a CDT agent is given two bad choices ($0 and $1,000). If you wanted to represent Newcomb’s Problem as a Markov Decision Process then you would have to put EDT and CDT agents in different MDPs. Lo and behold, the EDT agent gets more money, but this is (according to Joyce) just because it is given an unfair advantage. Hence Newcomb’s Problem isn’t really too different from the obviously unfair “decision” problem you gave above, the unfairness is just obfuscated. The fact that EDT outperforms CDT in a situation in which EDT agents are unconditionally given more money than CDT agents is not an interesting objection to CDT, and so Newcomb’s Problem is not an interesting objection to CDT (according to Joyce).
It might be worth thinking about this argument. Note that this argument operates at the level of individual decision problems, and doesn’t say anything about whether its worth taking into account the possibility that different sorts of agents might tend end up in different sorts of situations. It also presumes a particular way of answering the question of whether two decision problems are “the same” problem.
I also want to note that you don’t need perfect predictors, or anything even close to that, to create Newcomblike situations. Even if the Predictor’s accuracy is only somewhat better than a coin flip this is sufficient to make the causal expected utility different from the evidential expected utility. The key property is that which action you take constitutes evidence about the state of the environment, which can happen in many ways.
Yes, as Adam Shimi suggested below, I’m here talking about “input-output relations” when I say “functions”.
The claim can be made formal in a few different ways, but one way is this: let’s say we have a classification task with 10 classes, 50k training examples, and 10k test examples. If we consider the domain to be these 60k instances (and the codomain to be the 10 classes), then we have 10^60k functions defined in this space, 10^10k of which fit the training data perfectly. Of these 10^10k functions, the vast vast majority will have very poor accuracy on the test set. Therefore, if we tried to do inference by picking a random function that fits the training data, we would almost certainly get very poor generalisation (in fact, we would expect to get the same accuracy as with random guessing).
We can sometimes ensure that a learning algorithm will generalise well if it can only express a restricted set of functions (see VC dimensionality). However, a neural network is too expressive for this explanation to work (because they can express all those 10^10k functions). Therefore, there must be some further explanation for why neural networks tend to learn functions that generalise well (ie, why they tend to learn low-complexity functions, when they in principle could learn a high-complexity function instead).
We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.
It’s true that we don’t have a rigorous mathematical proof for “why” simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.
First of all, we can look at the claim “backwards”; suppose a function has a large volume in parameter-space. This means that a lot of the parameters in a sense are redundant, so it should be possible to compress the representation of this function. Conversely, if a function has a small volume in parameter-space, then all the parameters of the network are necessary to pinpoint that function, and so writing out the entire structure of the network might be one of the shortest ways to represent that function. Therefore, we should expect functions with large volumes in parameter-space to be simple, and functions with small volumes to be complex.
There is a mathematical theorem, called the Levin bound, that roughly formalises this intuition. This bound says essentially that if we have a space of functions F, and a space of parameters P, and a parameter-function map m : P → F, where m itself has low complexity relative to the most complex functions in F (and F is overparameterised by P), then m will be exponentially biased towards simple functions (ie, if an element f of F has low complexity, then there is an exponentially larger number of elements of P that map to f via m).
The Levin bound doesn’t apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.
Also, check out this.
We do have empirical data which shows that the neural network “prior” is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case—see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural networks work.
I haven’t seen the paper you link, so I can’t comment on it specifically, but I want to note that the claim “SGD is roughly Bayesian” does not imply “Bayesian inference would give better generalisation than SGD”. It can simultaneously be the case that the neural network “prior” is biased towards low-complexity functions, that SGD roughly follows the “prior”, and that SGD provides some additional bias towards low-complexity functions (over and above what is provided by the “prior”). For example, if you look at Figure 6 in the post I link, you can see that different versions of SGD do provide a slightly different inductive bias. However, this effect seems to be quite small relative to what is provided by the “prior”.
I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I’ll make some clarifications here as well, for the benefit of others:
The key claim, that , is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granuralise the space of possible functions if we want to compare these two quantities empirically. If we take “the space of functions” to be all the functions that a given neural network can express on the entire vector space in which it is defined, then there would be an uncountably infinite number of such functions, and any given function would never show up more than once in any kind of experiment we could do. We therefore need a way to lump the functions together into sensible buckets, and we decided to do that by looking at what output the function gives on a set of images not used in training. Stated differently, we look at the partial function that the network expresses on a particular subset of the input vector space, consisting in a bunch of points sampled from the underlying data distribution. So, basically, your optimistic guess is correct—the test data is only used as a way to lump functions together into a finite number of sensible buckets (and the test labels are not used).
Yes, I agree with this.
I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It’s also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and batch size you use, etc. However, these effects seem to be quite small compared to the effect that comes from the parameter-function map.
I should also note that I think that the fact that Gaussian processes even work at all already in itself gives us a fairly good reason to expect them to capture most of what makes NNs work in practice. For any given function approximator, if that function approximator is highly expressive then the “null hypothesis” should be that it basically won’t generalise at all. The fact that NNs and GPs both work, and the fact that there is a fairly strong correspondence between them, means that it would be kind of weird if they worked for completely different reasons.
I’m not sure I agree with this—I think Kolmogorov complexity is a relevant notion of complexity in this context.
Ah, I certainly agree with this.
I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it’s very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training on only odd numbers then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only number in a certain range then the network will not robustly generalise outside this range. This is because a pattern such as “the n’th input neuron is equal to the n’th output neuron” lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and etc.
So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot “see” properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to “see” more kinds of functions. For example, NALUs give neural networks the ability to “see” arithmetic relations more easily. I hence certainly think it’s a very relevant question which complexity measure best describes the bias in neural networks (and I think this actually matters for practical problems). Note that the identity function is very smooth.
This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It’s a fairly simple theorem; the proof is constructive, and basically shows that a given function f which corresponds to many parameters in the parameter-space cannot be too complex, by constructing a simple program which computes f. Very roughly, if the parameter-space is finite and discrete, then we could construct a Huffman code for the function space (where the distribution over the function-space is the distribution that corresponds to the uniform distribution over the parameter-space). We can then make a computer program that computes f by concatenating the Huffman code of f and the parameter-function map m (which gives an upper bound on the Kolmogorov complexity of functions with large volumes). Of course, this theorem does not per se actually apply to neural networks, since it assumes that the parameter-space is finite and discrete, so in this context it’s essentially just an intuition pump.
There is a proof of it in “An Introduction to Kolmogorov Complexity and Its Applications” by Ming Li & Paul Vitanyi.
Yes, it does of course apply in that sense.
I guess the question then basically is which level of abstraction we think would be the most informative or useful for understanding what’s going on here. I mean, we could for example also choose to take into account the fact that any actual computer program runs on a physical computer, which is governed by the laws of electromagnetism (in which case the parameter-space might count as continuous again).
I’m not sure if accounting for the floating-point implementation is informative or not in this case.
Like Rohin, I’m not impressed with the information theoretic side of this work.
Specifically, I’m wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don’t think this captures the kinds of function complexity we care about for NNs.
We’re not saying that discrete complexity measures fully capture what we care about for NNs! We do however think that they are sufficiently relevant to be informative for the bigger picture, even if just as a proxy for what we actually care about.
Most complexity measures give roughly similar values for the (relative) complexity of most objects, so our assumption is that if something is the case for a bunch of different tractable complexity measures, then this is also likely to be the case for whatever the “ideal” complexity measure would be in the relevant case. In particular, if regardless of whether K represents Boolean complexity, or LZ complexity, etc, then this is also likely to be true for the “ideal” complexity measure for neural networks.
Also: since we’re estimating various probabilities by sampling, we basically need to discretise the function space. If you have any concrete suggestions for how to get around this then we’re all ears!
As for the rest of your comment—what you’re saying here seems true to me, but I’m not sure I see how any of this is a counterpoint to anything we’re saying?
Yes, agents with different preferences are incentivised to cooperate provided that the cost of enforcing cooperation is less than the cost of conflict. Agreeing to adopt a shared utility function via acausal trade might potentially be a very cheap way to enforce cooperation, and some agents might do this just based on their prior. However, this is true for any agents with different preferences, not just agents of the type I described. You could use the same argument to say that you are in general unlikely to find two very intelligent agents with different utility functions.