My name is pronounced “YOO-ar SKULL-se” (the “e” is not silent). I’m a PhD student at Oxford University, and I was a member of the Future of Humanity Institute before it shut down. I have worked in several different areas of AI safety research. For a few highlights, see:
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
STARC: A General Framework For Quantifying Differences Between Reward Functions
Risks from Learned Optimization in Advanced Machine Learning Systems
Some of my recent research on the theoretical foundations of reward learning is also described in this sequence.
For a full list of all my research, see my Google Scholar.
Yes, I mostly just mean “low test error”. I’m assuming that real-world problems follow a distribution similar to the Solomonoff prior (i.e., that data-generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity); this assumption is what links simplicity to low test error. It is an assumption about the real world, not something that can be established mathematically.
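The shape of this assumption can be illustrated with a toy simplicity prior that weights each hypothesis by 2^(-K), where K is its description length in bits. This is only a sketch: true Kolmogorov complexity is uncomputable, and the description lengths below are made-up numbers for illustration, not derived from any real encoding.

```python
from fractions import Fraction

def simplicity_prior(description_lengths):
    """Toy Solomonoff-style prior: weight each hypothesis by
    2^(-K) bits of description length, then normalise so the
    weights sum to 1. Exact arithmetic via Fraction avoids
    floating-point error in the tiny weights."""
    weights = {h: Fraction(1, 2 ** k) for h, k in description_lengths.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Hypothetical description lengths (in bits) for three candidate
# data-generating functions; the numbers are purely illustrative.
lengths = {"constant": 3, "linear": 5, "lookup_table": 20}
prior = simplicity_prior(lengths)
```

Under such a prior, the simpler hypotheses receive exponentially more mass, so a learner biased toward low-complexity functions will tend to generalise well exactly when the real world’s data-generating processes are themselves biased toward low complexity.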