Vinayak Pathak

Karma: 24

Vinayak Pathak 26 Mar 2026 19:17 UTC
1 point
0
AF
in reply to: Cole Wyeth’s comment on: An Introduction to Credal Sets and Infra-Bayes Learnability
I was going to say something similar. After reading the first two posts of the sequence I really thought the role of credal sets in defining regret would be somewhat different.
In particular, consider to be the classical (non-infra) regret for a given policy on a given environment . For a given environment class , we previously considered two notions of learnability depending on the kind of uncertainty we had over . First, under knightian uncertainty, we required that our policy satisfy , and under Bayesian uncertainty, we required .^[1] A credal set gives us a new way of quantifying our uncertainty over . Let be that credal set. Then we could instead require that . This has the property that if your happens to be the set of all distributions then the regret reduces to the usual regret, and if then you get Bayesian regret. Perhaps this is what @Vanessa Kosoy means by “infra-Bayes-regret” in her comment below. If so, I’m curious what results are known for this notion of regret.
1. ^
  I’m ignoring here for simplicity.

Vinayak Pathak 25 Mar 2026 19:01 UTC
1 point
0
AF
on: What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism
Not only is realizability not guaranteed, it is extremely unrealistic because of the computational complexity of the real world. Furthermore, it is impossible for an agent to specify a hypothesis that has greater computational complexity than itself, which is the problem of irreflexivity.
It’s unclear to me exactly when is irreflexivity an actual problem for a learner. I understand that a learner cannot simulate a process that is computationally more complex than the learner itself, but I’m unsure when an exact simulation is necessary for learning.

Consider, for example, the learning problem where some function is assigning integer labels Y to input graphs X and you need to identify the function. Suppose further that your “hypothesis class” consists of two functions: calculates the size of the maximum clique in X and calculates the size of the maximum independent set. Both are NP-complete, so any computationally efficient learner cannot hope to simulate either of them. However, in an “active” learning like set up, where the learner can choose X’s to query, the problem is quite trivial to solve. Indeed, the learner simply constructs a fully connected graph on n vertices and queries for its label. If the label is n, then must be the true hypothesis, otherwise must be the true hypothesis.

Yes, this example uses a definition of “learning” that’s perhaps quite different from what this article thinks of as learning. However, intuitively, I feel that for most reasonable definitions of learning, the computational complexity of individual hypotheses in the hypothesis class cannot be the thing that characterizes the hardness of learning, but rather it has to be some measure of how complex the entire hypothesis class is.

Vinayak Pathak 27 Feb 2025 14:56 UTC
2 points
0
in reply to: Jesse Hoogland’s comment on: Empirical risk minimization is fundamentally confused
Ah, I just noticed it’s an old post. I was just clicking through all the SLT links. :)

Vinayak Pathak 27 Feb 2025 4:15 UTC
7 points
0
on: Empirical risk minimization is fundamentally confused
Having written a few papers about ERM and its variants, I feel personally attacked! I feel obliged to step in and try to defend ERM’s honour.
First of all, I don’t think I would call ERM a learning framework. ERM is a solution to the mathematical problem of PAC learning. Theoretical computer scientists like to precisely define the problem they want to solve before trying to solve it. When they were faced with the problem of learning, they decided that the mathematical problem of PAC learning was a good representation of the real-world problem of learning. Once you decide that PAC learning is what you want to solve, then ERM is the solution. Whether or not PAC learning truly represents the real world problem of learning is debatable and theorists have always agreed about that, with the main shortcoming being the insistence on the worst case probability distribution. But PAC learning is also a pretty solid first attempt at defining what learning means, which, in fact, used to produce useful predictions about practical learning methods before neural networks took over. Neural networks seem to work even in cases where PAC learning can be proved not to, and that means we need to come up with a better definition of what learning means. But PAC learning does have some very neat ideas built into it that whatever the better definition happens to be shouldn’t ignore.
So what’s so great about PAC learning? There’s this famous quote by Vladimir Vapnik: “When solving a problem of interest, do not solve a more general problem as an intermediate step.” PAC learning incorporates this advice in earnest. You point out that learning is inherently about probabilities, which is true. But using this to infer that we first need to model the probability distribution over (X, Y) is precisely the thing Vapnik warns against. In the end, we simply want to make predictions and we should directly model that. You can say that a predictor is a (potentially randomized) algorithm that maps X’s to Y’s. Then your learner’s task is to look at a finite sample and output a predictor that makes good predictions in expectation (where the expectation is taken over the true data distribution and the predictor’s randomness). If it turns out that the learner needs to somehow model the distribution over (X, Y) to come up with a good predictor then so be it, but that shouldn’t be a part of the definition of the problem. But having defined the problem this way, one soon realizes that randomness in f is now unnecessary for the same reason that when trying to guess the outcome of a (biased) coin toss the best guess is deterministic. The best coin toss guess strategy is to simply guess the face that has the higher chance of showing up. Similarly the best predictor for a classification problem is the deterministic predictor that, for each x, outputs the label that has the highest chance of being the true label for x. Outputting a deterministic predictor does not mean we haven’t modelled the inherent probabilities of the learning problem.
Empirical risk minimization does not feel fundamentally confused to me. ERM is, in fact, an amazingly elegant solution to the problem of PAC learning. One could reasonably say that PAC learning is somewhat confused, but learning theorists are working on it!

Vinayak Pathak 20 Feb 2025 3:22 UTC
LW: 5 AF: 2
0
AF
in reply to: Jesse Hoogland’s comment on: Neural networks generalize because of this one weird trick
Thanks, this clarifies many things! Thanks also for linking to your very comprehensive post on generalization.
To be clear, I didn’t mean to claim that VC theory explains NN generalization. It is indeed famously bad at explaining modern ML. But “models have singularities and thus number of parameters is not a good complexity measure” is not a valid criticism of VC theory. If SLT indeed helps figure out the mysteries from the “understanding deep learning...” paper then that will be amazing!
But what we’d really like to get at is an understanding of how perturbations to the true distribution lead to changes in model behavior.
Ah, I didn’t realize earlier that this was the goal. Are there any theorems that use SLT to quantify out-of-distribution generalization? The SLT papers I have read so far seem to still be talking about in-distribution generalization, with the added comment that Bayesian learning/SGD is more likely to give us “simpler” models and simpler models generalize better.

Vinayak Pathak 19 Feb 2025 23:10 UTC
1 point
0
in reply to: Liam Carroll’s comment on: DSLT 0. Distilling Singular Learning Theory
I have also been looking for comparisons between classical theory and SLT that make the deficiencies of the classical theories of learning clear, so thanks for putting this in one place.
However, I find the narrative of “classical theory relies on the number of parameters but SLT relies on something much smaller than that” to be a bit of a strawman towards the classical theory. VC theory already only depends on the number of behaviours induced by your model class as opposed to the number of parameters, for example, and is a central part of the classical theory of generalization. Its predictions still fail to explain generalization of neural networks but several other complexity measures have already been proposed.

Vinayak Pathak 19 Feb 2025 22:36 UTC
1 point
0
AF
on: You’re Measuring Model Complexity Wrong
Neural networks are intrinsically biased towards simpler solutions.
Am I correct in thinking that being “intrinsically biased towards simpler solutions” isn’t a property of neural networks, but a property of the Bayesian learning procedure? The math in the post doesn’t use much about NN’s and it seems like the same conclusions can be drawn for any model class whose loss landscape has many minima with varying complexities?

Vinayak Pathak 19 Feb 2025 22:14 UTC
LW: 3 AF: 1
0
AF
on: Neural networks generalize because of this one weird trick
Perhaps I have learnt statistical learning theory in a different order than others, but in my mind, the central theorem of statistical learning theory is that learning is characterized by the VC-dimension of your model class (here I mean learning in the sense of supervised binary classification, but similar dimensions exist for some more general kinds of learning as well). VC-dimension is a quantity that does not even mention the number of parameters used to specify your model, but depends only on the number of different behaviours induced by the models in your model class on sets of points. Thus if multiple parameter values lead to the same behaviour, this isn’t a problem for the theory at all because these redundancies do not increase the VC-dimension of the model class. So I’m a bit confused about why singular learning theory is a better explanation of generalization than VC-dimension based theories.
On the other hand, one potential weakness of singular learning theory seems to be that its complexity measures depend on the true data distribution (as opposed to VC-dimension that depends only on the model class)? I think what we want from any theory of generalization is that it should give us a prediction process $P$ that takes any learning algorithm as input and predicts whether it will generalize or not. The procedure $P$ cannot require knowledge of the true data distribution because if we knew the data distribution we would not need to learn anything in the first place. If the claim is that it only needs to know certain properties of the true distribution that can be estimated from a small number of samples, then it will be nice to have a proof of such a claim (not sure if that exists). Also note that if $P$ is allowed access to samples, then predicting whether your model generalizes is as simple as checking its performance on the test set.

Vinayak Pathak 16 Feb 2025 17:49 UTC
10 points
1
on: Gauging Interest for a Learning-Theoretic Agenda Mentorship Programme
The participants will be required to choose a project out of a list I provide. They will be able to choose to work solo or in a group.
Is this list or an approximate version of it available right now? Since the application process itself requires a non-trivial time commitment, it might be nice to see the list before deciding whether to apply.

Vinayak Pathak 19 May 2024 2:08 UTC
1 point
0
AF
on: Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
I read the paper, and overall it’s an interesting framework. One thing I am somewhat unconvinced about (likely because I have misunderstood something) is its utility despite the dependence on the world model. If we prove guarantees assuming a world model, but don’t know what happens if the real world deviates from the world model, then we have a problem. Ideally perhaps we want a guarantee akin to what’s proved in learning theory, for example, that the accuracy will be small for any data distribution as long as the distribution remains the same during training and testing.
But perhaps I have misunderstood what’s meant by a world model and maybe it’s simply the set of precise assumptions under which the guarantees have been proved. For example, in the learning theory setup, maybe the world model is the assumption that the training and test distributions are the same, as opposed to a description of the data distribution.

Vinayak Pathak 28 Feb 2021 0:00 UTC
3 points
0
in reply to: Vladimir_Nesov’s comment on: Why do stocks go up?
Hmm, but what if everything gets easier to produce at a similar rate as the consumer basket? Won’t the prices remain unaffected then?