Daniel Murfet

Karma: 1,184

Simple versus Short: Higher-order degeneracy and error-correction

Daniel Murfet11 Mar 2024 7:52 UTC

89 points

5 comments12 min readLW link

Daniel Murfet 22 Oct 2023 21:03 UTC
LW: 31 AF: 17
9
AF
in reply to: ryan_greenblatt’s comment on: Announcing Timaeus
Great question, thanks. tldr it depends what you mean by established, probably the obstacle to establishing such a thing is lower than you think.
To clarify the two types of phase transitions involved here, in the terminology of Chen et al:
- Bayesian phase transition in number of samples: as discussed in the post you link to in Liam’s sequence, where the concentration of the Bayesian posterior shifts suddenly from one region of parameter space to another, as the number of samples increased past some critical sample size $n$ . There are also Bayesian phase transitions with respect to hyperparameters (such as variations in the true distribution) but those are not what we’re talking about here.
- Dynamical phase transitions: the “backwards S-shaped loss curve”. I don’t believe there is an agreed-upon formal definition of what people mean by this kind of phase transition in the deep learning literature, but what we mean by it is that the SGD trajectory is for some time strongly influenced (e.g. in the neighbourhood of) a critical point $w_{α}^{*}$ and then strongly influenced by another critical point $w_{β}^{*}$ . In the clearest case there are two plateaus, the one with higher loss corresponding to the label $α$ and the one with the lower loss corresponding to $β$ . In larger systems there may not be a clear plateau (e.g. in the case of induction heads that you mention) but it may still reasonable to think of the trajectory as dominated by the critical points.
The former kind of phase transition is a first-order phase transition in the sense of statistical physics, once you relate the posterior to a Boltzmann distribution. The latter is a notion that belongs more to the theory of dynamical systems or potentially catastrophe theory. The link between these two notions is, as you say, not obvious.
However Singular Learning Theory (SLT) does provide a link, which we explore in Chen et al. SLT says that the phases of Bayesian learning are also dominated by critical points of the loss, and so you can ask whether a given dynamical phase transition $α \to β$ has “standing behind it” a Bayesian phase transition where at some critical sample size the posterior shifts from being concentrated near $w_{α}^{*}$ to being concentrated near $w_{β}^{*}$ .
It turns out that, at least for sufficiently large $n$ , the only real obstruction to this Bayesian phase transition existing is that the local learning coefficient near $w_{β}^{*}$ should be higher than near $w_{α}^{*}$ . This will be hard to prove theoretically in non-toy systems, but we can estimate the local learning coefficient, compare them, and thereby provide evidence that a Bayesian phase transition exists.
This has been done in the Toy Model of Superposition in Chen et al, and we’re in the process of looking at a range of larger systems including induction heads. We’re not ready to share those results yet, but I would point you to Nina Rimsky and Dmitry Vaintrob’s nice post on modular addition which I would say provides evidence for a Bayesian phase transition in that setting.
There are some caveats and details, that I can go into if you’re interested. I would say the existence of Bayesian phase transitions in non-toy neural networks is not established yet, but at this point I think we can be reasonably confident they exist.
What links here?
- Singular learning theory and bridging from ML to brain emulations by kave (1 Nov 2023 21:31 UTC; 26 points)

Daniel Murfet 20 Nov 2023 18:52 UTC
29 points
12
in reply to: Joar Skalse’s comment on: My Criticism of Singular Learning Theory
I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don’t think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?

That seems probable. Maybe it’s useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we’re on the same page. When people refer to the “generalisation puzzle” in deep learning I think they mean two related but distinct things:

(i) the general question about how it is possible for overparametrised models to have good generalisation error, despite classical interpretations of Occam’s razor like the BIC
(ii) the specific question of why neural networks, among all possible overparametrised models, actually have good generalisation error in practice (saying this is possible is much weaker than actually explaining why it happens).

In my mind SLT comes close to resolving (i), modulo a bunch of questions which include: whether the asymptotic limit taking the dataset size to infinity is appropriate in practice, the relationship between Bayesian generalisation error and test error in the ML sense (comes down largely to Bayesian posterior vs SGD), and whether hypotheses like relative finite variance are appropriate in the settings we care about. If all those points were treated in a mathematically satisfactory way, I would feel that the general question is completely resolved by SLT.

Informally, knowing SLT just dispels the mystery of (i) sufficiently that I don’t feel personally motivated to resolve all these points, although I hope people work on them. One technical note on this: there are some brief notes in SLT6 arguing that “test error” as a model selection principle in ML, presuming some relation between the Bayesian posterior and SGD, is similar to selecting models based on what Watanabe calls the Gibbs generalisation error, which is computed by both the RLCT and singular fluctuation. Since I don’t think it’s crucial to our discussion I’ll just elide the difference between Gibbs generalisation error in the Bayesian framework and test error in ML, but we can return to that if it actually contains important disagreement.

Anyway I’m guessing you’re probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).

Any theoretical resolution to (ii) has to involve some nontrivial ingredient that actually talks about neural networks, as opposed to general singular statistical models. The only specific results about neural networks and generalisation in SLT are the old results about RLCTs of tanh networks, more recent bounds on shallow ReLU networks, and Aoyagi’s upcoming results on RLCTs of deep linear networks (particularly that the RLCT is bounded above even when you take the depth to infinity).

As I currently understand them, these results are far from resolving (ii). In its current form SLT doesn’t supply any deep reason for why neural networks in particular are often observed to generalise well when you train them on a range of what we consider “natural” datasets. We don’t understand what distinguishes neural networks from generic singular models, nor what we mean by “natural”. These seem like hard problems, and at present it looks like one has to tackle them in some form to really answer (ii).

Maybe that has significant overlap with the critique of SLT you’re making?

Nonetheless I think SLT reduces the problem in a way that seems nontrivial. If we boil the “ML in-practice model selection” story to “choose the model with the best test error given fixed training steps” and allow some hand-waving in the connection between training steps and number of samples, Gibbs generalisation error and test error etc, and use Watanabe’s theorems (see Appendix B.1 of the quantifying degeneracy paper for a local formulation) to write the Gibbs generalisation error as

$G_{g} (n) = L_{0} + \frac{1}{n} (λ + ν)$

where $λ$ is the learning coefficient and $ν$ is the singular fluctuation and $L_{0}$ is roughly the loss (the quantity that we can estimate from samples is actually slightly different, I’ll elide this) then (ii), which asks why neural networks on natural datasets have low generalisation error, is at least reduced to the question of why neural networks on natural datasets have low $L_{0}, λ, ν$ .

I don’t know much about this question, and agree it is important and outstanding.

Again, I think this reduction is not trivial since the link between $λ, ν$ and generalisation error is nontrivial. Maybe at the end of the day this is the main thing we in fact disagree on :)
What links here?
- My Criticism of Singular Learning Theory by Joar Skalse (19 Nov 2023 15:19 UTC; 77 points)

Daniel Murfet 20 Nov 2023 8:16 UTC
LW: 25 AF: 14
11
AF
on: My Criticism of Singular Learning Theory
The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by

$f (x) = θ_{1} + θ_{2} θ_{3} x + θ_{4} θ_{5} θ_{6} x^{2} + θ_{7} θ_{8} θ_{9} θ_{10} x^{3} + θ_{11} θ_{12} θ_{13} θ_{14} θ_{15} x^{4},$
and whose loss function is the KL divergence. This learning machine will learn 4-degree polynomials. Moreover, it is overparameterised, and its loss function is analytic in its parameters, etc, so SLT will apply to it.

In your example there are many values of the parameters that encode the zero function (e.g. $θ_{1} = θ_{2} = θ_{4} = θ_{7} = θ_{11} = 0$ and all other parameters free) in addition to there being many parameters that encode the function $x^{4}$ (e.g. $θ_{1} = θ_{2} = θ_{4} = θ_{7} = 0$ , variables $θ_{3}, θ_{5}, θ_{6}, θ_{8}, θ_{9}, θ_{10}$ free and $θ_{11} θ_{12} θ_{13} θ_{14} θ_{15} = 1$ ). Without thinking about it more I’m not sure which is actually has local learning coefficient (RLCT) and therefore counts as “more simple” from an SLT perspective.

However, if I understand correctly it’s not this specific example that you care about. We can agree that there is some way of coming up with a simple model which (a) can represent both the functions $x \mapsto 0$ and $x \mapsto x^{2}$ and (b) has parameters $w_{0}^{*}$ and $w_{s q u a r e}^{*}$ respectively representing these functions with local learning coefficients $λ (w_{0}^{*}) > λ (w_{s q u a r e}^{*})$ . That is, according to the local learning coefficient as a measure of model complexity, the neighbourhood of the parameter $w_{0}^{*}$ is more complex than that of $w_{s q u a r e}^{*}$ . I believe your observation is that this contradicts an a priori notion of complexity that you hold about these functions.

Is that a fair characterisation of the argument you want to make?

Assuming it is, my response is as follows. I’m guessing you think $x \mapsto 0$ is simpler than $x \mapsto x^{2}$ because the former function can be encoded by a shorter code on a UTM than the latter. But this isn’t the kind of complexity that SLT talks about: the local learning coefficient $λ (w^{*})$ that appears in the main theorems represents the complexity of representing a given probability distribution $p (x | w^{*})$ using parameters from the model, and is not some intrinsic model-free complexity of the distribution itself.

One way of saying it is that Kolmogorov complexity is the entropy cost of specifying a machine on the description tape of a UTM (a kind of absolute measure) whereas the local learning coefficient is the entropy cost per sample of incrementally refining an almost true parameter in the neural network parameter space (a kind of relative measure). I believe they’re related but not the same notion, as the latter refers fundamentally to a search process that is missing in the former.

We can certainly imagine a learning machine set up in such a way that it is prohibitively expensive to refine an almost true parameter nearby a solution that looks like $x \mapsto 0$ and very cheap to refine an almost true parameter near a solution like $x \mapsto x^{2}$ , despite that being against our natural inclination to think of the former as simpler. It’s about the nature of the refinement / search process, not directly about the intrinsic complexity of the functions.

So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I’ll just post this as-is and make sure I’m not confused about your views up to this point.
What links here?
- My Criticism of Singular Learning Theory by Joar Skalse (19 Nov 2023 15:19 UTC; 77 points)

Daniel Murfet 21 Apr 2023 20:59 UTC
24 points
13
on: Should we publish mechanistic interpretability research?
The set of motivated, intelligent people with the relevant skills to do technical alignment work in general, and mechanistic interpretability in particular, has a lot of overlap with the set of people who can do capabilities work. That includes many academics, and students in masters and PhD programs. One way or another they’re going to publish, would you rather it be alignment/interpretability work or capabilities work?
It seems to me that speeding up alignment work by several orders of magnitude is unlikely to happen without co-opting a significant number of existing academics, labs and students in related fields (including mathematics and physics in addition to computer science). This is happening already, within ML groups but also physics (Max Tegmark’s students) and mathematics (e.g. some of my students at the University of Melbourne).
I have colleagues in my department publishing stacks of papers in CVPR, NeurIPS etc., which this community might call capabilities work. If I succeeded in convincing them to do some alignment or mechanistic interpretability work, they would do it because it was intrinsically interesting or likely to be high status. They would gravitate towards the kinds of work that are dual-use. Relative to the status quo that seems like progress to me, but I’m genuinely interested in the opinion of people here. Real success in this recruitment would, among other things, dilute the power of LW norms to influence things like publishing.
On balance it seems to me beneficial to aggressively recruit academics and their students into alignment and interpretability.

Daniel Murfet 11 Mar 2024 18:58 UTC
22 points
12
in reply to: ryan_greenblatt’s comment on: Simple versus Short: Higher-order degeneracy and error-correction
Maybe I can clarify a few points here:
- A statistical model is regular if it is identifiable and the Fisher information matrix is everywhere nondegenerate. Statistical models where the prediction involves feeding samples from the input distribution through neural networks are not regular.
- Regular models are the ones for which there is a link between low description length and low free energy (i.e. the class of models which the Bayesian posterior tends to prefer are those that are assigned lower description length, at the same level of accuracy).
- It’s not really accurate to describe regular models as “typical”, especially not on LW where we are generally speaking about neural networks when we think of machine learning.
- It’s true that the example presented in this post is, potentially, not typical (it’s not a neural network nor is it a standard kind of statistical model). So it’s unclear to what extent this observation generalises. However, it does illustrate the general point that it is a mistake to presume that intuitions based on regular models hold for general statistical models.
- A pervasive failure mode in modern ML is to take intuitions developed for regular models, and assume they hold “with some caveats” for neural networks. We have at this point many examples where this leads one badly astray, and in my opinion the intuition I see widely shared here on LW about neural network inductive biases and description length falls into this bucket.
- I don’t claim to know the content of those inductive biases, but my guess is that it is much more interesting and complex than “something like description length”.

Daniel Murfet 27 Aug 2023 20:54 UTC
LW: 21 AF: 7
6
AF
on: A list of core AI safety problems and how I hope to solve them
4. Goals misgeneralize out of distribution.
See: Goal misgeneralization: why correct specifications aren’t enough for correct goals, Goal misgeneralization in deep reinforcement learning
OAA Solution: (4.1) Use formal methods with verifiable proof certificates^[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to “prove” that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.
Based on the Bold Plan post and this one my main point of concern is that I don’t believe in the feasibility of the model checking, even in principle. The state space S and action space A of the world model will be too large for techniques along the lines of COOL-MC which (if I understand correctly) have to first assemble a discrete-time Markov chain by querying the NN and then try to apply formal verification methods to that. I imagine that actually you are thinking of learned coarse-graining of both S and A, to which one applies something like formal verification.
Assuming that’s correct, then there’s an inevitable lack of precision on the inputs to the formal verification step. You have to either run the COOL-MC-like process until you hit your time and compute budget and then accept that you’re missing state-action pairs, or you coarse-grain to some degree within your budget and accept a dependence on the quality of your coarse-graining. If you’re doing an end-run around this tradeoff somehow, could you direct me to where I can read more about the solution?
I know there’s literature on learned coarse-grainings of S and A in the deep RL setting, but I haven’t seen it combined with formal verification. Is there a literature? It seems important.
I’m guessing that this passage in the Bold Plan post contains your answer:

> Defining a sufficiently expressive formal meta-ontology for world-models with multiple scientific explanations at different levels of abstraction (and spatial and temporal granularity) having overlapping domains of validity, with all combinations of {Discrete, Continuous} and {time, state, space}, and using an infra-bayesian notion of epistemic state (specifically, convex compact down-closed subsets of subprobability space) in place of a Bayesian state
In which case I see where you’re going, but this seems like the hard part?
What links here?
- davidad's comment on A list of core AI safety problems and how I hope to solve them by davidad (28 Aug 2023 0:49 UTC; 5 points)

Daniel Murfet 24 Oct 2023 21:58 UTC
20 points
4
in reply to: Akash’s comment on: Announcing Timaeus
Thanks Akash. Speaking for myself, I have plenty of experience supervising MSc and PhD students and running an academic research group, but scientific institution building is a next-level problem. I have spent time reading and thinking about it, but it would be great to be connected to people with first-hand experience or who have thought more deeply about it, e.g.
- People with advice on running distributed scientific research groups
- People who have thought about scientific institution building in general (e.g. those with experience starting FROs in biosciences or elsewhere)
- People with experience balancing fundamental research with product development within an institution
I am seeking advice from people within my institution (University of Melbourne) but Timaeus is not a purely academic org and their experience does not cover all the hard parts.

Daniel Murfet 22 Oct 2023 21:20 UTC
20 points
5
in reply to: Algon’s comment on: Announcing Timaeus
I think it is too early to know how many phase transitions there are in e.g. the training of a large language model. If there are many, it seems likely to me that they fall along a spectrum of “scale” and that it will be easier to find the more significant ones than the less significant ones (e.g. we discover transitions like the onset of in-context learning first, because they dramatically change how the whole network computes).
As evidence for that view, I would put forward the fact that putting features into superposition is known to be a phase transition in toy models (based on the original post by Elhage et al and also our work in Chen et al) and therefore seems likely to be a phase transition in larger models as well. That gives an example of phase transitions at the “small” end of the scale. At the “big” end of the scale, the evidence in Olsson et al that induction heads and in-context learning appears in a phase transition seems convincing to me.
On general principles, understanding “small” phase transitions (where the scale is judged relative to the overall size of the system, e.g. number of parameters) is like probing a physical system at small length scales / high energy, and will require more sophisticated tools. So I expect that we’ll start by gaining a good understanding of “big” phase transitions and then as the experimental methodology and theory improves, move down the spectrum towards smaller transitions.
On these grounds I don’t expect us to be swamped by the smaller transitions, because they’re just hard to see in the first place; the major open problem in my mind is how far we can get down the scale with reasonable amounts of compute. Maybe one way that SLT & developmental interpretability fails to be useful for alignment is if there is a large “gap” in the spectrum, where beyond the “big” phase transitions that are easy to see (and for which you may not need fancy new ideas) there is just a desert / lack of transitions, and all the transitions that matter for alignment are “small” enough that a lot of compute and/or very sophisticated ideas are necessary to study them. We’ll see!

Daniel Murfet 16 Mar 2024 8:44 UTC
19 points
3
on: More people getting into AI safety should do a PhD
I think many early-career researchers in AI safety are undervaluing PhDs.
I agree with this. To be blunt, it is my impression from reading LW for the last year that a few people in this community seem to have a bit of a chip on their shoulder Re: academia. It certainly has its problems, and academics love nothing more than pointing them out to each other, but you face your problems with the tools you have, and academia is the only system for producing high quality researchers that is going to exist at scale over the next few years (MATS is great, I’m impressed by what Ryan and co are doing, but it’s tiny).

I would like to see many more academics in CS, math, physics and adjacent areas start supervising students in AI safety, and more young people go into those PhDs. Also, more people with PhDs in math and physics transitioning to AI safety work.

One problem is that many of the academics who are willing to supervise PhD students in AI safety or related topics are evaporating into industry positions (subliming?). There are also long run trends that make academia relatively less attractive than it was in the past (e.g. rising corporatisation) even putting aside salary comparisons, and access to compute. So I do worry somewhat about how many PhD students in AI safety adjacent fields can actually be produced per year this decade.

Daniel Murfet 6 Jul 2023 20:58 UTC
17 points
1
on: A Defense of Work on Mathematical AI Safety
Thanks for the article. For what it’s worth, here’s the defence I give of Agent Foundations and associated research, when I am asked about it (for background, I’m a mathematician, now working on mathematical aspects of AI safety different from Agent Foundations). I’d be interested if you disagree with this framing.
We can imagine the alignment problem coming in waves. Success in each wave merely buys you the chance to solve the next. The first wave is the problem we see in front of us right now, of getting LLMs to Not Say Naughty Things, and we can glimpse a couple of waves after that. We don’t know how many waves there are, but it is reasonable to expect that beyond the early waves our intuitions probably aren’t worth much.
That’s not a surprise! As physics probed smaller scales, at some point our intuitions stopped being worth anything, and we switched to relying heavily on abstract mathematics (which became a source of different, more hard-won intuitions). Similarly, we can expect that as we scale up our learning machines, we will enter a regime where current intuitions fail to be useful. At the same time, the systems may be approaching more optimal agents, and theories like Agent Foundations start to provide a very useful framework for reasoning about the nature of the alignment problem.
So in short I think of Agent Foundations as like quantum mechanics: a bit strange perhaps, but when push comes to shove, one of the few sources of intuition we have about waves 4, 5, 6, … of the alignment problem. It would be foolish to bet everything on solving waves 1, 2, 3 and then be empty handed when wave 4 arrives.

Daniel Murfet 19 Nov 2023 16:46 UTC
16 points
9
on: My Criticism of Singular Learning Theory
I think that the significance of SLT is somewhat over-hyped at the moment

Haha, on LW that is either already true or at current growth rates will soon be true, but it is clearly also the case that SLT remains basically unknown in the broader deep learning theory community.

Daniel Murfet 18 Jun 2023 16:44 UTC
14 points
4
on: My impression of singular learning theory
I think this is a very nice way to present the key ideas. However, in practice I think the discretisation is actually harder to reason about than the continuous version. There are deeper problems, but I’d start by wondering how you would ever compute c(f) defined this way, since it seems to depend in an intricate way on the details of e.g. the floating point implementation.
I’ll note that the volume codimension definition of the RLCT is essentially what you have written down here, and you don’t need any mathematics beyond calculus to write that down. You only need things like resolutions of singularities if you actually want to compute that value, and the discretisation doesn’t seem to offer any advantage there.

Daniel Murfet 21 Nov 2023 9:00 UTC
11 points
4
in reply to: Joar Skalse’s comment on: My Criticism of Singular Learning Theory
Well neural networks do obey Occam’s razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam’s razor).

I think that expression of Jesse’s is also correct, in context.

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like “generalisation” in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn’t make enough effort to prevent them from drawing these inferences.

I have noticed people at the Berkeley meeting and elsewhere believing (ii) was somehow resolved by SLT, or just in a vague sense thinking SLT says something more than it does. While there are hard tradeoffs to make in writing expository work, I think your criticism of this aspect of the messaging around SLT on LW is fair and to the extent it misleads people it is doing a disservice to the ongoing scientific work on this important subject.

I’m often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work. So I’m grateful for the chance to have this conversation (and to anybody reading this who sees other areas where they think we’re falling short, read this as an invitation to let me know, either privately or in posts like this).

I’ll discuss the generalisation topic further with the authors of those posts. I don’t want to pre-empt their point of view, but it seems likely we may go back and add some context on (i) vs (ii) in those posts or in comments, or we may just refer people to this post for additional context. Does that sound reasonable?

At least right now, the value proposition I see of SLT lies not in explaining the “generalisation puzzle” but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.

Daniel Murfet 12 Oct 2023 21:06 UTC
10 points
2
in reply to: Chris_Leong’s comment on: You’re Measuring Model Complexity Wrong
Not easily detected. As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.

Daniel Murfet 15 Jul 2023 5:29 UTC
10 points
0
in reply to: Mikhail Samin’s comment on: Towards Developmental Interpretability
That intuition sounds reasonable to me, but I don’t have strong opinions about it.
One thing to note is that training and test performance are lagging indicators of phase transitions. In our limited experience so far, measures such as the RLCT do seem to indicate that a transition is underway earlier (e.g. in Toy Models of Superposition), but in the scenario you describe I don’t know if it’s early enough to detect structure formation “when it starts”.
For what it’s worth my guess is that the information you need to understand the structure is present at the transition itself, and you don’t need to “rewind” SGD to examine the structure forming one step at a time.

Daniel Murfet 19 Nov 2023 16:52 UTC
9 points
5
on: My Criticism of Singular Learning Theory
First of all, SLT is largely is based on examining the behaviour of learning machines in the limit of infinite data

I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being among one of the main question marks I currently see (I think I probably also see the gap between Bayesian learning and SGD as bigger than you do).

I’ve discussed this a bit with my colleague Liam Hodgkinson, whose recent papers https://arxiv.org/abs/2307.07785 and https://arxiv.org/abs/2311.07013 might be more up your alley than SLT.

My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard. So far we have been pleasantly surprised at how well the free energy formula works at relatively low $n$ (in e.g. https://arxiv.org/abs/2310.06301). It remains an open question whether this asymptotic continues to provide useful insight into larger models with the kind of dataset size we’re using in LLMs for example.

Daniel Murfet 1 Nov 2023 22:16 UTC
9 points
0
on: Singular learning theory and bridging from ML to brain emulations
There is quite a large literature on “stage-wise development” in neuroscience and psychology, going back to people like Piaget but quite extensively developed in both theoretical and experimental directions. One concrete place to start on the agenda you’re outlining here might be to systematically survey that literature from an SLT-informed perspective.

Daniel Murfet 18 Aug 2023 9:20 UTC
9 points
5
on: Against Almost Every Theory of Impact of Interpretability
Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool.
This dramatically undersells the potential impact of Olsson et al. You can’t dismiss modus ponens as “just regex”. That’s the heart of logic!
For many the argument for AI safety being a urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they are not convinced that anything like reasoning is taking place.
I view Olsson et al as nontrivial evidence for the emergence of internal computations resembling reasoning, with increasing scale. That’s profound. If that case is made stronger over time by interpretability (as I expect it to be) the scientific, philosophical and societal impact will be immense.

Daniel Murfet 15 Mar 2023 11:47 UTC
9 points
2
in reply to: gwern’s comment on: GPT-4
By the zero-shot hyperparameter work do you mean https://arxiv.org/abs/2203.03466 “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer”? I’ve been sceptical of NTK-based theory, seems I should update.

Daniel Murfet

Sim­ple ver­sus Short: Higher-or­der de­gen­er­acy and er­ror-correction

4. Goals misgeneralize out of distribution.

Simple versus Short: Higher-order degeneracy and error-correction