After first learning about transformers, I couldn’t help but wonder why on Earth this works. How can this totally made-up, complicated structure somehow end up learning how to write meaningful text and having a mostly sound model of our world?
(tl;dr: no novel insights here, just me writing down some thoughts I’ve had after/while learning more about neural nets and transformers.)
When I once asked someone more experienced, they essentially told me “nobody really knows, but the closest thing we have to an answer is ‘the blessing of dimensionality’ - with so many dimensions in your loss landscape, you basically don’t run into local minima but the thing keeps improving if you just throw enough data and compute at it”.
I think this makes sense, and my view on how/why/when deep neural networks work is currently something along the lines of:
there’s some (unknown) minimal network size (or maybe rather “minimal network frontier”, as with different architectures you end up with different minimal sizes) for every problem you want to solve (for a certain understanding of the problem and when you consider it solved), so your network needs to be big enough to even be able to solve the problem
the network size & architecture also determines how much training data you need to get anywhere
basically, you try to find network architectures that encode sensible priors about the modality you're working with (priors that are essentially always true) while eliminating a-priori-useless weights from your network; this way, training lets the network quickly learn the important things rather than first having to figure out the priors itself
for text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism
for image recognition, you realize that the prior of any given pixel being relevant to any other given pixel is higher the closer they are, so you end up with something like CNNs, where you start with low-level features and, throughout the layers of the network, allow the network to successively "convert" the raw pixel data into semantic information
in theory, you probably could just use a huge feed forward network (as long as it’s not so huge as to overfit instead of generalizing to anything useful) and it would possibly end up solving problems in similar ways as “smarter” architectures do (but not sure about this), but you would need way more parameters and way more training data to achieve similar results, much of which would be wasted on “low quality parameters” that could just as well be omitted
so, encoding these modality priors into your network architecture spares you probably orders of magnitude of compute compared to naive approaches
while the bitter lesson makes sense, it maybe under-emphasizes the degree to which choosing suitable network architecture + high quality training data matters?
lastly, the question "which problem are you trying to solve" cannot just be answered on a high level with "I want to minimize loss in next-token prediction"; the exact problem the network solves depends strongly on the training data. Loss minimization is a trade-off between all the things you're minimizing, so the higher the amount of rambling, gossip, meaningless binary data and so on in your training data, the more parameters and training time you'll need just for those, and the less capable the network will be of predicting more meaningful tokens.
Related to that last point, I recently worked on a small project where you, as the user, play Pong against an AI. That AI is controlled by a small neural network (on the order of 2 or 3 hidden layers and a few dozen neurons), initialized randomly, so at first it's very easy for the human to win. While you play, though, the game collects your behavior as training data and constantly trains the neural network, which eventually learns to mirror you. So after a few minutes of playing, it plays very similarly to you and becomes much harder to beat.
One thing I noticed while working on this is that the naive approach to training this AI was far from optimal: much of the training data I collected ended up being pretty irrelevant for playing well! E.g., it’s much more important how the paddle moves while the ball is closing in, and almost entirely irrelevant what you do right after hitting the ball. There were several such small insights, leading me to tweak how exactly training data is collected (e.g. sampling it with lower probability while the ball is moving away than when it’s getting closer), which greatly reduced the time it took for the AI to learn, even with the network architecture staying the same.
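For illustration, the sampling tweak can be as simple as something like the following (a sketch with made-up names, not the actual project code):

```javascript
// Sketch: record a gameplay frame as a training sample with higher probability
// while the ball is approaching the player's paddle than while it's moving away.
// All names here are made up for illustration.
function shouldRecordSample(ball, paddleX, pApproaching = 0.9, pReceding = 0.1) {
  const approaching = Math.sign(ball.vx) === Math.sign(paddleX - ball.x);
  const p = approaching ? pApproaching : pReceding;
  return Math.random() < p;
}

// In the game loop (hypothetical):
// if (shouldRecordSample(ball, humanPaddle.x)) {
//   trainingData.push({ input: encodeState(ball, paddles), target: humanPaddle.vy });
// }
```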
Notably, this does not necessarily mean the loss curve dropped more quickly—due to me tweaking the training data, the loss curves before and after refer to quite different things. The same loss value on higher-quality data is much more useful than on noisy or irrelevant data.
There are just so many degrees of freedom in all of this that it seems very likely that, even if there were no hardware advances at all, research would probably be able to come up with faster/cheaper/better-performing models for a long time.
for text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism
If you are trying to convince yourself that a Transformer could work and to make it ‘obvious’ to yourself that you can model sequences usefully that way, it might be a better starting point to begin with Bengio’s simple 2003 LM and MLP-Mixer. Then Transformers may just look like a fancier MLP which happens to implement a complicated way of doing token-mixing inspired by RNNs and heavily tweaked empirically to eke out a bit more performance with various add-ons and doodads.
(AFAIK, no one has written a “You Could Have Invented Transformers”, going from n-grams to Bengio’s LM to MLP-Mixer to RNN to Set Transformer to Vaswani Transformer to a contemporary Transformer, but I think it is doable and useful.)
For people who like guided meditations: there’s a small YouTube channel providing a bunch of secular AI-generated guided meditations of various lengths and topics. More are to come, and the creator (whom I know) is happy about suggestions. Three examples:
I wouldn’t say these meditations are necessarily better or worse than any others, but they’re free and provide some variety. Personally, I avoid apps like Waking Up and Headspace due to both their imho outrageous pricing model and their surprising degree of monotony. Insight Timer is a good alternative, but the quality varies a lot and I keep running into overly spiritual content there. Plus there’s obviously thousands and thousands of guided meditations on YouTube, but there too it’s hit and miss. So personally I’m happy about this extra source of a good-enough-for-me standard.
Also, in case you ever wanted to hear a guided meditation on any particular subject or in any particular style, I guess you can contact the YouTube channel directly, or tell me and I’ll forward your request.
One crucial question in understanding and predicting the learning process, and ultimately the behavior, of modern neural networks is that of the shape of their loss landscapes. What does this extremely high-dimensional landscape look like? Does training generally tend to find minima? Do minima even exist? Is it predictable what type of minima (or regions of lower loss) are found during training? What role does initial randomization play? Are there specific types of basins in the landscape that are qualitatively different from others, that we might care about for safety reasons?
First, let's just briefly think about very high dimensional spaces. One somewhat obvious observation is that they are absolutely vast. With each added dimension, the volume of the available space increases exponentially. Intuitively we tend to think of 3-dimensional spaces, and often apply this visual/spatial intuition to our understanding of loss landscapes. But this can be extremely misleading. Parameter spaces are vast to a degree that our brains can hardly fathom. Take GPT3 for instance. It has 175 billion parameters, or dimensions. Let's assume somewhat arbitrarily that all parameters end up in a range of [-0.5, 0.5], i.e. live in a 175-billion-dimensional unit cube around the origin of that space (since this is not actually the case, the real parameter space is even much, much larger, but bear with me). Even though every single axis only spans a range of 1 – let's interpret this as "1 meter" for the sake of it – the diagonal from one corner of this high-dimensional cube to the opposite corner is ~420 km long. So if, hypothetically, you were sitting in the middle of this high-dimensional unit cube, you could easily touch every single wall with your hand. But nonetheless, all the corners would be more than 200 km away from you.
This may be mind-boggling, but is it relevant? I think it is. Take this realization for instance: if you have two minima in this high-dimensional space, but one is just a tiny bit "flatter" than the other (meaning its second derivatives are overall a bit closer to 0), then the attractor basin of this flatter minimum is vastly larger than that of the other minimum. This is because the flatness implies a larger basin radius, and the volume scales with that radius raised to the power of the number of dimensions. So, at 175 billion dimensions, even a microscopically larger radius means an overwhelmingly larger volume. If, for instance, one minimum's attractor basin has a radius that is just 0.00000001% larger than that of the other minimum, then its volume will be roughly 40 million times larger (if my JavaScript code to calculate this is accurate enough, that is). And this is only for GPT3, which is almost 4 years old by now.
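Something like the following back-of-the-envelope JavaScript reproduces both numbers (just a sketch, not necessarily the exact code referenced; the volume ratio is computed in log space to stay clear of floating-point limits):

```javascript
const dims = 175e9; // GPT3's parameter count, treated as the number of dimensions

// Diagonal of a unit cube in `dims` dimensions, reading each axis as "1 meter":
const diagonalKm = Math.sqrt(dims) / 1000;
console.log(diagonalKm.toFixed(0) + " km"); // ~418 km, i.e. roughly 420 km

// Volume ratio of two balls whose radii differ by a factor of (1 + 1e-10), i.e. 0.00000001%:
const volumeRatio = Math.exp(dims * Math.log1p(1e-10)); // same as (1 + 1e-10) ** dims
console.log(volumeRatio.toExponential(2)); // ~3.98e+7, i.e. roughly 40 million times larger
```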
The parameter space is just ridiculously large, so it becomes really crucial how the search process through it works and where it lands. It may be that somewhere in this vast space, there are indeed attractor basins that correspond to minima that we find extremely undesirable – certain capable optimizers perhaps, that have situational awareness and deceptive tendencies. If they do exist, what could we possibly say about them? Maybe these minima have huge attractor basins that are reliably found eventually (maybe once we switch to a different network architecture, or find some adjustment to gradient descent, or reach a certain model size, or whatever), which would of course be bad news. Or maybe these attractor basins are so vanishingly small that we basically don't have to care about them at all, because all the compute & search capacity of humanity over the next million years would have an almost 0 chance of ever stumbling onto these regions. Maybe they are even so small that they are numerically unstable, and even if your search process through some incredible cosmic coincidence happens to start right in such a basin, the first SGD step would immediately jump out of it due to the limitations of numerical accuracy on the hardware we're using.
So, what can we actually say at this point about the nature of high-dimensional loss landscapes? While reading up on this topic, one thing that constantly came up is the fact that the more dimensions you have, the lower the relative number of minima becomes compared to saddle points. Meaning that whenever the training process appears to slow down and it looks like it found some local minimum, it's overwhelmingly likely that what it actually found is a saddle point, so the training process never halts but keeps moving through parameter space, even if the loss doesn't change that much. Do local minima exist at all? I guess it depends on the function the neural network is learning to approximate. Maybe some loss landscapes exist where the loss can just get asymptotically closer to some infimum (such as 0), without ever reaching it. And probably other loss landscapes exist where you actually have a global minimum, as well as several local ones.
Some people argue that you probably have no minima at all, because with each added dimension it becomes less and less likely that a given critical point is a minimum (not only does the gradient have to be 0 there, but all the eigenvalues of the Hessian also need to be positive). This sounds compelling, but given that the space itself also grows exponentially with each dimension, we also have overwhelmingly more points to choose from. If you e.g. look at n-dimensional Perlin noise, the absolute number of its local minima within an n-dimensional cube of constant side length actually increases with each added dimension. However, the relative number of local minima compared to the available space still decreases, so it becomes harder and harder to find them.
I’ll keep it at that. This is already not much of a “quick” take. Basically, more research is needed, as my literature review on this subject yielded way more questions than answers, and many of the claims people made in their blog posts, articles and sometimes even papers seemed to be more intuitive / common-sensical or generalized from maybe-not-that-easy-to-validly-generalize-from research.
One thing I'm sure about, however, is that almost any explanation of how (stochastic) gradient descent works that uses 3D landscapes for intuitive visualizations is misleading in many ways. Maybe it is the best we have, but imho all such explainers should come with huge asterisks, explaining that the rules in very high dimensional spaces may look much different from our naive "oh look at that nice valley over there, let's walk down to its minimum!" understanding, which happens to work well in three dimensions.
I’d like to point out that for neural networks, isolated critical points (whether minima, maxima, or saddle points) basically do not exist. Instead, it’s valleys and ridges all the way down. So the word “basin” (which suggests the geometry is parabolic) is misleading.
Because critical points are non-isolated, there are more important kinds of “flatness” than having small second derivatives. Neural networks have degenerate loss landscapes: their Hessians have zero-valued eigenvalues, which means there are directions you can walk along that don’t change the loss (or that change the loss by a cubic or higher power rather than a quadratic power). The dominant contribution to how volume scales in the loss landscape comes from the behavior of the loss in those degenerate directions. This is much more significant than the behavior of the quadratic directions. The amount of degeneracy is quantified by singular learning theory’s local learning coefficient (LLC).
In the Bayesian setting, the relationship between geometric degeneracy and inductive biases is well understood through Watanabe’s free energy formula. There’s an inductive bias towards more degenerate parts of parameter space that’s especially strong earlier in the learning process.
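As a concrete toy illustration of that volume-scaling point (just a sketch, not from the comment above): compare a non-degenerate quadratic loss with a degenerate one, and watch how the volume below a loss threshold shrinks as the threshold goes to zero.

```javascript
// Toy example: L1(w) = w1^2 + w2^2 has a non-degenerate minimum at the origin,
// L2(w) = w1^2 * w2^2 has a degenerate one (zero loss along both axes).
// Estimate the fraction of [-1,1]^2 with loss below eps via Monte Carlo:
// the degenerate low-loss volume shrinks much more slowly as eps -> 0.
function lowLossFraction(loss, eps, samples = 1e6) {
  let hits = 0;
  for (let i = 0; i < samples; i++) {
    const w1 = Math.random() * 2 - 1;
    const w2 = Math.random() * 2 - 1;
    if (loss(w1, w2) < eps) hits++;
  }
  return hits / samples;
}

for (const eps of [1e-2, 1e-3, 1e-4]) {
  const nonDegenerate = lowLossFraction((a, b) => a * a + b * b, eps); // shrinks roughly like eps
  const degenerate = lowLossFraction((a, b) => a * a * b * b, eps);    // shrinks roughly like sqrt(eps) * log(1/eps)
  console.log({ eps, nonDegenerate, degenerate });
}
```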
I heard that there are no local minima in high-dimensional spaces because there will almost always be paths to the global minimum.
Could you share this code? I’d like to take a look.
Maybe I accidentally overpromised here :D this code is just an expression, namely 1.0000000001 ** 175000000000, which, as wolframalpha agrees, yields 3.98e7.

One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding "flips" from representing the original token to finally representing the prediction of the next token.
By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, in GPT3 there are 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, and after going through all the layers they eventually all encode the next token, and somewhere in the layers between this shift must happen. But what, upon some reflection, makes much more sense would be something very roughly like, say:
some 1000 dimensions encode the original token
some other 1000 dimensions encode the prediction of the next token
the remaining 10,288 dimensions encode information about all available context (which will start out “empty” and get filled with meaningful information through the layers).
In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there’s the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it’s still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.
Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just “translate” from token space to embedding space and vice versa[1]. This made sense in relation to the initial & naive “embeddings represent tokens” interpretation, but with the updated view as described above, it becomes clear that unembedding is rather an “extraction” of the information content in the embedding that encodes the prediction.
One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that “Zero layer transformers model bigram statistics”. So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly I’m not sure if this is only the case when the transformer is trained with zero layers, or also in, say, GPT3, when during inference you just skip all the layers)
I would guess that transformer-experienced people (unless they disagree with my description—in that case, please elaborate what I’m still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times, where this idea either is not directly explained, or I somehow managed to consistently miss it.
Of course, this is not entirely true to begin with because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word “Good” and then unembed the embedding immediately, you would get a very high probability for “Good” back when in practice (I didn’t verify this yet) you would probably obtain high probabilities for “morning”, “day” etc.
Awkwardly, it depends on whether the model uses tied embeddings (unembed is embed transpose) or has separate embed and unembed matrices. Using tied embedding matrices like this means the model actually does have to do a sort of conversion.
Your discussion seems mostly accurate in the case of having separate embed and unembed, except that I don’t think the initial state is like “1k encode current, 1k encode predictions, rest start empty”. The model can just directly encode predictions for an initial state using the unembed.
There has actually been some work visualizing this process, with a method called the “logit lens”.
The first example that I know of: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
A more thorough analysis: https://arxiv.org/abs/2303.08112
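Roughly, the idea is that you can apply the unembedding directly to the residual stream after any layer and read off a next-token distribution at that point. A minimal sketch (illustrative only; real implementations also apply the final layer norm first, which is omitted here):

```javascript
// Sketch of the "logit lens": read a next-token distribution out of an intermediate
// residual-stream vector by applying the unembedding matrix directly.
// residual: number[dModel]; unembed: number[vocabSize][dModel]
function logitLens(residual, unembed) {
  const logits = unembed.map(row =>
    row.reduce((sum, weight, i) => sum + weight * residual[i], 0)
  );
  const maxLogit = logits.reduce((a, b) => Math.max(a, b), -Infinity); // for numerical stability
  const exps = logits.map(l => Math.exp(l - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total); // softmax: probability per vocabulary token
}
```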
You can learn a per-token bias over all the layers to understand where in the model it stops representing the original embedding (or a linear transformation of it) like in https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases
You could also plot the cos-sims of the resulting biases to see how much it rotates.
Do it! I bet slightly against your prediction.
One super useful feature of Claude that some may not know about:
Claude is pretty good at creating web apps via artifacts
You can run and use these web apps directly in the Claude UI
You can publish and share these artifacts directly with others
As far as I can tell, the above is even available for non-paying users.
Relatedly: browser bookmarklets can be pretty useful little tools to reduce friction for recurring tasks you do in your browser. It may take <5 minutes to let Claude generate such bookmarklets for you.
You can also combine these two things, such as here: https://claude.ai/public/artifacts/9c58fb4a-5fae-48ce-aed3-60355bfd033e
This is a web app built and hosted by Claude which creates a customized browser bookmarklet that provides a simple text-to-speech feature. It works like this:
customize the configuration on the linked page
drag the “Speak Selection” button into your bookmarks bar
from then on, on any website, when you select text and then click the bookmark (or, after having clicked it once, you can also use the defined hotkey instead), the selected text will be read out to you
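The core of such a bookmarklet is just the browser's built-in speech synthesis API; stripped of the configuration options, it might look roughly like this:

```javascript
// Minimal "speak selection" bookmarklet core, using the browser's built-in TTS.
// As a bookmark, the whole snippet goes into the URL field, starting with "javascript:".
javascript:(function () {
  const text = window.getSelection().toString().trim();
  if (!text) return;
  speechSynthesis.cancel(); // stop anything that's already being read out
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = 1.1; // example of a configurable setting
  speechSynthesis.speak(utterance);
})();
```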
Surely there are browser plugins that provide better TTS than this, but consider it a little proof of concept. Also this way it’s free, friction-less, requires no account etc. Claude also claimed that, when using Edge or Safari, higher quality system voices may be available, but I didn’t look into this.
Some other random things that can be done via bookmarklets:
a button cycling through different playback speeds of all videos on the current website, in case you sometimes interact with video players without such a setting in their UI (see the sketch after this list)
if you’re fine with having some API key in your bookmarklet, you can automate all kinds of, say, LLM calls
If you’re using Chrome and have enabled the local Gemini nano AI, you can even use that in your bookmarklets without any API key being involved (haven’t tried this yet)
start & show a 5 minute timer in the corner of the page you’re on
show/hide parts of the page, e.g. comments on a blog, Youtube recommendations
highlight-for-screenshot overlay: enable temporarily drawing on the page to highlight things to then take screenshots; maybe slightly lower friction than having to use a separate paint app for that. Usable here (relevant keys after activating: Enter to leave drawing mode, ESC to close overlay, 1-9 to change marker size).
inline imperial<->metric unit converter
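For example, the playback-speed idea from the first bullet can be a bookmarklet along these lines (a sketch, not Claude-generated):

```javascript
// Bookmarklet sketch: cycle every <video> element on the page through a set of playback speeds.
javascript:(function () {
  const speeds = [1, 1.5, 2, 3];
  document.querySelectorAll("video").forEach(v => {
    const next = (speeds.indexOf(v.playbackRate) + 1) % speeds.length;
    v.playbackRate = speeds[next];
  });
})();
```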
For some of these, a browser plugin or Tampermonkey script or the like may be preferable—but beware fake alternatives. If you just think "I could do X instead" but never actually do it, then creating a bookmarklet may be the better option after all, even if it's not the most elegant solution.
Happy to hear about your use cases!
Cool! I didn't know about bookmarklets. I knew Gemini would host little pages and apps made in canvas, so I played around a bit to see how different AIs handle it.
Gemini is like your Claude example. Here is a 5 min timer bookmarklet
https://g.co/gemini/share/73048c89f2f2
Perplexity Labs made a bookmarklet and a nice html explainer, but sharing is a little less intuitive. There's a tab for "app" and at the bottom of that page a button to share the url. Here is an RNG (the code works but the "drag the button" part doesn't—and I was just looking for proof of concept):
Random Number Generator Bookmarklet—Free Tool
ChatGPT has canvas like Gemini. It should work the same, but in my 15 min of testing the shared page hangs up and the bookmarklet doesn't seem to work. But I suppose it could be that my work PC is breaking it somehow. Anyway, here is an attempted "read mode" for webpages:
ChatGPT—Read Mode Static
Grok's canvas is Grok Studio. Seems like it can only be summoned in chat, like Claude. Doesn't seem like you can share the app. Grok suggested:
To share publicly, host it on a free platform like GitHub Pages, Glitch, or Replit (upload the file and get a public URL).
I can share the chat that generated the bookmarklet though. Also, it doesn’t seem to work but again, proof of concept:
Mute all tabs
https://grok.com/share/c2hhcmQtNA%3D%3D_e0f91d33-7aba-4c8a-942b-db570b049536
Just to see if these bookmarklets were even possible I re-tried in Gemini
-Read mode works: https://g.co/gemini/share/dc55070e0dc4
-RNG, app works, "drag to bookmarks" doesn't: https://g.co/gemini/share/024d865cbbae
-Mute all tabs works: https://g.co/gemini/share/5dba86dee603
A couple of weeks ago, I was surprised to find out that you can create artifacts that call the Claude API. Silly example: Chat app with Claude always responding with capitalized text.
Wow that feels almost cruel! Seems to change the Claude personality substantially?
Claude can also invoke instances of itself using the analysis tool (tell it to look for self.claude).

Some quick thoughts on vibe coding:
it shifts you from a developer role into more of a product manager role
but the developers you manage are a) occasionally stupid/unwise and b) extremely fast and never tired
this makes it relatively addictive, because feedback cycles are much shorter than for a “real” product manager, who often has to wait for weeks to see their wishes turn into software, and you have a strong element of randomness in your rewards, with things sometimes turning out surprisingly well one-shot, but sometimes not at all
It can also lead to laziness, as it's very tempting to get used to "just letting the AI do it", even in not primarily vibe-coded projects, instead of investing one's own brainpower
AI agents tend to never/rarely talk back or tell you that something is a bad idea or doesn’t work well with the current architecture; they just do things as best as currently possible. This form of local optimization quickly runs into walls if not carefully mitigated by you.
Part of the problem is that by default the AI has extremely little context and knows little about the purpose, scope and ambition of your project. So when you tell it "do X", it typically can't tell whether you mean "do X quick and dirty, I just want the results asap" or "lay out a 10-step plan to do X in the most sustainable way possible that allows us to eventually reach points Y and Z in the future". If it gets things wrong in either direction, that tends to be frustrating, but it can't read your mind (yet).
AI agents that are able to run unit tests and end-2-end tests and see compiler errors are so much more useful than their blind counterparts
If you need some particular piece of software but are unsure if current AIs will be able to deliver, it might make sense to write a detailed, self-contained and as-complete-as-possible specification of it, to then throw it at an AI agent whenever a new model (or scaffolding) comes out. Github Copilot with GPT5 was able to do many more things than I would have imagined, with non-trivial but still relatively limited oversight.
I haven't tried yet whether just letting it do its thing, saying only "continue" after each iteration, may be sufficient. Maybe I put more time into guiding it than would actually be necessary.
That being said: writing a self-contained specification that contains your entire idea of something with all the details nailed down such that there is little room for misunderstandings is surprisingly hard. There are probably cases where just writing the software yourself (if you can) takes less time than fully specifying it.
Then again, "writing down a specification" can also happen interview-style using an AI's voice mode, so you can do it while doing chores.
For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly. (This mostly applies to people speaking a non-native language, e.g. people from continental Europe speaking English.)
Some examples that I’ve heard from different people around me over the years:
Saying “rectangel” instead of “rectangle”
Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
Saying something like, uhh, “devil-oupaw” instead of “developer”
Saying “leech” instead of “league”
Saying “immu-table” instead of “immutable”
Saying “cyurrently” instead of “currently”
I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it's pronounced. This happened to me quite a lot[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adopted the correct way of saying it. But still I've seen all these other people stick to their very unusual pronunciations anyway. What's up with that?[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.
Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing “dude” incorrectly, and have apparently done so for all my life, without anyone ever informing me about it, and without me noticing it.
So, as I learned now, “dude” is pronounced “dood” or “dewd”. Whereas I used to say “dyood” (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.
Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said “dood”, and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said “dood” (which, in my defense, didn’t happen all that often in my presence[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.
I never quite realized that practically everyone said “dood” and I was the only “dyood” person.
So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words.
But, admittedly, I still don’t wanna be the one to point it out to them.
And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.
e.g., for some time I thought “biased” was pronounced “bee-ased”. Or that “sesame” was pronounced “see-same”. Whoops. And to this day I have a hard time remembering how “suite” is pronounced.
Of course one part of the explanation is survivorship bias. I’m much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious.
Maybe they were intimidated by my confident “dyood”s I threw left and right.
I use written English much more than spoken English, so I am probably wrong about the pronunciation of many words. I wonder if it would help to have software that would read each sentence I wrote immediately after I finished it (because that's when I still remember how I imagined it to sound).
EDIT: I put the previous paragraph in Google Translate, and luckily it was just as I imagined. But that probably only means that I am already familiar with frequent words, and may make lots of mistakes with rare ones.
Using coding agents gave me a new appreciation for the Jevons paradox, a concept that received a lot of attention earlier this year when DeepSeek R1’s release in January coincided with a sudden drop in Nvidia’s stock price, presumably because the model’s supposed efficiency gains led many traders to expect a decrease in hardware demand. The stock eventually bounced back, though, with the Jevons paradox cited as one of the reasons: it predicts that efficiency gains lead to an increase in hardware demand rather than a decrease.
I recently realized that GitHub Copilot’s agent mode with GPT-5 is way more capable than I would have imagined, and I started using it a lot, kicking off a bunch of small to medium-sized projects. I’d just start with an empty directory, write a projectOutline.md file describing what I ultimately want to achieve, and let the agent take it from there (occasionally making some suggestions for refactorings and more unit + end2end tests, to keep things stable and scalable). This way it would take me something like 5-50 prompts and a few hours of work to reach an MVP or prototype state that would otherwise have taken weeks.
The naive expectation would be that I’d now be much faster with my coding projects and hence would spend less time on coding. But, as the Jevons paradox would predict, the opposite was the case: it just caused me to work on way more projects, many of which I otherwise would never have started, and I spent much more time on this than I would have otherwise (over a given time frame). So even though coding became much faster (I may be wrong, but I’m pretty confident this is true in net dev time despite some contrary evidence, and I’m extremely certain it’s true in calendar time, as my output increased ~30x basically overnight; not because my coding speed was that slow beforehand, but because I had never prioritized coding, since it wasn’t worth doing over other activities), the total time I spent programming increased a lot.
This will probably get old quickly (with the current frontier models), as with most projects I suspect I’ll eventually hit a “wall” where the agents no longer do a great job of further iterative improvements. But either way, it was interesting to experience first-hand how “getting faster at something” caused me to spend much more, rather than less, time on it, as obvious as this effect may be in hindsight.
After first learning about transformers, I couldn’t help but wonder why on Earth this works. How can this totally made-up, complicated structure somehow end up learning to write meaningful text and building a mostly sound model of our world?
(tl;dr: no novel insights here, just me writing down some thoughts I’ve had after/while learning more about neural nets and transformers.)
When I once asked someone more experienced, they essentially told me “nobody really knows, but the closest thing we have to an answer is ‘the blessing of dimensionality’ - with so many dimensions in your loss landscape, you basically don’t run into local minima but the thing keeps improving if you just throw enough data and compute at it”.
I think this makes sense, and my view on how/why/when deep neural networks work is currently something along the lines of:
there’s some (unknown) minimal network size for every problem you want to solve (or maybe rather a “minimal network frontier”, since different architectures come with different minimal sizes, and it also depends on what exactly you count as the problem and when you consider it solved); your network needs to be at least that big to be able to solve the problem at all
the network size & architecture also determine how much training data you need to get anywhere
basically, you try to find network architectures that encode sensible priors about the modality you’re working with, priors that are essentially always true, while also eliminating a-priori-useless weights from your network; this way, training lets the network quickly learn the important things rather than first having to figure out the priors themselves
for text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism (see the sketch after this list)
for image recognition, you realize that the prior probability of any given pixel being relevant to any other pixel is higher the closer they are, so you end up with something like CNNs, where you start by looking at low-level features and, throughout the layers of the network, allow it to successively “convert” the raw pixel data into semantic data
in theory, you could probably just use a huge feed-forward network (as long as it’s not so huge that it overfits instead of generalizing to anything useful) and it would possibly end up solving problems in ways similar to “smarter” architectures (though I’m not sure about this), but you would need way more parameters and way more training data to achieve similar results, much of which would be wasted on “low-quality parameters” that could just as well be omitted
so, encoding these modality priors into your network architecture probably spares you orders of magnitude of compute compared to naive approaches
while the bitter lesson makes sense, maybe it under-emphasizes the degree to which choosing a suitable network architecture and high-quality training data matters?
lastly, the question of which problem you’re trying to solve cannot just be answered at a high level with “I want to minimize loss in next-token prediction”; the exact problem the network solves depends strongly on the training data. Loss minimization is a trade-off between all the things you’re minimizing, so the more rambling, gossip, meaningless binary data and so on there is in your training data, the more parameters and training time you’ll need just for those, and the less capable the network will be of predicting more meaningful tokens.
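To make the “pass information around” prior a bit more concrete, here is a minimal sketch of single-head self-attention in plain NumPy. The shapes and names are my own toy choices rather than any particular library’s API, and real transformers add multiple heads, masking, positional information, residual connections, and end-to-end learned projections on top of this.

```python
# Minimal single-head self-attention: every token's output is a weighted mix
# of all tokens' value vectors, i.e. information can flow between positions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_head) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each token "asks about" each other token
    weights = softmax(scores, axis=-1)          # each row sums to 1: a soft routing of information
    return weights @ V                          # mix value vectors according to those weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): every position now carries information from every other position
```

As a back-of-the-envelope illustration of the parameter-saving point above: a single dense layer mapping a flattened 224×224×3 image to 1,000 hidden units already needs about 150 million weights, while a 3×3 convolution with 64 filters over the same image gets by with 1,792 parameters, precisely because the locality prior is baked into the architecture.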
Related to that last point, I recently worked on a small project where you, as the user, play Pong against an AI. That AI is controlled by a small neural network (something on the order of 2 or 3 hidden layers and a few dozen neurons), initialized randomly, so at first it’s very easy for the human to win. While you play, though, the game collects your behavior as training data and constantly trains the neural network, which eventually learns to mirror you. So after a few minutes of playing, it plays very similarly to the human and becomes much harder to beat.
One thing I noticed while working on this is that the naive approach to training this AI was far from optimal: much of the training data I collected ended up being pretty irrelevant for playing well! E.g., how the paddle moves while the ball is closing in matters a lot, while what you do right after hitting the ball is almost entirely irrelevant. There were several such small insights, leading me to tweak exactly how training data is collected (e.g. sampling it with lower probability while the ball is moving away than when it’s getting closer), which greatly reduced the time it took for the AI to learn, even with the network architecture staying the same.
Notably, this does not necessarily mean the loss curve dropped more quickly: since I changed what data gets collected, the loss curves before and after the tweak refer to quite different things. The same loss on higher-quality data is much more useful than on noisy or irrelevant data.
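For illustration, here is a rough sketch of the kind of direction-dependent sampling described above. The field names, the keep probabilities, and the assumption that the human’s paddle sits on the left are all invented for this example; the actual project may structure this quite differently.

```python
import random

# Invented keep probabilities: frames where the ball approaches the human's
# paddle are much more informative about good play than frames where it recedes.
KEEP_PROB_APPROACHING = 0.9
KEEP_PROB_RECEDING = 0.1

def maybe_record(buffer, state_vector, paddle_action, ball_vx):
    """Possibly store one (game state -> human action) training example.

    ball_vx < 0 means the ball is moving toward the human's paddle,
    which is assumed to sit on the left side of the screen.
    """
    approaching = ball_vx < 0
    keep_prob = KEEP_PROB_APPROACHING if approaching else KEEP_PROB_RECEDING
    if random.random() < keep_prob:
        buffer.append((state_vector, paddle_action))

# Usage: call once per frame from the game loop, e.g.
# maybe_record(training_buffer, current_state, human_paddle_dy, ball_velocity_x)
```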
There are just so many degrees of freedom in all of this that it seems very likely that, even with no hardware advances at all, research alone could keep producing faster, cheaper, better-performing models for a long time.
If you are trying to convince yourself that a Transformer could work and to make it ‘obvious’ to yourself that you can model sequences usefully that way, it might be a better starting point to begin with Bengio’s simple 2003 LM and MLP-Mixer. Then Transformers may just look like a fancier MLP which happens to implement a complicated way of doing token-mixing inspired by RNNs and heavily tweaked empirically to eke out a bit more performance with various add-ons and doodads.
(AFAIK, no one has written a “You Could Have Invented Transformers”, going from n-grams to Bengio’s LM to MLP-Mixer to RNN to Set Transformer to Vaswani Transformer to a contemporary Transformer, but I think it is doable and useful.)
I think you would appreciate this post
For people who like guided meditations: there’s a small YouTube channel providing a bunch of secular AI-generated guided meditations of various lengths and topics. More are to come, and the creator (whom I know) is happy about suggestions. Three examples:
10 minute relaxation
“Feel the AGI”
Nondual awareness
They are also available in podcast form here.
I wouldn’t say these meditations are necessarily better or worse than any others, but they’re free and provide some variety. I personally avoid apps like Waking Up and Headspace due to both their (imho) outrageous pricing model and their surprising degree of monotony. Insight Timer is a good alternative, but the quality varies a lot and I keep running into overly spiritual content there. There are also, obviously, thousands and thousands of guided meditations on YouTube, but there too it’s hit and miss. So I’m happy about this extra source of a good-enough-for-me standard.
Also, in case you ever wanted to hear a guided meditation on any particular subject or in any particular style, I guess you can contact the YouTube channel directly, or tell me and I’ll forward your request.