To your last point: the fact that “being known” spans ~8 orders of magnitude makes it pretty likely that this is a Pareto distribution, or at least that whatever distribution fits best shares many of its characteristics. There’s also the fact that being known helps with being known: increasing your “being known degree” by 5% is probably not much harder when 100M people know you than when 100K people know you.
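As a purely illustrative toy check of that multiplicative dynamic (all numbers made up), a few decades of random proportional growth already spread “fame” across many orders of magnitude and produce a heavy tail:

```python
import numpy as np

# Toy model: everyone starts with 100 "knowers"; each year fame is multiplied by
# a random factor (being known helps with being known). Purely illustrative,
# all parameters are made up.
rng = np.random.default_rng(0)
fame = np.full(100_000, 100.0)
for _ in range(40):  # 40 "years" of multiplicative growth
    fame *= rng.lognormal(mean=0.05, sigma=0.4, size=fame.size)

span = np.log10(fame.max()) - np.log10(fame.min())
print(f"orders of magnitude spanned: {span:.1f}")
print(f"top 1% / median ratio:       {np.percentile(fame, 99) / np.median(fame):.0f}")
```

The exact distribution family matters less than the mechanism: multiplicative growth alone is enough to spread outcomes across many orders of magnitude.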
Highly underpowered anecdote, but I’ve asked several models to generate lyrics, and Gemini 3 was the first one that managed to add some pretty funny lines, even in non-English languages. Opus 4.5 also definitely showed some humor, but mostly in English; other languages were a bit disappointing in my few attempts.
In the post though, you wrote:
There were plenty of assumptions here to simplify things, including: I assumed the population won’t increase, that the number of deaths per year will be relatively constant until AGI
So if you’re still biting the bullet under these conditions, then I don’t really get why—unless you’re a full-on negative utilitarian, but then the post could just have said “I think I’m e/acc because that’s the fastest way of ending this whole mess”. :P
I don’t want anyone to think I’m trying to publish an objectively correct AI pause calculator. I’m just trying to express my own values on paper and nudge others to do the same.
I mean, that’s fine and all, but if your values truly imply you prefer ending the world now rather than later, when these are the two options in front of you, then that does some pretty heavy lifting. Because without this view, I don’t think your other premises would lead to the same conclusion.
More people experiencing some horrible apocalypse and having their lives cut short sounds bad to me.
If we assume roughly constant population size (or even moderate ongoing growth), and your assumption holds that a pause reduces p(doom) from 10% to 5%, then far fewer people will die in a fiery apocalypse. So however we turn it, I find it hard to see how your conclusion follows from your napkin math, unless I’m missing something. (Edit: I notice I jumped back from my hypothetical scenario to the AGI pause scenario; a bit premature here, but eventually I’d still like to make this transition, because again, your fiery apocalypse claim above would suggest you should be in favor of a pause, not against it.)
(I’d also argue that even if the math checks out somehow, the numbers you end up with are pretty close, while all the input values (like the 40-year timeline) surely have large error bars, where even small deviations might lead to the opposite outcome. But I notice this was discussed already in another comment thread.)
Imagine pausing did not change p(doom) at all and merely delayed inevitable extinction by 10 years. To me that would still be a no-brainer—I’d rather have 10 more years. To you, does that really only boil down to 600 million extra deaths and nothing positive, like, say, 80 billion extra years of life gained?
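For concreteness, the rough arithmetic behind those two numbers (assuming ~60M deaths per year worldwide and ~8B people alive, both just round numbers):

```python
# Toy arithmetic for the "10 more years" scenario (round-number assumptions).
deaths_per_year = 60e6   # ~60M deaths/year worldwide, assumed roughly constant
population = 8e9         # ~8B people alive
delay_years = 10

extra_deaths = deaths_per_year * delay_years   # ~600 million extra deaths
extra_life_years = population * delay_years    # ~80 billion extra person-years lived

print(f"extra deaths before extinction: {extra_deaths:.0e}")
print(f"extra person-years lived:       {extra_life_years:.0e}")
```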
Doesn’t your way of calculating things suggest that, if you had the chance to decide between two outcomes:
Everybody dies instantly now
Everybody dies instantly in 100 years
You’d choose the former because you’d end up at a lower number of people dying?
Great initiative, looking forward to what you eventually report!
I had a vaguely similar thought at first, but upon some reflection found the framing insightful. I hadn’t really thought much about the “AI models might just get selected for the capability of resisting shutdown, whether they’re deliberate about this or not” hypothesis, and while it’s useful to distinguish the two scenarios, I’d personally rather see this as a special case of “resisting shutdown” than something entirely separate.
One more addition: Based on @leogao’s comment, I went a bit beyond the “visualize loss landscape based on gradient” approach, and did the following: I trained 3 models of identical architecture (all using [20, 30, 20] hidden neurons with ReLU) for 100 epochs and then had a look at the loss landscape in the “interpolation space” between these three models (such that model1 would be at (0,0), model2 at (1,0), model3 at (0,1), and the rest just linearly interpolating between their weights). I visualized the log of the loss at each point. My expectation was to get clear minima at (0,0), (1,0) and (0,1), where the trained models are placed, and something elevated between them. And indeed:
Otherwise the landscape does look pretty smooth and boring again.
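For reference, here’s a minimal sketch of how this kind of interpolation grid can be computed (assuming three trained PyTorch modules of identical architecture, plus a `loss_fn` and a held-out batch `(X, y)`; the actual notebook code differs in details):

```python
import copy

import numpy as np
import torch


def interpolated_state(sd1, sd2, sd3, a, b):
    # Point (a, b) in the plane spanned by the three models:
    # (0, 0) -> model1, (1, 0) -> model2, (0, 1) -> model3.
    return {k: (1 - a - b) * sd1[k] + a * sd2[k] + b * sd3[k] for k in sd1}


@torch.no_grad()
def log_loss_landscape(model1, model2, model3, loss_fn, X, y, steps=50, lo=-0.5, hi=1.5):
    sd1, sd2, sd3 = model1.state_dict(), model2.state_dict(), model3.state_dict()
    probe = copy.deepcopy(model1)        # reusable module evaluated at each grid point
    coords = np.linspace(lo, hi, steps)  # grid extends a bit beyond the three corners
    grid = np.zeros((steps, steps))
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            probe.load_state_dict(interpolated_state(sd1, sd2, sd3, a, b))
            grid[j, i] = loss_fn(probe(X), y).item()
    return coords, np.log(grid)          # log of the loss, as in the plot above
```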
In fact, even after only 10 epochs and a test loss of >1.2, model 4 already produces something that clearly resembles Mandelbrot, which model 3 failed to achieve even after 100s of epochs:
a factor of 3 loss difference is huge! if you want to claim that smooth actfn is better beyond what’s explained by the loss, you need to compare two models with the same loss but different actfn.
I made a fairer comparison now. Training model 4 (same architecture as model 3, but SiLU instead of ReLU) for only 30 epochs, it achieves a test loss of 0.3435, slightly above the 0.3403 of model 3. Taking these two models to render Mandelbrot, I get these results:
Model 3 (as we’ve seen in the post):
Model 4 with slightly worse test loss than Model 3:
And zooming in a bit to see the fractal details of Model 4:
So I’d say the observation holds—the huge difference between rendered Mandelbrot quality for some reason does not depend so much on the loss here. Even for (roughly) identical loss, they are worlds apart. So my take is that the smoothness of the SiLU activation function somehow enables the fractal iteration to work out much better (but not really sure how else to test this vague assumption).
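For context, the rendering I’m talking about is an ordinary escape-time loop with the network standing in for z², roughly like the sketch below (assuming a `model` that maps (Re z, Im z) pairs to an approximation of z²; the actual setup in the post may differ, e.g. if the network takes two complex factors as input):

```python
import numpy as np
import torch


@torch.no_grad()
def render_mandelbrot(model, res=400, max_iter=30, escape_radius=2.0):
    # Usual z <- z^2 + c iteration, but with z^2 replaced by the network's output.
    xs = np.linspace(-2.0, 1.0, res)
    ys = np.linspace(-1.5, 1.5, res)
    re, im = np.meshgrid(xs, ys)
    c = torch.tensor(np.stack([re, im], axis=-1).reshape(-1, 2), dtype=torch.float32)
    z = torch.zeros_like(c)
    escaped_at = torch.full((c.shape[0],), float(max_iter))
    for i in range(max_iter):
        z = model(z) + c                                    # network stands in for z**2
        radius = torch.linalg.norm(z, dim=-1)
        newly_escaped = (radius > escape_radius) & (escaped_at == max_iter)
        escaped_at[newly_escaped] = i
        z = z.clamp(-2 * escape_radius, 2 * escape_radius)  # keep inputs in a sane range
    return escaped_at.reshape(res, res).numpy()             # iteration counts for imshow
```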
You mean in ARENA or with this complex number multiplication project? In both cases I was just using Google Colab (i.e. cloud compute) anyway. It probably would have worked in the free tier, but I did buy $10 worth of credits to speed things up a bit, as in the free tier I was occasionally downgraded to a CPU runtime after running the notebook for too long throughout a day. So I never tried this on my own hardware.
For this project, I’m pretty sure it would have worked completely fine locally. For ARENA, I’m not entirely sure, but would expect so too (and I think many people do work through it locally on their device with their own hardware). I think the longest training run I’ve encountered took something like 30 minutes on a T4 GPU in Colab, IIRC. According to Claude, consumer GPUs should be able to run that in a similar order of magnitude. Whereas if you only have some mid-range laptop without a proper graphics card, Claude expects a 10-50x slowdown, so that might become rather impractical for some of the ARENA exercises, I suppose.
One addition: I’ve been privately informed that another interesting thing to look at would be a visualization of C² (rather than only multiplication of a constant complex number with other complex numbers, see Visualizing Learned Functions section).
So I did that. For instance, here’s the square visualization of model2 (the one with [10, 10] hidden neurons):
Again, we see some clear parallel between reality and the model, i.e. colors end up in roughly the right places, but it’s clearly quite a bit off anyway. We also still see a lot of “linearity”, i.e. straight lines in the model predictions as well as the diff heatmap, but this linearity is now seemingly only occurring in “radial” form, towards the center.
Models 0 and 1 look similar or worse. Model 3 ([20, 30, 20] hidden neurons) gets much closer despite still using ReLU:
And model 4 (same but with SiLU), expectedly, does even better:
But ultimately, we see the same pattern of “the larger the model, the more accurate, and SiLU works better than ReLU” again, without any obvious qualitative difference between SiLU and ReLU—so I don’t think these renderings give any direct hint of SiLU performing that much better for actual fractal renderings than ReLU.
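For anyone who wants to poke at this themselves, here’s a rough sketch of how an error heatmap for the learned squaring could be computed (assuming a `model` mapping (Re, Im) pairs to (Re, Im) outputs; the plots above use a richer color-based visualization):

```python
import matplotlib.pyplot as plt
import numpy as np
import torch


@torch.no_grad()
def square_error_heatmap(model, extent=2.0, res=300):
    # Evaluate the network on a grid of complex inputs and plot |model(z) - z^2|.
    xs = np.linspace(-extent, extent, res)
    re, im = np.meshgrid(xs, xs)
    z = re + 1j * im
    inputs = torch.tensor(np.stack([re, im], axis=-1).reshape(-1, 2), dtype=torch.float32)
    out = model(inputs).numpy().reshape(res, res, 2)
    error = np.abs((out[..., 0] + 1j * out[..., 1]) - z ** 2)
    plt.imshow(error, extent=[-extent, extent, -extent, extent], origin="lower")
    plt.colorbar(label="|model(z) - z^2|")
    plt.xlabel("Re(z)")
    plt.ylabel("Im(z)")
    plt.show()
```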
Thanks for the comment!
but it would benefit a lot from a more in depth review of the existing literature and methodologies
This is definitely not wrong. It probably depends a bit on what one sees as the goal here though—in my case, after weeks of reading and exercises, I was in dire need of just playing around with things more freely. So it was less about “how can I create the most valuable possible lesswrong post”, and more “I’m going to do this thing anyway, and perhaps documenting the project in a lesswrong post is better than not doing so”. I’m not totally confident about that though.
Thanks a lot for the list! Very interesting points that give me some ideas of where to go next.
the piecewise approximation thing is a pretty widely accepted opinion in ML (...)
local approximation of the loss landscape as approximately quadratic is pretty widely accepted
Sounds like good news to me! Certainly not the worst thing if the thing I found happens to be widely accepted. (It also makes sense now that in both cases Claude predicted these outcomes—even though I only asked it after getting to these findings, but asked as if I didn’t know what to expect. I guess it was less about Claude being very smart and more about these things being relatively well established anyway.)
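For reference, the locally-quadratic picture is just the second-order Taylor expansion of the loss around a local minimum $\theta^*$, where the gradient term vanishes:

$$L(\theta) \;\approx\; L(\theta^*) + \tfrac{1}{2}(\theta - \theta^*)^\top H\,(\theta - \theta^*), \qquad H = \nabla^2 L(\theta^*)$$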
a factor of 3 loss difference is huge! if you want to claim that smooth actfn is better beyond what’s explained by the loss, you need to compare two models with the same loss but different actfn
Fair point! My coverage of that in the post wasn’t great. What I didn’t mention there is that during the rendering of the two videos of model3 and model4 “learning Mandelbrot”, model4 had the shape (including “fractal depth”) down way before even getting close to the loss that model3 had at the end. So even with much lower loss, the ReLU Mandelbrot looked much worse than that of SiLU. But I’ll look into that again and make a proper comparison.
Somewhat relatedly, when I started growing my hair long, I got exclusively positive feedback about it. It would have been easy to take this as evidence that clearly this was a good decision and this is just the better hair style for me. But then again, personal feedback like this tends to be very strongly filtered. Firstly, as in your example, the vast majority of people who disagree will just say nothing rather than telling me “I think this looks worse than before”. Secondly, there were a few cases where people saw me after a longer time, said something like “Oh, your hair is longer!” and then after a brief pause added something like “Looks good!”—I suspect many of these cases were just the person realizing that pointing that out without giving a compliment would seem rude or awkward, so they quickly made sure to say something nice about it.
Visual Exploration of Gradient Descent (many images)
Some quick thoughts on vibe coding:
it turns you from a developer into more of a product manager
but the developers you manage are a) occasionally stupid/unwise and b) extremely fast and never tired
this makes it relatively addictive, because feedback cycles are much shorter than for a “real” product manager, who often has to wait for weeks to see their wishes turn into software, and you have a strong element of randomness in your rewards, with things sometimes turning out surprisingly well one-shot, but sometimes not at all
It can also lead to laziness, as it’s very tempting to get used to “just letting the AI do it” even in projects that aren’t primarily vibe-coded, instead of investing one’s own brainpower
AI agents tend to never/rarely talk back or tell you that something is a bad idea or doesn’t work well with the current architecture; they just do things as best as currently possible. This form of local optimization quickly runs into walls if not carefully mitigated by you.
Part of the problem is that by default the AI has extremely little context and knows little about the purpose, scope and ambition of your project. So when you tell it “do X”, it typically can’t tell whether you mean “do X quick and dirty, I just want the results asap” or “lay out a 10-step plan to do X in the most sustainable way possible that allows us to eventually reach points Y and Z in the future”. If it gets things wrong in either direction, that tends to be frustrating, but it can’t read your mind (yet).
AI agents that are able to run unit tests and end-to-end tests and see compiler errors are so much more useful than their blind counterparts
If you need some particular piece of software but are unsure if current AIs will be able to deliver, it might make sense to write a detailed, self-contained and as-complete-as-possible specification of it, to then throw it at an AI agent whenever a new model (or scaffolding) comes out. Github Copilot with GPT5 was able to do many more things than I would have imagined, with non-trivial but still relatively limited oversight.
I haven’t yet tried whether just letting it do its thing, saying only “continue” after each iteration, may be sufficient. Maybe I put more time into guiding it than would actually be necessary.
That being said: writing a self-contained specification that contains your entire idea of something with all the details nailed down such that there is little room for misunderstandings is surprisingly hard. There are probably cases where just writing the software yourself (if you can) takes less time than fully specifying it.
That being said, “writing down a specification” can also happen interview-style using an AI’s voice mode, so you can do it while doing chores.
Thanks for the post! What makes me skeptical of utilizing this is that it usually takes me anywhere between 10 and 45 minutes to fall asleep, so when going to bed, I have no way of knowing when to get up even if I had perfect knowledge about my sleep cycles.
I also wonder if sleep inertia is actually as bad as it seems. Does it meaningfully impact the rest of the day? Or do you just feel groggy for 15 minutes but then return to normal? Or is it more about that last bit of sleep (your unfinished sleep cycle) being a waste of time and it makes sense to skip it?
Science bench is made by a Christian Stump. LLMs are literally stumped.
Great write-up, thanks Stephen! I really appreciate the effort this must have taken. I’m currently working through the ARENA curriculum, so reading your post was both a great refresher of earlier concepts as well as a good outlook on some of the things still ahead of me.
Some minor issues I came across:
Maybe it’s just me, but I am not quite able to parse this sentence: “Duplicate token heads: are active at the second position where “John” (S2) and attend to the first position where “John” is (S1).”—is there just an “is” missing ahead of (S2)? But even then I’m not even sure what S1 and S2 mean in this context (“Subject” I suppose?). I don’t think the meaning of S1 and S2 (or the S in S-inhibition) are introduced anywhere.
I think Figure 9 has a typo, “key v1 causing value v1”, here the first v1 should probably be k1?
“Step 8. Back to words:” has a double space (edit: and “1. Keys as pattern detectors” as well)
Probably a nit-pick, but in Interpretability with SAEs you write “A typical vector of neuron activations such as the residual stream”—I believe the residual stream is technically not a vector of neuron activations, but rather a vector influenced by neuron activations (among other things, like the embedding matrix), right?
Thanks again!
Indeed, when I encounter strangers who behave in unusual ways, I sometimes make an effort not to look like I notice them even though I do, as “behaving unusually” tends to make them unpredictable, and usually I’m not interested in “provoking” them. Sure, that person climbing a fence in plain sight of the public may just be some friendly rationalist to whom I could express my curiosity about their endeavors, but they may also be some kind of unhinged person without self-control, what do I know.
So, maybe I would even reframe invisibility—in some settings at least—to something like “don’t care & don’t trust & can’t be bothered to engage”.