Writing down other people’s thoughts is an underrated activity!
MichaelDickens
Yeah the obvious reason to predict more success on LW is that the post is implicitly pessimistic about AI companies’ ability to solve alignment. The reason I expected less success is that the post is making a simplified, arguably naive* argument where I could’ve made many caveats, or said a lot more about the real-world complexities of what I’m proposing, but I left those bits out because ultimately I didn’t think they were important enough. LW users (including me) tend to write comments pointing out those things I didn’t talk about.
In retrospect I suppose it’s unsurprising that this post was well-received on LW. My wrong prediction might just come down to risk-aversion combined with the fact that I’m not very good at predicting which posts LW users will like.
*in the colloquial sense of “not understanding how the world works”, not the mathematician’s sense
(Yes, I know this is an old post.)
One of the great things about LessWrong is you can still comment on old posts, even from decades ago!
To add to this:
I would speculate that there are more than a hundred factorization algorithms that are both more efficient than the general number field sieve (GNFS) and an equal or shorter inferential distance away, but that haven’t been found. If I’m right, then it’s unsurprising that we found GNFS by stumbling around in a high-dimensional search space.
This is the latest of several occasions in which I mentally predicted that a post would be well-received on the EA Forum and poorly received on LessWrong, and then it ended up being the other way around.
I’d expect the US government to demand the weights, GPUs, and all research for national security reasons
I think there’s a pretty good chance this wouldn’t happen (maybe 50/50, but I haven’t really thought about it). The US government is mostly not AGI-pilled. Things might be different a year from now.
Also, even if they wanted to, I think the government would have a hard time putting together an org that’s as good at AI development as OpenAI, Anthropic, or Google.
A related thesis you could have is: “if you’re a frontier AI company and you’re IPO’ing soon, you should put safeguards in place to give yourself the option to shut down the company if things look too dangerous.”
There was news recently about how apparently Elon Musk organized the SpaceX governance in such a way that shareholders aren’t allowed to sue him. If he can pull that off, I bet AI companies can find a way to create similar safety protections, if they’re sufficiently motivated.
A frontier AI company should shut down
I feel like that is not the mechanism in my case? When I’m doing something fun, I’m not secretly experiencing boredom but unaware of it. I’m just not experiencing boredom. Meditation causes me to be bored, rather than revealing the boredom that was already there.
If I put on the “we need empirical feedback from neural nets to make progress on alignment” hat, along with my “prudence” hat, I’m thinking things more like, “okay let’s stop scaling now, and just work really hard on figuring out how exactly capabilities emerged between e.g., GPT-3 and GPT-4. Like, what exactly can we predict about GPT-4 based on GPT-3? Can we break down surprising and abrupt less-scary capabilities into understandable parts, and generalize from that to more-scary capabilities?” Basically, I’m hoping for a bunch more proof of concept that Anthropic is capable of understanding and controlling current systems, before they scale blindly. If they can’t do it now, why should I expect they’ll be able to do it then?
Looking back in 2026, this strikes me as a sad missed opportunity. We’re at a point where the next capabilities jump will probably not get us anything as easily describable as “gain the ability to multiply 3-digit numbers”. Instead it will be something fuzzy and hard to operationalize.
There was a window where we could do empirical work in the genre of “figure out how exactly model capabilities improve with successive generations”, but that work is much harder to do now—it might not even be possible at all.
Harebrained alignment idea: LLMs can’t be trusted to assist with alignment research because it’s too easy to get them to say what you want to hear (e.g. make you think you’ve solved a safety problem when you haven’t). Therefore, AI companies should train a distinct LLM that doesn’t go through RLHF or any other process by which it’s reinforced based on how much people like its responses.
This doesn’t fix the problem that next-token-generators are architecturally better at generating plausible-sounding statements than true statements, but it does help with sycophancy etc.
Another related idea is to tune an LLM really hard toward criticism: make it work as hard as it can to come up with reasons why you’re wrong.
I used to meditate daily, but I stopped because it was a source of suffering. Specifically, it was boring. When my meditation time was coming up, I found myself not wanting to do it; while meditating, I frequently found myself wishing I was doing something else instead. And I never noticed any significant positive effects. I experienced a mild reduction in anxiety, but not enough to outweigh the suffering of the meditation itself.
Yeah, it’s a judgment call as to whether trying to solve alignment empirically is more or less doomed than trying to coordinate to not build ASI. I don’t have a clear argument either way; the best I’ve come up with is a list of heuristic reasons why I believe the “empirical alignment” approach is more doomed, which I wrote here.
I can’t speak for people who actually work on theoretical alignment, but my perspective is:
Yes, developing theory without the ability to empirically test your theories is really hard and does not have a good historical track record.
To do empirical work on aligning ASI, we have to build the thing that kills us, which means we die.
The seeming impossibility of theoretical alignment work isn’t a good argument that we should do empirical work instead. The two options are: we do the thing that’s really hard and probably won’t work, or we do the thing that kills us. I prefer the former.
I should clarify that there is a gap between “Anthropic says they are going to conduct research on how to implement a pause” and “Anthropic is actively conducting research on how to implement a pause”, and I am not confident that the former leads to the latter.
How valuable are weak AI safety regulations?
Clearly the inauthentic behavior detection is not good. I follow a few finance people and most of their popular tweets have someone in the replies with an identical name and profile picture but different handle, who tweets something like “Here’s my SECRET TRICK for 50% annual returns with zero risk!” It is a mystery why Twitter’s algorithm can’t detect these extremely obvious bots.
In my experience, there aren’t many such debates where the correct answer readily leaps to the surface if you look for it. It’s more that the correct answer is obvious once you find it, but if you’re in a poor epistemic environment, the answer is hard to find even if you’re looking for it.
For example, the literal veracity of Christianity is hard to determine if everyone you know is a devout Christian, even if you’re not biased at all by wanting to believe what other people believe.
I sat on an email for two weeks because I didn’t have any coherent thoughts about how to reply. A few days ago I replied to say “I don’t have any coherent thoughts about this.” The other person seemed to appreciate it.
Minimal chance that any AI company shuts down (at least under current circumstances). But I figured it was worth publicly acknowledging that an AI company should shut down, even though I expect they won’t.