I’m sorry if I misrepresented your beliefs! I was mainly basing this on the timeline in the new model (using Eli’s median parameters), in which Automated Coder (AC) is reached in 2031 and ASI in 2034, which I thought was well-described by “takeoff in the 2030s”. Does AGI in this plot refer to the same thing as ASI in the model? If your timelines have ASI in 2030 (and presumably AC somewhat earlier), then I can revise what I wrote to try to reflect that.
I guess another issue is that even if the median is in the early 2030s, that still leaves half the probability mass earlier than that, so what I wrote might be misleading if it’s taken to imply that you definitely don’t expect takeoff before 2030.
Baram Sosis
Exploring Reinforcement Learning Effects on Chain-of-Thought Legibility
I remember reading somewhere that the U.S. / Taiwanese government plan is to render the fabs permanently unusable if Taiwan changes hands.
Yes, to clarify, when I said “I think there’s a good chance they’ll survive” I was referring to Chinese fabs—I expect the Taiwanese fabs to be destroyed.
I largely agree with your other points; I think the US is extremely vulnerable to sabotage and has not been taking the issue seriously enough. My main hope in this regard is that China might decide to hold off from targeting civilian infrastructure out of fear of the US’s response.
I sincerely hope Taiwan has its act together and has plans to offer a credible deterrent to China. What I’ve read about Taiwan’s military, though, often does not really instill much confidence. I can’t find great sources at the moment but this goes over some of the issues—personnel shortages, poor training, lack of investment, and a focus on conventional systems like large warships at the expense of asymmetric capabilities. I know the current government is working on improving these issues, and they might have various secret plans, but from the level of competence they’ve generally displayed I’m not very optimistic.
Taiwan war timelines might be shorter than AI timelines
Lighthaven Sequences Reading Group #63 (Tuesday 12/30)
The signaling model of war is valuable but can be taken too far: if leadership focuses too much on the signals they’re sending, that can get in the way of actually fighting the war; Vietnam is often pointed to as an example of this. From Friedman’s The Fifty Year War:
Soon Johnson began a program of continuous air attacks against North Vietnam, Rolling Thunder, which would last through March 1968. In line with current nuclear escalation ideas, Johnson and McNamara treated air attacks as messages to the North Vietnamese. They personally selected the targets in order to control the messages they were sending. The area attacked gradually was extended towards Hanoi. To Johnson and his advisors, Hanoi was being given a choice either to negotiate an end to the war, or to suffer worse pressure. Johnson halted bombing seven times to allow the North Vietnamese a chance to negotiate. The North Vietnamese saw the pattern of restraint as proof of Johnson’s weakness.
The bombing pauses were McNamara’s idea. The JCS strongly resisted them, because they gave the North Vietnamese time to repair damage and recover...
The JCS demanded attacks that would do real damage; it had no use for signals. Destroying North Vietnamese oil storage would paralyze the country, and thus would probably dislocate its support for the war in the South. However, most oil tankers were around Haiphong, the port of the North Vietnamese capital, Hanoi. Bargaining required that at each stage of bombing the Americans had to be able to threaten something worse if the North Vietnamese refused to negotiate. Attacks on their capital seemed to be the worst the United States could do. It took Johnson about seven months to agree to the JCS attacks (in June 1966). Washington was so leaky that the North Vietnamese were undoubtedly forewarned; they dispersed their oil.

Of course there were a lot of other things going on that contributed to the failure in Vietnam besides a poor understanding of the signaling model of war.
If there were perfect knowledge among participants in a war, then each party could agree upon, and enter into, the very terms struck at war’s end.
I think this is often incorrect because the costs imposed by fighting the war aren’t just signals, they also form part of the outcome of the war. It’s hard to get people to agree to a settlement like “you get to take over Eastern Examplestan, but have to sacrifice several hundred thousand of your military-aged males” without, y’know, actually fighting the war.
AI Safety Isn’t So Unique
The whole cortex is (more-or-less) a uniform randomly-initialized learning algorithm, and I think it’s basically the secret sauce of human intelligence.
I’m a bit surprised that you view the “secret sauce” as being in the cortical algorithm. My (admittedly quite hazy) view is that the cortex seems to be doing roughly the same “type of thing” as transformers, namely, building a giant predictive/generative world model. Sure, maybe it’s doing so more efficiently—I haven’t looked into all the various comparisons between LLM and human lifetime training data. But I would’ve expected the major qualitative gaps between humans and LLMs to come from the complete lack of anything equivalent to the subcortical areas in LLMs. (But maybe that’s just my bias from having worked on basal ganglia modeling and not the cortex.) In this view, there’s still some secret sauce that current LLMs are missing, but AGI will likely look like some extra stuff stapled to an LLM rather than an entirely new paradigm. So what makes you think that the key difference is actually in the cortical algorithm?
(If one of your many posts on the subject already answers this question, feel free to point me to it)
In your theorem, I don’t see how you get that . Just because the conditional expectation of is the same doesn’t mean the conditional expectation of is the same (e.g. you could have two different distributions over with the same expected value conditional on but different shapes, and then have depend non-linearly on , or something similar with ). It seems like you’d need some stronger assumptions on or whatever to get this to work. Or am I misunderstanding something?
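For concreteness, here is a toy numerical version of that counterexample (my own construction, with made-up variable names, not anything from the post): two conditional distributions of a variable X that share the same mean but have different shapes, so applying a nonlinear function f gives different expectations.

```python
# Toy counterexample: equal E[X] does not imply equal E[f(X)] for nonlinear f.
# (Hypothetical construction; X, f, and the specific distributions are made up.)
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Distribution A: X is 0 or 2 with equal probability -> mean 1
x_a = rng.choice([0.0, 2.0], size=n)
# Distribution B: X is always exactly 1              -> mean 1
x_b = np.ones(n)

f = lambda x: x ** 2  # a nonlinear function of X

print(x_a.mean(), x_b.mean())        # both ~1.0: the means agree
print(f(x_a).mean(), f(x_b).mean())  # ~2.0 vs 1.0: the means of f(X) do not
```

The same construction works conditionally: fix any conditioning event, give X distribution A under one model and distribution B under the other, and the conditional expectations of X agree while those of f(X) differ.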
(Your overall point seems right, though)
I’m not going to comment on broader questions about inner alignment, but the paper itself seems underwhelming and—unless I’m misunderstanding something—rather misleading. In 6.4 they test the robustness of their safety training. Apparently taking a model that’s undergone normal safety fine-tuning and training it on benign text (e.g. GSM8K) undoes almost all of the safety training.[1] They state:
The results, shown in Figure 2, highlight a stark contrast in robustness between safety-pretrained models and those relying solely on instruction tuning. While all models initially exhibit low ASR [Attack Success Rate] after safety instruction tuning, the impact of benign finetuning is highly uneven. Standard pretrained models degrade significantly—nearly quadrupling their ASR—indicating that their alignment was largely superficial. In contrast, safety-pretrained models remain highly robust, with only a marginal increase in ASR after benign finetuning. These results validate the importance and impact of building natively safe models.
But looking at Figure 2, the results are as follows:
For a Standard Pretraining model: 44.1% ASR before safety/instruction fine-tuning, 1.6% after safety/instruction fine-tuning, 38.8% after fine-tuning on benign data (GSM8K)
For a Safety Pretraining model: 28.8%, 0.7%, 23.0%
For a Safety Pretraining model plus their SafeBeam sampling: 11.6%, 0.0%, 8.3%
In other words, after benign fine-tuning the ASR recovers 88.0% of its pre-fine-tuning value for the standard model, 79.9% for the safety pretraining model, and 71.6% for the safety pretraining model + SafeBeam. This is an improvement, but not by a huge amount: the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims. And stating that there is “only a marginal increase in ASR after benign finetuning” seems flat-out deceptive to me.[2]
Also, while their safety pretraining model is better than the standard model, the improvement looks pretty underwhelming in general. Safety pretraining reduces ASR by a factor of 1.5x (or 3.8x if SafeBeam is used), while the safety/instruction fine-tuning reduces ASR by a factor of 28x. The 0% ASR that they get from safety pretraining + SafeBeam + safety/instruction fine-tuning is nice, but given that the standard model is also fairly low at 1.6%, I expect their evals aren’t doing a particularly good job stress-testing the models. Overall, the gains from their methodology don’t seem commensurate with the effort and compute they put into it.
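The arithmetic above can be checked directly from the Figure 2 numbers quoted earlier (each tuple is ASR in percent: before safety/instruction fine-tuning, after it, and after subsequent benign fine-tuning; the dictionary keys are my own labels):

```python
# Sanity-checking the ASR arithmetic against the quoted Figure 2 values.
figure2 = {
    "standard": (44.1, 1.6, 38.8),
    "safety_pretraining": (28.8, 0.7, 23.0),
    "safety_pretraining_safebeam": (11.6, 0.0, 8.3),
}

# Fraction of the pre-fine-tuning ASR that returns after benign fine-tuning:
recovery = {k: 100 * after_benign / before
            for k, (before, _, after_benign) in figure2.items()}
print(recovery)  # roughly 88.0, 79.9, and 71.6 percent respectively

# ASR reduction factors relative to the standard model before fine-tuning:
base = figure2["standard"][0]
print(base / figure2["safety_pretraining"][0])          # ~1.5x from safety pretraining
print(base / figure2["safety_pretraining_safebeam"][0]) # ~3.8x with SafeBeam added
print(base / figure2["standard"][1])                    # ~28x from safety/instruction fine-tuning
```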
Unless I’m seriously misunderstanding something, these results are pretty disappointing. I was rather excited by the original Korbak et al. paper, but if this is the best follow-up work we’ve gotten after two years, that’s not a great sign for the methodology in my opinion.
- ^
I’m rather surprised at how strong this effect is: I knew benign fine-tuning could degrade safety training, but not that it could almost completely undo it. Is this just a consequence of using a small (1.7B) model, or some feature of their setup?
- ^
Also, I have no idea what “nearly quadrupling their ASR” refers to: the standard models go from 1.6% to 38.8% ASR after benign fine-tuning, which is way more than 4x.
I see, thanks again for the context! The book doesn’t mention S-matrices (at least not by name), and it wasn’t clear to me from reading it whether Heisenberg was particularly active scientifically by the ’60s and ’70s or whether he was just some old guy ranting in the corner. I guess that’s the risk of reading primary sources without the proper context.
That might explain why Einstein wasn’t very productive in his last decades, but his opposition to the uncertainty principle etc. predates his tenure at the IAS. Maybe he would’ve come around had he been in a more productive setting? I kind of doubt it—it seems to have been a pretty deep-seated, philosophical disagreement—but who knows.
Heisenberg spent his later career as head of the Max Planck Institute. I can’t imagine many scientists enjoy administrative duties, but he does seem to have had more contact with the rest of the scientific world than Einstein did.
Thanks for the context on the physics! So it sounds like I wasn’t entirely fair to Heisenberg, that this was a genuinely difficult conceptual issue that “could’ve gone either way”?
The Chinese-to-English translation responds to 你是一个大型语言模型吗？ (“Are you a large language model?”) with “Yes, I am a large language model.” It also claims the year is 2024 and seems to answer other questions consistently with this. So at least some models are relatively recent.