Formerly zroe1
Zephaniah Roe
Unrelated but I have always found the following passage from this paper fascinating. Turing lists some counterarguments to his claim that by roughly 2000, machines will be pretty good at the “imitation game.”
(2) The “Heads in the Sand” Objection
“The consequences of machines thinking would be too dreadful. Let us hope and believe that they cannot do so.”
This argument is seldom expressed quite so openly as in the form above. But it affects most of us who think about it at all. We like to believe that Man is in some subtle way superior to the rest of creation. It is best if he can be shown to be necessarily superior, for then there is no danger of him losing his commanding position. The popularity of the theological argument is clearly connected with this feeling. It is likely to be quite strong in intellectual people, since they value the power of thinking more highly than others, and are more inclined to base their belief in the superiority of Man on this power.
I do not think that this argument is sufficiently substantial to require refutation. Consolation would be more appropriate: perhaps this should be sought in the transmigration of souls.
One thing that is interesting here is that we currently are living in a time where many have their heads in the sand. The other interesting thing is “Consolation would be more appropriate: perhaps this should be sought in the transmigration of souls.” Should we take this wording to mean that Turing does actually see the prospect as frightening yet inevitable? Does his solution—probably said with some amount of jest—reference humankind merging with machines?
How sure are you about this?
I say “hopefully” for a reason. I am not under the illusion that everyone who completes any intro curriculum will learn AI safety basics. With that being said, I just skimmed the BlueDot technical AI safety curriculum and it seems fine. I think it was better when I did it a few years ago, but it’s clear that they have optimized to make it more approachable even if it is less comprehensive, and maybe this was the right call; I’m not sure.
Someone told me that the BlueDot team don’t believe in full AI X-risk
I would be pretty surprised if this was true, based on two people I know that have worked at BlueDot (both people who are very impressive + mission-aligned and I would hire if I could). I would want to know more details and in the meantime I don’t want to speculate because my prior is that the BlueDot staff do not feel this way and I generally trust BlueDot to make reasonable decisions.
But I’m leaning towards that the problem is that they never really learned it, rather than that they learned it and forgot.
I hope you are wrong but unfortunately I think you could be right in some/many/most cases. I think what I describe in the original post still happens but the problem may be a lot larger (in which case we may want to rethink how we do intro fellowships).
I am also curious what you think a good solution looks like! If you have ideas I may try to make it happen.
Really enjoyed this post—not only is the research interesting/important but the way that it was communicated is really honest and helpful.
We overall think this evidence supports a “dynamics-focused” approach to control technique development. Ctrl-Z’s picture of the basic dynamics was mostly correct, but a single difference was enough to dramatically change the equilibrium in a new setting, such that naively applying the protocols developed in Ctrl-Z would harm safety. Dynamics-focused control research prioritizes explaining how and why certain control techniques affect safety.
Commenting to signal that I think dynamics-focused control is a really good idea! I’ve been historically pretty skeptical of high-stakes control research but this combined with the quote below would address some share of my concerns. Overall this post was an update for me being more excited about the control agenda.
Prior control work has primarily focused on safety vs usefulness (model capability), and not paid too much attention to cost and latency requirements. We think it would be good for there to be more systematic study of protocols designed for other regimes.
Yes! This is true. I tried to say “dead tree” or “dry tree” to avoid this but reading through it again maybe it is confusing. I think I’ll add a footnote to clarify. Thanks!
meaning there’s a ton of important but un-replicated papers just below that threshold?
Yes! I also think that waiting for something to be maximally load bearing before taking a closer look is bad practice. We want to build up organizational knowledge so we are able to catch things before lots of other research is built upon it.
It’s hard to know what is happening inside of the labs. I doubt what you are describing is occurring but that’s just my personal opinion and at the end of the day, who knows! If it is happening though, then the work is certainly not being made public. So at the very least, the non-lab AI safety community doesn’t get to benefit. I would argue that having people outside of labs having an accurate understanding of AI safety research is important as well.
Additionally, we should not wait for ideas to be implemented at labs! Ideally, if there was a key limitation of a work we would catch it way before that. Counting on a lab to notice that some safety technique doesn’t work seems like a risky strategy. Even if they are good at catching issues, making them do this wastes valuable conscientious, safety-pilled researcher time within labs when they could have been implementing something more likely to work.
I generally am going to avoid critiquing papers without a full replication but one example from a long time ago is Redwood Research’s “High-stakes alignment via adversarial training”. One thing that makes this a non-controversial example is the authors themselves admit that there were aspects of the original writeup which were misleading (see full retrospective here). You can also check out my thoughts on this paper or this post on EM.
I will actually push back on the framing here though. “Telling the full story” is messy and subtle. So providing examples of papers where this would have been useful is difficult to fit in a comment (actually is hard to fit into a blog post or paper). Additionally, showing that a paper does replicate, and is in fact more robust than we expected, is (1) what we hope to see and (2) useful information in its own right. Researchers don’t know what papers are real and being able to see that someone was able to replicate something allows them to make better strategic decisions about their research. Senior level AI safety researchers have told me that work not replicating is a concern: e.g., you can see Buck’s comment here but most of these comments were made in private (but from credible people, e.g., Anthropic employees). So by default, people can’t necessarily assume things will replicate and you are giving useful information either way.
That’s great to hear!! Best of luck on the journey :)
It’s a beautiful line! Not sure if it relates to this essay but I find it highly relevant to our current moment.
I see the goal here is to give an intuitive feeling for how basic gaps in knowledge can emerge. I see this happen all the time in AI safety and as someone who runs an AI safety program, I actually find this example more interesting than the Turing machine example. But that’s a personal thing and although I think this kind of lesson should hold generally, convincing the reader of this isn’t the core thing I’m interested in.
The point I make here is that having a deep understanding of AI safety helped keep me focused in my undergrad. That is, deeply understanding what the problem is and why it is difficult keeps one convinced that they should continue working on the right kinds of things.
Also though: A lot of pretty profound ideas come from pretty simple intuitions about foundational concepts. E.g., my understanding is Alan Turing came up with the idea of a Turing machine by reasoning about how a human computer would calculate something. But then the Turing machine model is a really powerful mathematical tool for all sorts of things. Similar things (where a simple intuition motivated a key discovery) happened with neural networks and possibly special relativity if my understanding of the history is correct. To me, understanding the basics is a prerequisite (or maybe the main prerequisite) for all of this.
Trees are mostly made of air and a generalizable lesson for AI safety
It is a bad sign if Anthropic, the AI Safety company, is unwilling to use a technique with such small drop in capabilities when it increases safety for users.
To me the crux is how much this actually increases safety. The assistant axis paper is super interesting and important in many ways but the mechanistic intervention may not actually decrease the probability of catastrophic misuse. E.g., bio/cyber capabilities may be closely aligned with the assistant axis. I would also be somewhat surprised if aligning a model with the assistant axis would decrease the probability of future, more advanced systems pursuing instrumental goals.
Although it may be net positive to implement this kind of intervention, it’s probably important to have a high bar for which safety techniques are implemented within labs because there is a high infrastructure and logistical cost to applying a technique to every forward pass. If I were leading a lab, I would probably conclude that the work is not x-risk relevant enough to be a priority. Importantly, this could change and I think there should be lots of follow-up work to the assistant axis paper because it’s a really exciting finding.
A question for those who are more tapped into how labs think about things: would it be helpful to demonstrate that the assistant axis work scales to, for example, DeepSeek-V3?
(I’m uncertain if the assistant axis work should be implemented/tested within labs, but using this as a useful example)
and who lack the context to make some of the more incisive critiques.
Yup! This is a core limitation.
too credulous to serve as a bug detection mechanism.
I certainly do expect us to miss plenty of bugs. The hope is that by reimplementing the paper it would be somewhat unlikely to have the exact same bug. But this could happen or we could also have bugs of our own.
I’m planning on writing up something on the limitations of our work for transparency and so readers know how to interpret our findings. But the tl;dr is that replications can help but they aren’t a ground source of truth. Really happy to talk if you have further thoughts on this or anything replications related!
Iterative Finetuning is Mostly Idempotent
E.g., it would be better if people on LW applied something more like typical researcher norms to research outputs.
Are there other examples like this to make LW less “hostile and aversive to AI company employees”? I suspect your view is reasonable but I think the details matter here a lot.
A few of the points in this post relate to making LW less hostile toward AI company employees. I take your post to say something like “you shouldn’t be unnecessarily hostile to lab employees and try to stick to the object level.” Is my interpretation correct or do you mean that there should be a deeper effort to make lab employees feel more welcome (I would be potentially worried about this)?
I think this is hard because there is no way to specify the problem or judge submissions. If one was able to specify what it means to “solve alignment” in a verifiable way, they would already be 95% of the way there.
This is brilliant. Very well done.
There are also other things that need to be done. In a sane world, there would be multiple replications of every AI safety study (I’m working on that).
My other actionable recommendation: Orgs doing safety research should put up bounties for anyone who finds errors in their work, with specific guidelines and a third party judge. If an organization believes a result is true and genuinely moves the needle on x-risk, it would be silly not to offer 25k to anyone who finds a bug in their codebase which qualitatively changes their results. Finding such a bug would be very valuable!!
I broadly agree here.
I will say that I prefer a framing that separates ‘alignment’ agendas from ‘meta-alignment’ agendas. For example, the pragmatic interp agenda is largely motivated by the idea that their techniques will be instrumentally useful for alignment (see here) but it’s not like the GDM interp team have a concrete plan for how they would like to “solve alignment.” So I would consider this—along with a lot of/most of Redwood, Truthful AI, and Anthropic work—to be meta-alignment.
I do worry that some of this could be streetlight effect but there are reasonable arguments for meta-approaches bootstrapping their way to true alignment and there are people who see this as the goal of their work. There has been a kind of silent shift from alignment to meta-alignment but I think that this is subtly different than “Almost nobody is working on alignment.”