The Security Mindset, S-Risk and Publishing Prosaic Alignment Research

Note: I no longer endorse this post as strongly as I did when publishing it. Although not publishing blatantly capability-advancing research is of course beneficial, I now agree with Neel Nanda’s criticism and endorse this position. My primary reason is exactly as stated in his comment: I was simply too new to the field to correctly judge how exfohazardous a given piece of research could be. Additionally, I was hellbent on never advancing capabilities in even the slightest or most indirect form, to an extent that I now believe negatively impacted my progress overall.

I still think the analysis in the post could be useful, but I wanted to include this note at the beginning in case someone else in my position happens to find it.

Introduction

When converging upon useful alignment ideas, it is natural to want to share them, get feedback, run experiments and iterate in the hope of improving them, but doing so hastily can contradict the security mindset. The Waluigi Effect, for example, ended with a rather grim conclusion:

If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe. Moreover, this Semiotic–Simulation Theory has increased my credence in the absurd science-fiction tropes that the AI Alignment community has tended to reject, and thereby increased my credence in s-risks.

Could the same apply to other prosaic alignment techniques? What if they do end up scaling to superintelligence?

It’s easy to internally justify publishing in spite of this. Reassuring self-talk always sounds reasonable at the time, but isn’t always robust upon reflection. While developing new ideas one may tell oneself, “This is really interesting! I think this could have an impact on alignment!”, and while justifying publishing them, “It’s probably not that good an idea anyway; just push through and wait until someone points out the obvious flaw in your plan.” These two stances are completely and detrimentally inconsistent.

Arguments 1 and 3

Many have shared their musings about this before. Andrew Saur does so in his post “The Case Against AI Alignment”, Andrea Miotti explains here why they believe RLHF-esque research is a net-negative for existential safety (see Paul Christiano’s response), and Christiano provides an alternative perspective in his “Thoughts on the Impact of RLHF Research”. Generally, these arguments seem to fall into three different categories (or their inversions):

  1. This research will further capabilities more than it will alignment

  2. This alignment technique if implemented could actually elevate s/​x-risks

  3. This alignment technique if implemented will increase the commercializability of AI, feeding into the capabilities hype cycle, indirectly contributing to capabilities more than alignment.

To clarify, I am speaking here in terms of impact, not necessarily in terms of some static progress metric (e.g. one year of alignment progress is not necessarily equivalent to one year of capabilities progress in terms of impact).

In an ideal world, we could trivially measure potential capabilities/alignment advances and make simple comparisons. Sadly this is not that world, and realistically most decisions are going to be intuition-derived. Worse yet, argument 2 is even murkier than 1 and 3. Responding to it amounts to trying to prove that a solution whose implications you don’t know won’t result in an outcome we can hardly begin to imagine, through means we don’t understand.

In reference to 1 and 3 (which seem to me addressable by similar solutions), Christiano has proposed a simple way to model this dilemma (a rough sketch of how these numbers might combine follows the quoted bullets):

  • There are A times as many researchers who work on existential safety as there are total researchers in AI. I think a very conservative guess is something like A = 1% or 10% depending on how broadly you define it; a more sophisticated version of this argument would focus on more narrow types of safety work.

  • If those researchers don’t worry about capabilities externalities when they choose safety projects, they accelerate capabilities by about B times as much as if they had focused on capabilities directly. I think a plausible guess for this is like B = 10%.

  • The badness of accelerating capabilities by 1 month is C times larger than the goodness of accelerating safety research by 1 month. I think a reasonable conservative guess for this today is like 10. It depends on how much other “good stuff” you think is happening in the world that is similarly important to safety progress on reducing the risk from AI—are there 2 other categories, or 10 other categories, or 100 other categories? This number will go up as the broader world starts responding and adapting to AI more; I think in the past it was much less than 10 because there just wasn’t that much useful preparation happening.
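To see where each number enters, here is a minimal sketch of one way these guesses might be combined. The multiplication rule, the function name and the printout are my own illustration, not something stated in the quoted passage: if safety researchers produce roughly one unit of safety progress while leaking A·B units of capabilities progress as a side effect, and a unit of capabilities progress is C times as bad as a unit of safety progress is good, then the externality costs roughly A·B·C of the safety value.

```python
# Rough sketch of one way to combine Christiano's A, B, C. The combination
# rule below is my reading of the model, not quoted from his comment.

def externality_ratio(a: float, b: float, c: float) -> float:
    """Capabilities externality of safety work, as a fraction of its safety value.

    a: fraction of AI researchers working on existential safety
    b: capabilities acceleration from safety work, relative to working on
       capabilities directly
    c: badness of one month of capabilities progress, in units of the goodness
       of one month of safety progress
    """
    return a * b * c


if __name__ == "__main__":
    for a in (0.01, 0.10):  # conservative vs. broad definition of "safety researcher"
        ratio = externality_ratio(a, b=0.10, c=10.0)
        verdict = "worth it" if ratio < 1 else "not worth it"
        print(f"A={a:.0%}: externality ≈ {ratio:.0%} of safety value -> {verdict}")
```

With the quoted guesses this comes out to roughly 1–10% of the safety value; the point of the sketch is only to make explicit where each factor enters, not to defend the particular numbers.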

Adapting this model to work for single publications instead of the entire field:

  • There will be A times as many researchers working on utilizing the contents of your publication for existential safety as there are researchers using it to enhance AI capabilities

  • B remains the same but applied specifically to the contents of your publication

  • C remains the same but applied specifically to the contents of your publication

This is going to look very different every time it’s performed (e.g. an agent foundations publication is likely to have a larger A value than some agenda that involves turning an LLM into an agent). Reframing as ratios (a sensitivity-check sketch follows this list):

  • (Number of researchers applying contents to alignment) /​ (Number of researchers applying contents to capabilities)

  • (Capabilities progress derived from your publication as alignment work) /​ (Capabilities progress it would have yielded had it been dedicated to capabilities)

  • (Goodness of accelerating alignment by dint of progress derived from your publication) /​ (Badness of accelerating capabilities by dint of progress derived from your publication)
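Given how uncertain each of these ratios is, one way to use the per-publication version is as a sensitivity check rather than a calculation: sample each factor over the range you honestly find plausible and see how often publishing still comes out looking net-positive. Below is a minimal sketch along those lines; the uncertainty ranges, the log-uniform sampling, and the reuse of the a·b·c rule from above are all assumptions I’ve made for illustration.

```python
# Toy sensitivity check for the per-publication version of the model above.
# The ranges, the log-uniform sampling, and the a*b*c combination rule are
# assumptions of this sketch, not part of Christiano's comment or the post.
import math
import random


def sample_loguniform(lo: float, hi: float) -> float:
    """Draw a value log-uniformly between lo and hi."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))


def publish_looks_net_positive(a: float, b: float, c: float) -> bool:
    """True if the estimated capabilities externality is smaller than the alignment value."""
    return a * b * c < 1.0


def sensitivity_check(n_draws: int = 10_000) -> float:
    """Fraction of draws, over honest uncertainty ranges, in which publishing looks net-positive."""
    wins = 0
    for _ in range(n_draws):
        a = sample_loguniform(0.01, 0.5)   # researchers applying it to alignment vs. capabilities
        b = sample_loguniform(0.01, 1.0)   # capabilities leakage relative to direct capabilities work
        c = sample_loguniform(1.0, 100.0)  # badness of a capabilities-month vs. goodness of a safety-month
        wins += publish_looks_net_positive(a, b, c)
    return wins / n_draws


if __name__ == "__main__":
    frac = sensitivity_check()
    print(f"Publishing looks net-positive in {frac:.0%} of draws.")
```

If the answer is anything far from “almost always”, that is exactly the situation described next: the favorable verdict depends on an unknowable factor happening to land in your favor.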

The issue is that hard numbers are essentially impossible to conjure, and intuitions attached to important ideas are rarely honest, let alone correct. So how can we use a model like this to make judgements about the safety of publishing alignment research? If whether sharing your publication favors alignment depends on one of these nigh-unknowable factors turning in your favor, maybe it isn’t safe to share openly. Also, again referring to Yudkowsky’s security mindset literature: analysis by one individual with a robust security mindset is not evidence that something is safe to publish. If you’ve analyzed your research with the framework above, I urge you to think carefully about who to share it with, what feedback you hope to receive from them, and, as painful as it might be, the worst-case scenario for the malicious use of your ideas.

Something along the lines of Conjecture’s infohazard policy seems reasonable to apply to alignment ideas with a high probability of satisfying any of the aforementioned three arguments, and is something I would be interested in drafting.

Argument 2

This failure mode is embodied primarily by this post, which in short argues that partial alignment could inflate s-risk likelihoods; The Waluigi Effect is one example, seeming to imply that this is true for LLMs aligned using RLHF. I assume this is of greater concern for prosaic approaches, and considerably less so for formal ones.

S-risk seems to me a considerably less intuitive concept to think about than x-risk. As I mentioned earlier, I had been modeling AGI ruin in a binary manner, in which I considered ‘the good outcome’ to be one in which we all live happily ever after as immortal citizens of Digitopia, and the bad outcome to be one in which we are all dissolved into some flavor of goo. Perhaps this becomes the case as we drift toward the later years of post-AGI existence, but I now see the terrifying and wonderful spectrum of short-term outcomes, including good ones that involve suffering and bad ones that do not. I had consumed s-risk literature before, but my thinking was so far behind my reading that it took actually applying the messages of that literature to a personal scenario to internalize it.

The root of this unintuitiveness can likely be found somewhere North of “Good Outcomes can still Entail Suffering” but East of “Reduction of Extinction Risk can Result in an Increase in S-Risk (and Vice Versa)”. To overcome it, you do at some point need to quantify the goodness/badness of some increment in extinction risk in a unit that is also applicable to an increment in s-risk. This looks analogous to the capabilities vs. alignment paradigm we currently have, whereby it’s difficult to affect the position of one without affecting the other. Not just for publishing decisions but also for prioritizing research efforts, I think this is a critical piece of thinking that needs to be done, and done with great rigor. If this has been done already, please let me know and I will update this section of the post. I presume answers to this conundrum already exist in studies of axiology, and I have a writeup planned for this soon.
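For concreteness, the shape of comparison I have in mind looks something like the toy calculation below, where every number is a placeholder and the function name is my own: assign the extinction outcome and the s-risk outcome disvalues in one shared unit, then score a decision as the sum of (change in outcome probability) × (disvalue of outcome).

```python
# Toy illustration of scoring a decision against x-risk and s-risk in one shared
# unit of disvalue. All numbers are placeholders; only the structure matters.


def delta_disvalue(d_p_x: float, d_p_s: float,
                   disvalue_x: float = 1.0, disvalue_s: float = 10.0) -> float:
    """Expected change in disvalue from shifts in x-risk and s-risk probabilities.

    d_p_x: change in probability of the extinction outcome
    d_p_s: change in probability of the s-risk outcome
    disvalue_x, disvalue_s: badness of each outcome in the shared unit
    (the 1:10 ratio is an arbitrary placeholder, not a claim)
    """
    return d_p_x * disvalue_x + d_p_s * disvalue_s


# e.g. an intervention that cuts x-risk by 2 points but adds half a point of s-risk:
change = delta_disvalue(d_p_x=-0.02, d_p_s=+0.005)
print(round(change, 3))  # 0.03 under these placeholder weights: the added s-risk dominates
```

The hard part, of course, is everything the placeholders paper over: where the disvalue weights come from is precisely the axiological question above.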

Conclusion

As capabilities progress further and more rapidly than ever before, maintaining a security mindset when publishing potentially x/s-risk-inducing research is critical, and doing so doesn’t necessarily tax overall alignment progress that greatly: rigorous, high-value assessments can be done quickly, but they must be deferred to in order to be effective. Considering the following three arguments before publishing could have powerful long-term implications:

  1. This research will further capabilities more than it will alignment

  2. This alignment technique if implemented could actually elevate s/​x-risks

  3. This alignment technique if implemented will increase the commercializability of AI, feeding into the capabilities hype cycle, indirectly contributing to capabilities more than it will to alignment.