[Question] Has private AGI research made independent safety research ineffective already? What should we do about this?

This post is a variation on “Private alignment research sharing and coordination” by porby. You can consider my question as signal-boosting that post.

AGI research is becoming private. Research at MIRI has been nondisclosed-by-default for more than four years now. OpenAI has stopped publishing details of their work[1], and Hassabis also talked about this here.

Does this mean that independent AI safety research is beginning to suffer from knowledge asymmetry and becoming ineffective?

There are two possible directions of knowledge asymmetry:

  • State-of-the-art scaling results or even novel architectures are not published, and interpretability researchers use outdated models in their work. Hence, their results may not generalise to larger model scales and newer architectures. The counterargument here is that, when it comes to scale, relatively low-hanging interpretability fruit can still be picked even when analysing toy models. When it comes to architectures, transformer details may not matter that much for interpretability (however, Anthropic’s SoLU (Softmax Linear Unit) work seems to be evidence against this statement: a relatively minor architectural change led to significant changes in the interpretability characteristics of the model); and if one of the AGI labs stumbles upon a major “post-transformer” breakthrough, it is going to be an extremely closely guarded secret which will not spread by rumour to AI safety labs, so joining the “knowledge space” of these AI safety labs won’t help independent AI safety researchers.

  • Some safety research has already been done but was not published, either because of its relative infohazardousness or because it references private capability research, as discussed in the previous point. Oblivious to this work, AI safety researchers may reinvent the wheel rather than work on the actual frontier.

Private research space

If the issue described above is real, maybe the community needs some organisational innovation to address it.

Perhaps it could be a program combining an NDA, a “noncompete” agreement, and publishing restrictions, led by an AI safety lab (or a consortium of AI safety labs). Joining this program would not be the same as joining the lab itself in the usual sense (no reporting, no salary), but it would grant access to the private knowledge space of the lab(s). The effect for the lab would be as if it had enlarged its research group; the extra part of this group is not controllable and is not guaranteed to contribute to the lab’s agenda, but it costs the lab little: only the cost of supporting the legal and IT infrastructure of the “private research space”.

There are some concerns that put the usefulness of this system in question, though.

First, there may be too few people willing to join this private research space without joining the lab(s) themselves. Academics, including PhD students, want to publish. That leaves only independent AI safety researchers as eligible, which is a much smaller pool of people. Furthermore, the reasons why people opt for being independent AI safety researchers sometimes correlate with low involvement or low capability (to conduct good research), so the people who do join the private research space may not be able to push the frontier by much.

Second, the marginal usefulness of the work done by independent people in the private research space may be offset by the marginally higher risk of information leakage.

On the other hand, if the private research space is organised not by a single AI safety lab but by a consortium (or organised by a single lab but more labs join this research space later), the first concern above becomes irrelevant and the second concern becomes harder to justify.

Furthermore, the consortium approach has the extra advantage of refining, standardising, and sharing good infohazard management practices.

A couple more points in favour of the idea of a private research space:

  • This might be particularly interesting for US-based AI safety labs with short timelines, for which trying to hire people from overseas is ineffective because the relocation process (H-1B) is inadequately slow. Likewise, researchers with short timelines may consider seeking positions in US-based labs ineffective, for the same reason. (Disclaimer: I personally have very short timelines, 50% chance of AGI by 01.01.2028[2].)

  • Even if not explicitly about knowledge asymmetry, being part of a private communication space and seeing what other people are doing and talking about could permit researchers to choose their own agenda more effectively. This could prevent rework (when two independent groups or individuals start research on the same topic in parallel) and foster collaborations and interdisciplinary convergence of several distinct research agendas.

In a comment on his post, porby confessed that he doesn’t feel that his proposal (which is very close to my proposal laid out above) “fills the hole in coordination well”. I disagree: I feel that a system like this actually addresses the coordination problem well (definitely better than the status quo).

Crucially, porby wrote that “it did not appear to actually solve the problems any specific person I contacted was having”. I think the key change in perspective to make is that we are not in the business of solving the specific problems that researchers have individually (or as small research groups), but rather a collective coordination problem, i. e., the problem of designing the collective, civilisational project of developing non-catastrophic AI, which I discuss next.

R&D strategy for non-catastrophic AI

Discussing the concerns and needs of individual researchers and AI safety labs is ineffectual without considering the bigger system in which we all fit, namely the civilisational project of developing AI that will not catastrophically damage the potential of humanity.

Considering things at this level is very hard because this is a weird type of system in which most parties truly have the same goal (stated above) but are also secretive and mutually compete for a number of reasons.

Also, this civilisational project is strongly influenced by economic and geopolitical power struggles, to the degree that it would perhaps be ineffectual to try to analyse the project separately from economic and geopolitical models, and hence to devise a civilisational R&D strategy for non-catastrophic AI without also devising global economic and geopolitical strategies. However, at this point we realise that there is not enough coordination in the world to agree on and execute such a global strategy. And there would need to be not only government-level coordination but also economic coordination among companies and prestige coordination among individuals.

Coordination problems aside, I personally haven’t thought much about how the global (AGI) strategy (and therefore the shape of the civilisational project) should look.

Still, I have some intuition about the right design of the R&D project that should hold regardless of the strategy being pursued. I feel that carving out capabilities research and safety (alignment) research into different fields, agendas, teams, or entire labs is methodologically wrong and ineffective. Like other people (1, 2, 3), I’ve found that interesting and promising alignment research is often inseparable from capabilities research[3].

I think the implication for AGI labs should be that there are no separate “safety teams”: every team should effectively be a “safety” team, and their results should be judged primarily on whether they increase our confidence in AGI going well, and only then on whether they make a step towards AGI. (But who will listen to me...)

But the implication for AI safety labs is that they should also turn into AGI labs, differing from “traditional” AGI labs only in how they organise teams and evaluate projects and research results.

The third implication, for AI safety researchers, is that I feel that the work they can safely do in the current situation, i. e., without a way to join a private research space led by some AI safety lab(s), is mostly ineffective. This is the point I started this post with.


I’m pretty sure (or, I should say, I hope) that these global AGI R&D strategy and coordination questions have been discussed extensively by the “leading figures” in the field and by AI safety organisations. I don’t see why the outcomes of these discussions themselves should be infohazardous, yet I could find little trace of thinking about these problems on LW. If you know of relevant discussions in old posts or obscure comment threads, please share.


This question is cross-posted on EA Forum.

  1. ^

    In contradiction to this, I’m pretty sure I saw a tweet from Sam Altman in the last few months in which he dunked on the idea of a “closed” approach to developing AI strategy (or alignment?). However, I cannot find this tweet now, so this might be a false memory, or someone else wrote that tweet, or Sam Altman deleted it.

  2. ^

    Also, following Altman, I think that “reaching the AGI bar” will look as imprecise and, importantly, as inconsequential in the moment as passing the Turing test felt around 2021-2023. Some people are convinced that LaMDA and GPT-3.5 have already passed the Turing test, or could be rather trivially tweaked to do so, as of late 2021; others think the Turing test still stands. But universally, passing the Turing test didn’t feel like a big deal to anyone (as it may have felt it would to people in 1950, 2000, or even 2015). So, I expect both these effects to repeat with respect to AGI. Anyway, this doesn’t preclude us from estimating when AGI is (or was) reached “on average”, just as we can probably conclude that the Turing test was passed “on average” in 2022.

  3. ^

    This is less so for interpretability research, though interpretability is more of an enabler than an alignment approach in itself. My judgement of alignment research as “interesting” and “promising” is conditional on my knowledge that other people do capabilities research. In a world where “we” (i. e., all people in the civilisational project) spent 100% of our resources on mechanistic interpretability research of the SoTA AI systems after every breakthrough capabilities idea, until hitting the point of severe diminishing returns (i. e., potentially for years or even a decade), this would not be so, and I would be content with putting aside “capabilities+alignment” research. However, this is a fantasy even if we had near-perfect civilisational coordination systems in place.