An open letter to SERI MATS program organisers

(Independent) alignment researchers need to be exceptionally good at philosophy of science

To be effective, independent AI alignment/safety/x-risk researchers should be unusually competent, relative to researchers in other fields, in philosophy of science, epistemology, and the methodology of research (the latter also brings in some strategic considerations, as I'll illustrate shortly).

The reasons for this are:

(1) The field is pre-paradigmatic.

(2) There are a number of epistemic complications with this type of research, many of which are detailed by Adam Shimi here (also see my comment on that post). I'd add, on top of that, that the very concept of risk is poorly understood and risk science is often conducted poorly (usually because researchers don't realise they are doing risk science).

(3) The object of research relates to ethics (i.e., philosophy, unless one subscribes to ethical naturalism), so researchers should be able to properly synthesise science and philosophy.

(4) Alignment work within AGI labs tends to be more empirical and iterative, which partially ameliorates some of the epistemic challenges from (2), but independent researchers usually can't afford this approach (nor does it make strategic sense for them to pursue it), so for them these challenges remain unaddressed.

At the same time, I'm frustrated when I see so much alignment/safety research on LessWrong that resembles philosophy much more than science, building reasoning on top of intuitions and unprincipled ontologies rather than on more established scientific theories or mechanistic models. This is a dead end and one of the important reasons why this work doesn't stack. Contrasted with the field's particular demand for philosophy of science, this situation seems to harbour a huge opportunity for improvement.

Alignment researchers should think hard about their research methodology and strategy

Speaking of research strategy, I also often see lapses, such as (but not limited to) the following:

  • People who plan to become alignment researchers take ML courses (which are often largely about ML engineering and don't help much unless one plans to go into mechanistic interpretability research), when they should really pay much more attention to cognitive science.

  • People fail to realise that leading AGI labs, such as OpenAI and Conjecture (I bet DeepMind and Anthropic, too, even though they haven't publicly stated this), do not plan to align LLMs "once and forever", but rather plan to use LLMs to produce novel alignment research, which will almost certainly come packaged with novel AI architectures (or variations on existing proposals, such as LeCun's H-JEPA). This leads people to pay far too much attention to LLMs and to their alignment properties. People who are new to the field tend to fall into this trap, but veterans such as LeCun and Pedro Domingos point out that ML paradigms come and go. LeCun has publicly predicted that in five years nobody will be using autoregressive LLMs anymore, and this prediction doesn't seem that crazy to me.

  • Considering the above, and also some other strategic factors, it's questionable whether doing "manual" mechanistic interpretability makes much sense at this point. I see that in the previous SERI MATS cohort there was a debate about this, titled "How much should be invested in interpretability research?", but apparently there is no recording. I think that, strategically, only automated and black-box approaches to interpretability make practical sense to develop now. (Addition: Hoagy's comment reminded me that "manual" mechanistic interpretability is still somewhat important as a means to ground scientific theories of DNNs. But this informs what kind of mechanistic interpretability one should be doing: not "random" work, on the premise that "all problems in mechanistic interpretability should be solved sooner or later" (no, at least not manually), but experiments chosen to test particular predictions of this-or-that scientific theory of DNNs or Transformers.)

  • Independent researchers sometimes seem to think that if they come up with a particularly brilliant idea about alignment, leading AGI labs will adopt it. They don't realise that this won't happen no matter what: partly for social reasons, and partly because these labs already plan to deploy proto-AGIs to produce and/or check research (perhaps paired with humans) at a superhuman level. They already don't trust alignment research produced by "mere humans" that much and would rather wait to check it with their LLM-based proto-AGIs, expecting the risks of that to be manageable and optimal, all things considered, in the current strategic situation. (Note: I may be entirely wrong about this point, but this is my perception.)

What to do about this?

At universities, students and young researchers are expected to learn good philosophy of science primarily "covertly", by "osmosis" from their advisors, professors, and peers, plus the standard "Philosophy" or "Philosophy of Science" courses. This "covert" strategy of upskilling in philosophy of science is itself methodologically dubious and strikes me as ineffective: for example, it hinges too much on how good at philosophy of science the mentor/advisor themselves is, which may vary greatly.

In programs like SERI MATS, the "osmosis" strategy of imparting good philosophy of science becomes even more problematic because the time for interaction and cooperation is limited: at university, a student regularly meets with their advisor and other professors over several years, not several months.

Therefore, direct teaching via seminars seems necessary.

I surveyed the seminar curriculum of the last cohort, and it barely touches on philosophy of science, epistemology, methodology, and strategy of research. Apart from the debate I mentioned above, "How much should be invested in interpretability research?", which touches on strategy, there is a seminar on "Technical Writing", which is about methodology, and that's it.

I think there should be many more seminars about philosophy of science, methodology of science, and strategy of research in the program. Perhaps not at the expense of other seminar content, but by increasing the number of seminar hours. Talks about ethics, in particular about ethical naturalism (such as Oliver Scott Curry's "Morality as Cooperation") or the interaction of ethics and science more generally, also seem essential.

My own views on philosophy of science and methodology of research as applied to AI safety are expressed in the post "A multi-disciplinary view on AI safety research". It's similar in spirit to Dan Hendrycks' "Interdisciplinary AI Safety", based on the "Pragmatic AI Safety" agenda, but there are also some differences, which I haven't published yet and which are beyond the scope of this letter to detail.

It seems that the Santa Fe Institute is organisationally interested in AI and hosts a lot of relevant seminars itself, so I imagine they would be interested in collaborating with SERI MATS on this. They are also good at philosophy of science.