[Question] Where are the people building AGI in the non-dumb way?

I am somewhat baffled that I have never run into somebody who is actively working on developing a paradigm of AGI targeted at creating a system that is just inherently transparent to its operators.

If you have a list-sorting algorithm like QuickSort, you can just look at the code and get lots of intuitions about what kinds of properties it has. An AGI would of course be much, much more complex than QuickSort, but I am pretty sure there is a program you could write down that has the same structural property of being interpretable in this way, and that also happens to define an AGI.
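To make the QuickSort point concrete, here is a minimal sketch (in Python, purely illustrative) of what I mean by being able to read properties straight off the code:

```python
# A minimal QuickSort, purely to illustrate "interpretable by inspection".
# Just by reading it you can convince yourself of several properties:
#   - it terminates, because each recursive call gets a strictly shorter list;
#   - the output is a permutation of the input, because elements are only
#     partitioned and re-concatenated, never created or modified;
#   - it only ever *compares* elements, so it works for any orderable type.
def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)


print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```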

This seems especially plausible when you consider that we can build the system out of many components, each with sub-components, such that at the bottom we have small sets of instructions that each do one specific, understandable task. And once you understand such a component, you can probably understand how it behaves inside a larger module that uses it.
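As a toy sketch of this kind of compositional transparency (all names and components here are hypothetical illustrations, not a proposal for how an actual AGI would be structured): each piece is a small function you can understand in isolation, and the top-level module is just their composition, so understanding it reduces to understanding the pieces plus how they are wired together.

```python
# Toy sketch of compositional transparency. Every component is a small,
# pure function that is understandable on its own; the top-level behaviour
# follows from how the pieces are composed.

def predict_outcome(state: dict, action: str) -> dict:
    # Tiny hand-written "world model": an action adds its effect to the state.
    effects = {"heat": {"temperature": +5}, "cool": {"temperature": -5}}
    new_state = dict(state)
    for key, delta in effects.get(action, {}).items():
        new_state[key] = new_state.get(key, 0) + delta
    return new_state

def score_outcome(state: dict, target: dict) -> float:
    # Explicit, inspectable objective: negative distance to the target state.
    return -sum(abs(state.get(k, 0) - v) for k, v in target.items())

def choose_action(state: dict, target: dict, actions: list[str]) -> str:
    # The "planner" is just: simulate each action, pick the best-scoring one.
    return max(actions, key=lambda a: score_outcome(predict_outcome(state, a), target))

print(choose_action({"temperature": 18}, {"temperature": 21}, ["heat", "cool"]))  # heat
```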

Everything interpretability tries to do, we would just get for free in this kind of paradigm. Moreover, we could design the system so that it has additional good properties. Instead of using SGD to find just some set of weights that performs well, which we then try to interpret, we could constrain the kinds of algorithms we design to be as interpretable as possible, so that we are not subjected so strongly to the will of SGD and whatever algorithms it happens to find.

Maybe these people exist (if you are one, please say hello), but I have talked to probably between 20 and 40 people who would describe themselves as doing AI alignment research, and something like this never came up, even remotely.

Basically, this is my current research agenda. I’m not necessarily saying this is definitely the best thing that will save everyone and that everybody should do it, but if zero people do this, it seems pretty strange to me. So I’m wondering whether there are standard arguments I have not come across yet for why this kind of thing is actually really stupid to do.

There are two counter-arguments to this that I’m aware of, which I don’t think justify not working on this by themselves.

  1. This seems like a really hard research program that might just take way too long, such that we’re all already dead by the time we would have built AGI in this way.

  2. This kind of paradigm has the inherent problem that, because the code is interpretable, it is probably easy to tell once you have a really capable algorithm that is basically an AGI. At that point, any person on the team who understands the code well enough could just take it and do some unilateral madness. So you need to find a lot of people who are aligned enough to work on this, which might be extremely difficult.

Though I’m not even sure how much of a problem point 2 is, because it seems to be a problem in any paradigm: no matter what we do, we will probably be able to build unaligned AGI before we know how to align it. Maybe it is especially pronounced in this kind of approach. On the other hand, consider how much effort we need to invest, in any paradigm, to bridge the gap from being able to build an unaligned AGI to being able to build an aligned one; I think that gap might be especially short in this paradigm.

I feel like what MIRI is doing doesn’t quite count. At least from my limited understanding, they are trying to identify problems that are likely to come up in highly intelligent systems and solve them in advance, but not necessarily advancing <interpretable/alignable> capabilities in the way that I am imagining. Though I do, of course, have no idea what they’re doing in the research they do not make public.