Thank you for the detailed comment!
> By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.
Yes, that’s exactly correct. I hadn’t thought about the concern that “if we manage to build a sufficiently smart agent with the caring drive, then AGI is already too close.” If any “interesting” caring drive requires capabilities very close to AGI, then I agree that this direction looks like a dead end in light of the race toward AGI. So it’s only viable if an “interesting” and “valuable” caring drive can potentially be found in agents at roughly the current level of capability, which honestly doesn’t sound totally improbable to me.
Also, without some global regulation to stop this damn race, I expect everyone to die soon anyway, and since I’m not in a position to meaningfully affect that, I might as well keep working in directions that only pay off in worlds where we suddenly get more time.
And once we have something like this, I expect a lot of gains in research speed from all the benefits that come with the ability to precisely control and run experiments on artificial NNs.
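As a small illustration of what “precisely control and run experiments” can mean in practice, here’s a minimal sketch (assuming PyTorch; the toy network and the choice of which unit to ablate are my own, purely for illustration) of ablating one hidden unit and measuring how the output changes, the kind of intervention that is trivial on an artificial NN and essentially impossible in a biological brain:

```python
import torch
import torch.nn as nn

# Toy stand-in network; in practice this would be the trained agent
# whose "caring drive" we want to study (a hypothetical setup).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(8, 16)

def ablate_unit(module, inputs, output):
    """Zero out a single hidden unit in the forward pass."""
    patched = output.clone()
    patched[:, 7] = 0.0  # unit index chosen arbitrarily for illustration
    return patched

with torch.no_grad():
    baseline = model(x)

handle = model[1].register_forward_hook(ablate_unit)
with torch.no_grad():
    ablated = model(x)
handle.remove()

# How much did this one unit matter for the network's output?
print((baseline - ablated).abs().max())
```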
> I’m curious why you picked parenting-an-infant rather than helping-a-good-friend as your main example. I feel like parenting-an-infant in humans is a combination of pretty simple behaviors / preferences (e.g. wanting the baby to smile)
Several reasons:
First, I don’t think it’s just a couple of simple heuristics; otherwise I’d expect them to fail horribly in the modern world. And by “caring for the baby” I mean all the actions of the parents until the “baby” is ~25 years old. Those actions usually involve a lot of intricate decisions aimed at something like “success and happiness in the long run, even if it means some crying right now.” It’s hard to do right, and a lot of parents make mistakes, but in most cases that looks like a capability failure, not bad intentions. And those intentions look much more interesting to me than “make the baby smile.”
Second, although I agree that some people have a genuine intrinsic prosocial drive, there are also alternative egoistic “solutions”: a lot of prosocial behavior is instrumentally beneficial even for a totally egoistic agent. The classic example is the iterated prisoner’s dilemma with an unknown number of rounds, where it would be foolish not to at least try to cooperate, even if you care only about your own utility (the toy simulation below illustrates this). The maternal caring drive, on the other hand, looks much less selfish, which I think is a good sign, since we shouldn’t expect ourselves to be of any instrumental value to a superhuman AI.
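To make this concrete, here is a minimal sketch in Python (the payoff matrix is the standard one; the 0.95 continuation probability and the strategies are my own illustrative choices): two purely selfish reciprocators end up cooperating and outscore a pair of unconditional defectors, because nobody knows which round is the last.

```python
import random

# Standard prisoner's dilemma payoffs from the row player's perspective.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(my_moves, their_moves):
    """Cooperate first, then mirror the opponent's previous move."""
    return their_moves[-1] if their_moves else "C"

def always_defect(my_moves, their_moves):
    return "D"

def play(strat_a, strat_b, continue_prob=0.95, seed=0):
    """Iterated PD that ends with probability 1 - continue_prob after each
    round, so neither player can condition on 'this is the last round'."""
    rng = random.Random(seed)
    moves_a, moves_b = [], []
    score_a = score_b = 0
    while True:
        a = strat_a(moves_a, moves_b)
        b = strat_b(moves_b, moves_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        moves_a.append(a)
        moves_b.append(b)
        if rng.random() > continue_prob:
            return score_a, score_b

print(play(tit_for_tat, tit_for_tat))      # mutual cooperation: 3 points/round each
print(play(always_defect, always_defect))  # mutual defection: 1 point/round each
print(play(always_defect, tit_for_tat))    # one 5-point round, then mutual defection
```

With a high enough continuation probability, mutual cooperation (3 per round) dominates mutual defection (1 per round), and defecting against a reciprocator buys a single 5-point round at the cost of cooperation thereafter, which is the sense in which cooperation is instrumentally rational even for a pure egoist.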
Third, I think prosocial behavior would be easiest to recreate in some multi-agent environment, but unlike the maternal caring drive, I expect it to have a lot more prerequisites: the ability to communicate, some form of benefit from being in a society/tribe/group, which usually comes from specialization, and so on (I haven’t thought about this too much, though).
I agree with your Section 8.3.3.1, but I think the arguments there don’t apply so easily here. Since the initial goal of this project is to recreate the “caring drive”, to have something to study and then apply that knowledge to build it from scratch for the actual AGI, it’s not that critical to make some errors at this stage. I think it’s even desirable to observe some failure cases, in order to understand where the failures come from. This should also work for prosocial behavior, as long as it’s not a direct attempt to create an aligned AGI, but just research into the workings of “goals”, “intentions” and “drives”. Still, for the reasons above, I think the maternal drive is the better candidate.
Well, continuing your analogy: to see discrete lines anywhere at all, you need some sort of optical spectrometer, which requires at least some basic optical tools like lenses and prisms. Those tools have to be good enough to actually show the sharp spectral lines, and probably easily available, so that someone smart enough will eventually be able to use them to draw the right conclusions.
At least that’s how it seems to have worked in the past. And I don’t think we should do exactly this with AGI: open-sourcing every single tool and damn model, hoping that someone will figure something out while we build them as fast as we can. But overall, I think building small tools, getting marginal results, and aligning current dumb AIs could produce a non-zero cumulative impact. You can’t produce fundamental breakthroughs completely out of thin air, after all.