Is it time to start thinking about what AI Friendliness means?

Some background:

I have followed the writing of Eliezer on AI and AI safety with great interest (and mostly, I agree with his conclusions).

I have done my share of programming.

But, I confess, most of the technical side of AI alignment is beyond my current level of understanding (currently I am reading and trying to understand the sequence on brain-like AGI safety).

I do, however, find the ethical side of AI alignment very interesting.

In 2004, Eliezer Yudkowsky wrote a 38-page paper on Coherent Extrapolated Volition, or CEV: an attempt to create a philosophy of Friendliness, to somewhat formalize our understanding of how we would want a Friendly AI to behave.

In calculating CEV, an AI would predict what an idealized version of us would want, “if we knew more, thought faster, were more the people we wished we were, had grown up farther together”. It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI’s utility function.
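
To make the “determine the desires which converge” step concrete for myself, here is a deliberately naive toy sketch. It is not Yudkowsky’s proposal, just an illustration: the preference representation, the `extrapolate` placeholder, and the agreement threshold are all my own simplifications, and the genuinely hard part (the actual extrapolation) is left as a stub.

```python
from typing import Dict, List

Preferences = Dict[str, float]  # desire -> strength, e.g. in [-1, 1]

def extrapolate(prefs: Preferences) -> Preferences:
    """Placeholder for 'knew more, thought faster, were more the people we
    wished we were': this toy version just returns the preferences unchanged."""
    return dict(prefs)

def coherent_volition(population: List[Preferences],
                      threshold: float = 0.8) -> Preferences:
    """Keep only those desires on which the extrapolated population agrees."""
    extrapolated = [extrapolate(p) for p in population]
    desires = {d for p in extrapolated for d in p}
    coherent: Preferences = {}
    for d in sorted(desires):
        strengths = [p.get(d, 0.0) for p in extrapolated]
        agreement = sum(1 for s in strengths if s > 0) / len(strengths)
        if agreement >= threshold:  # the desire "converges" across people
            coherent[d] = sum(strengths) / len(strengths)
    return coherent

print(coherent_volition([{"flourishing": 0.9, "paperclips": -0.2},
                         {"flourishing": 0.7, "novelty": 0.5}]))
# -> roughly {'flourishing': 0.8}; 'paperclips' and 'novelty' do not converge
```

Even this toy version makes one thing obvious: everything interesting hides inside `extrapolate`, which is exactly the part we do not know how to specify.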

There are many objections to CEV. I have browsed the posts tagged CEV, and in particular enjoyed the list of CEV-tropes, a slightly tongue-in-cheek categorization of common speculations (or possibly misconceptions) about CEV.

So I think it is rather uncontroversial to say that we do not understand Friendliness yet. Not enough to actually say what we would want a Friendly AI to do once it is created and becomes a superintelligence.

Or perhaps we do have a decent idea of what we would want it to do, but not how we would formalize that understanding in a way that doesn’t result in some perverse instantiation of our ethics (as some people argue CEV would, or at least some versions of CEV; CEV is underspecified, and there are many possible ways to implement it).

In the above-mentioned paper on CEV, Eliezer Yudkowsky writes the following warning.

Arguing about Friendliness is easy, fun, and distracting. Without a technical solution to FAI, it doesn’t matter what the would-be designer of a superintelligence wants; those intentions will be irrelevant to the outcome. Arguing over Friendliness content is planning the Victory Party After The Revolution—not just before winning the battle for Friendly AI, but before there is any prospect of the human species putting up a fight before we go down. The goal is not to put up a good fight, but to win, which is much harder. But right now the question is whether the human species can field a non-pathetic force in defense of six billion lives and futures.

While I can see the merits of this warning, I do have some objections to it, and I think that some part of our effort might be well spent talking about Friendliness.

Part of the argument, I feel, is that building an AGI that safely does “what we want it to do” is orthogonal to the question of what we want it to do. That is, we just build a general-purpose super-agent that can fulfill any utility function, then load our utility function into it.

I’m not sure I agree.
After all, Friendliness is not some static function.
As the AI grows, so will its understanding of Friendliness.
That’s a rather unusual behavior for a utility function, isn’t it? To start with a “seed” that grows (and changes?) with improved understanding of the world and humanity’s values.
Perhaps there is some synergy in considering exactly how that would affect our nascent AI.
Perhaps there is some synergy in, from the beginning, considering the exact utility function we would want our AI to have, the exact thing we want it to do, rather than focusing on building an AI that could have all possible utility functions.
Perhaps an improved understanding of Friendliness would improve the rest of our alignment efforts.
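
To gesture at what I mean by a utility function that is not static, here is a minimal sketch, assuming a hypothetical ValueModel that the AI keeps refining. The representation and the update rule are invented purely for illustration and carry no claim about how a real system would do this.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ValueModel:
    """The agent's current, imperfect model of what humans value (hypothetical)."""
    weights: Dict[str, float] = field(default_factory=lambda: {"wellbeing": 1.0})

    def refine(self, evidence: Dict[str, float]) -> None:
        """The 'seed' grows: new evidence about human values updates the model,
        and therefore changes the utility function derived from it."""
        for k, v in evidence.items():
            self.weights[k] = 0.9 * self.weights.get(k, 0.0) + 0.1 * v

def utility(world_state: Dict[str, float], model: ValueModel) -> float:
    """Utility is computed through the current value model rather than being
    a fixed formula baked in at the start."""
    return sum(w * world_state.get(k, 0.0) for k, w in model.weights.items())
```

The point is only structural: the thing being maximized depends on the current model of human values, so improving that model changes what the agent optimizes for. That is precisely the behavior the Friendliness question asks us to get right.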

Even if it were true that we only need to understand Friendliness at the last moment, before FAI is ready to launch:

We don’t know how long it would take to solve the technical problem of AI alignment. But we don’t know how long it would take to solve the ethical problem of AI alignment, either. Why assume the ethical problem is the easy one?

Or perhaps it would take time to convince other humans of the validity of our solution: to, if you will, align the stakeholders of various AI projects, and those who have influence over AI research, with our understanding of Friendliness.

I also have this, perhaps very far-fetched, idea that an improved understanding of Friendliness might be of benefit to humans, even if we completely set aside the possibility of superhuman AI.

After all, if we agree that there is a set of values, a set of behaviors, that we would want a superintelligence acting in humanity’s best interest to have, why wouldn’t I myself choose to hold these values and engage in these behaviors?
If there is a moral philosophy that we agree is, if not universal, then the best approximation to human-value-universality, why wouldn’t humans find it compelling? More compelling, perhaps, than any existing philosophy or value system, if they truly thought about it?
If we would want superhuman minds to be aligned to some optimal implementation of human values, why wouldn’t we want human minds to be aligned to the very same values?

(ok, this part was indeed far-fetched and I can imagine many counter-arguments to it. I apologize for getting ahead of myself)

Nevertheless, I do not suggest that those working on the technical side of AI alignment redirect their efforts to thinking about Friendliness. After all, the technical part of alignment is very difficult and very important.

Personally, as someone who finds the ethical side of AI alignment far more compelling, I intend (as of now, anyway, before I have received feedback on this post) to attempt a series of posts further exploring the concept of Friendliness.

Epistemically, this post is a shot in the dark. Am I confused? Am I wasting my time while I should be doing something completely different? I welcome you to correct my understanding, or to offer counterpoints.